What is ZeRO-Offload?

ZeRO-Offload is a sharded data parallel method for distributed training in which training state is partitioned across GPUs and offloaded to the CPU. It exploits both CPU memory and CPU compute for offloading, and by working with ZeRO-powered data parallelism it offers a clear path toward efficient scaling on multiple GPUs.
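
In practice, ZeRO-Offload is exposed through DeepSpeed's ZeRO configuration. The sketch below is a minimal, assumed setup that turns on optimizer offloading to the CPU; the stand-in model, batch size, and learning rate are placeholders rather than a recommended recipe, and the script would normally be started with the DeepSpeed launcher.

```python
# Minimal sketch: enabling ZeRO-Offload through DeepSpeed's zero_optimization config.
# The model and hyperparameters below are illustrative placeholders.
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # stand-in for a real model

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                    # ZeRO-2 partitioning of optimizer states and gradients
        "offload_optimizer": {         # keep optimizer state and the update step on the CPU
            "device": "cpu",
            "pin_memory": True,
        },
    },
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```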

How ZeRO-Offload Works

ZeRO-Offload maintains a single copy of the optimizer states in CPU memory regardless of the data parallel degree. Furthermore, it keeps both the aggregate GPU-CPU communication volume and the aggregate CPU computation constant regardless of the data parallel degree. This allows ZeRO-Offload to effectively utilize the linear increase in available CPU compute as the data parallelism degree grows.
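
To make the scaling argument concrete, here is a back-of-the-envelope sketch. It assumes ZeRO-2-style partitioning with fp16 gradients sent to the CPU and fp16 updated parameters returned, which simplifies the actual schedule: each of the N data parallel GPUs owns a 1/N slice of the model state, so per-GPU traffic and per-CPU work shrink as 1/N while the aggregates stay constant.

```python
# Back-of-the-envelope accounting for ZeRO-Offload's GPU<->CPU traffic and CPU work.
# Simplifying assumptions: fp16 values in both directions, ZeRO-2-style partitioning,
# one optimizer step per training step.

def zero_offload_costs(num_params: int, dp_degree: int, bytes_per_value: int = 2):
    shard = num_params // dp_degree                 # parameters owned by each GPU/CPU pair
    per_gpu_traffic = 2 * shard * bytes_per_value   # gradients down + updated params back up
    aggregate_traffic = per_gpu_traffic * dp_degree
    per_cpu_updates = shard                         # optimizer updates done by each CPU process
    aggregate_cpu_updates = per_cpu_updates * dp_degree
    return per_gpu_traffic, aggregate_traffic, aggregate_cpu_updates

for n in (1, 4, 16, 64):
    per_gpu, total, total_cpu = zero_offload_costs(10_000_000_000, n)
    print(f"N={n:3d}  per-GPU traffic={per_gpu / 2**30:6.1f} GiB  "
          f"aggregate traffic={total / 2**30:6.1f} GiB  aggregate CPU updates={total_cpu:,}")
```

The aggregate columns stay flat as N grows while the per-process numbers shrink, which is exactly why the growing pool of CPU cores across the data parallel machines can be used effectively.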

The Benefits of ZeRO-Offload

Using ZeRO-Offload offers several benefits. It enables efficient scaling across multiple GPUs, and by keeping the optimizer states (and the optimizer update computation) on the CPU, it frees GPU memory that can instead hold larger models or batch sizes. The result is more efficient training with a much smaller GPU memory footprint.
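
To get a rough sense of the savings: with mixed-precision Adam, the optimizer states (fp32 master parameters, momentum, and variance) take about 12 bytes per model parameter, and ZeRO-Offload keeps them in CPU memory. The estimate below is a hedged sketch under that assumption and ignores activations, gradients, and framework overheads.

```python
# Rough estimate of GPU memory freed by offloading Adam optimizer states to the CPU.
# Assumption: mixed-precision training, so optimizer state is an fp32 master copy,
# fp32 momentum, and fp32 variance = 12 bytes per parameter. Activations and
# gradients are ignored in this sketch.

BYTES_PER_PARAM_OPTIMIZER = 4 + 4 + 4  # fp32 master copy + momentum + variance

def gpu_memory_freed_gib(num_params: int) -> float:
    return num_params * BYTES_PER_PARAM_OPTIMIZER / 2**30

for billions in (1.5, 10, 13):
    params = int(billions * 1e9)
    print(f"{billions:>5} B params -> ~{gpu_memory_freed_gib(params):6.1f} GiB of optimizer "
          "state held in CPU memory instead of GPU memory")
```

Freeing on the order of tens of gibibytes per GPU is what allows the remaining device memory to hold the model weights and activations of models that would not otherwise fit.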

The symbiosis between ZeRO-Offload and ZeRO-powered data parallelism allows both CPU memory and CPU compute to be used effectively, giving researchers and data scientists an efficient and scalable way to train large models.
