ZeRO++: Extremely Efficient Collective Communication for Large Model Training
- Guanhua Wang
- Heyang Qin
- Sam Ade Jacobs
- Xiaoxia Wu
- Connor Holmes
- Zhewei Yao
- Samyam Rajbhandari
- Olatunji Ruwase
- Feng Yang
- Lei Yang
- Yuxiong He
ICLR 2024
While the Zero Redundancy Optimizer (ZeRO) excels at training large-scale models, it struggles to achieve good throughput in environments with limited bandwidth or small batch sizes, where communication becomes a major bottleneck. Inspired by the principles of fine-grained quantization in machine learning algorithms, we designed ZeRO++, an optimizer that is robust to quantization effects and achieves a significant reduction in communication volume using low-precision quantization techniques. ZeRO++ comprises three communication volume reduction techniques (low-precision all-gather, data remapping, and low-precision gradient averaging) that together reduce communication volume by up to 4x, enabling up to 2.16x better throughput at 384-GPU scale. Our results also show that ZeRO++ can speed up RLHF by 3.3x compared to vanilla ZeRO. To verify the convergence of ZeRO++, we test models of up to 13B parameters for pretraining with 8/6-bit all-gather and up to 30B parameters for finetuning with 4-bit or 2-bit all-gather, and demonstrate accuracy on par with the original ZeRO (i.e., standard training). As a byproduct, the model trained with ZeRO++ is weight-quantized and can be used directly for inference without post-training quantization or quantization-aware training.
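To make the idea of fine-grained quantization concrete, the following is a minimal pure-Python sketch of symmetric block-wise quantization, in which each block of values carries its own scale. It is an illustration only and not the ZeRO++ implementation: the function names, block sizes, and synthetic weights are all assumptions chosen for the example. It compares a per-block scale with a single scale over the whole tensor at 4-bit precision.

```python
import random

def quantize_blocks(values, block_size, bits):
    """Symmetric quantization with one scale per block (hypothetical helper)."""
    qmax = 2 ** (bits - 1) - 1  # largest code magnitude, e.g. 7 for 4 bits
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(round(v / scale) for v in block)
    return codes, scales

def dequantize_blocks(codes, scales, block_size):
    """Map integer codes back to floats using each block's scale."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]

def max_error(original, restored):
    return max(abs(x - y) for x, y in zip(original, restored))

random.seed(0)
# Synthetic weights (an assumption): 64 small values followed by 64 large ones.
weights = ([random.uniform(-0.01, 0.01) for _ in range(64)]
           + [random.uniform(-1.0, 1.0) for _ in range(64)])

# Fine-grained: one scale per 16 values. Coarse: one scale for all 128 values.
fine_codes, fine_scales = quantize_blocks(weights, block_size=16, bits=4)
coarse_codes, coarse_scales = quantize_blocks(weights, block_size=128, bits=4)

fine = dequantize_blocks(fine_codes, fine_scales, block_size=16)
coarse = dequantize_blocks(coarse_codes, coarse_scales, block_size=128)

# Reconstruction error on the small-magnitude half of the tensor.
fine_err_small = max_error(weights[:64], fine[:64])
coarse_err_small = max_error(weights[:64], coarse[:64])
print(f"fine-grained error:   {fine_err_small:.6f}")
print(f"single-scale error:   {coarse_err_small:.6f}")
```

With a single scale, the large values set the quantization step and the small values collapse to zero, whereas per-block scales keep the step proportional to each block's own range. The cost is one extra scale per block. Ignoring that overhead, 4-bit codes occupy a quarter of the space of 16-bit values, which is the kind of saving a low-precision all-gather aims for.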