{"id":658659,"date":"2020-05-19T08:00:19","date_gmt":"2020-05-19T15:00:19","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=658659"},"modified":"2020-06-08T13:24:24","modified_gmt":"2020-06-08T20:24:24","slug":"zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale\/","title":{"rendered":"ZeRO-2 & DeepSpeed: Shattering barriers of deep learning speed & scale"},"content":{"rendered":"
In February, we announced DeepSpeed, an open-source deep learning training optimization library, and ZeRO (Zero Redundancy Optimizer), a novel memory optimization technology in the library that vastly advances large model training by improving scale, speed, cost, and usability. DeepSpeed has enabled researchers to create Turing Natural Language Generation (Turing-NLG), the largest publicly known language model at 17 billion parameters. Since then, we have continued to innovate at a fast pace, pushing the boundaries of speed and scale for deep learning training. Today, we are happy to share our new findings and results as we introduce the improved ZeRO-2 and further developments with DeepSpeed.

All of these exciting new optimizations are now available in our open-source library, DeepSpeed. This work is an important part of Microsoft's new AI at Scale initiative to enable next-generation AI capabilities at scale, and an accompanying AI blog post sheds light on how DeepSpeed has been changing the game for large-scale AI since its release just a few months ago.

### ZeRO-2: Training models with 100 billion parameters up to 10x faster

The Zero Redundancy Optimizer (ZeRO) is a novel memory optimization technology for large-scale distributed deep learning. Unlike existing approaches such as data parallelism, which is efficient but can only support a limited model size, or model parallelism, which can support larger models but requires significant code refactoring and adds communication overhead that limits efficiency, ZeRO fits larger models in memory without code refactoring while remaining highly efficient. It does so by eliminating the memory redundancy inherent in data parallelism while keeping communication overhead to a minimum.

ZeRO removes these memory redundancies by partitioning the three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. This boosts memory efficiency compared with classic data parallelism while retaining its computational granularity and communication efficiency.
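To make the partitioning idea concrete, here is a minimal, self-contained sketch using `torch.distributed`. It is illustrative only, not DeepSpeed's implementation: the toy parameter count, world size, and `gloo` backend are assumptions for the example. Every rank still computes a full local gradient, but after the reduction each rank keeps only the summed gradient for the partition it owns, so only that partition's optimizer states need to live on that rank (a reduce-scatter collective performs this reduction-and-partitioning in a single step).

```python
# Illustrative sketch of ZeRO-style gradient partitioning (not DeepSpeed's code).
# Each data-parallel rank keeps only the reduced gradient for the partition it
# owns, instead of a full copy of the gradient as in classic data parallelism.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

NUM_PARAMS = 8  # toy "model" flattened into 8 scalar parameters


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Every rank computes a full local gradient on its own data shard.
    local_grad = torch.full((NUM_PARAMS,), float(rank + 1))

    # Partition the gradient; rank i is the owner of partition i. Reducing each
    # partition onto its owner is what a reduce-scatter does in one collective.
    partitions = list(local_grad.chunk(world_size))
    for owner, part in enumerate(partitions):
        dist.reduce(part, dst=owner, op=dist.ReduceOp.SUM)

    # Only this partition (and its optimizer states) must persist on this rank.
    my_partition = partitions[rank]
    print(f"rank {rank} owns reduced gradient partition {my_partition.tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 4
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```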
In our February release of DeepSpeed, we included optimizations to reduce optimizer state memory (ZeRO-1). Today, we are releasing ZeRO-2, which extends ZeRO-1 with optimizations to reduce gradient memory, and adds further optimizations that target activation memory and fragmented memory. Compared with ZeRO-1, ZeRO-2 doubles the model size that can be trained with DeepSpeed while significantly improving training efficiency. With ZeRO-2, a 100-billion-parameter model can be trained 10x faster than with the state-of-the-art technology based on model parallelism alone.
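In practice, ZeRO-2 is enabled through DeepSpeed's JSON configuration rather than model code changes. The sketch below writes a minimal, hypothetical config that turns on ZeRO stage 2; the option names follow DeepSpeed's public configuration documentation, but available fields and defaults can vary across DeepSpeed versions, and the batch size and learning rate here are placeholder values.

```python
# Write a minimal, hypothetical ds_config.json that enables ZeRO stage 2.
# Treat this as a sketch: check the DeepSpeed docs for the options available
# in your installed version.
import json

ds_config = {
    "train_batch_size": 256,                    # placeholder value
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},                  # mixed-precision training
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,                             # ZeRO-2: partition optimizer states and gradients
        "contiguous_gradients": True,           # reduce memory fragmentation
        "reduce_scatter": True,                 # reduce-scatter gradients instead of all-reduce
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# Typical launch (train.py is a user script that calls deepspeed.initialize):
#   deepspeed train.py --deepspeed --deepspeed_config ds_config.json
```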
### ZeRO-2 deep dive: Reducing gradients, activation, and fragmented memory

ZeRO-2 optimizes the full spectrum of memory consumption during deep learning training, which includes model states (such as optimizer states and gradients), activation memory, and fragmented memory. Figure 1 shows the key techniques in ZeRO-2, and the details are below.
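To see where the model-state savings come from, the following back-of-the-envelope sketch follows the accounting used in the ZeRO work: with mixed-precision Adam, roughly 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states. The model size and GPU count below are illustrative assumptions, and the figures ignore activation and buffer memory, so they are rough orders of magnitude only.

```python
# Rough per-GPU memory for the three model states (fp16 params, fp16 grads,
# fp32 optimizer states), assuming mixed-precision Adam at ~16 bytes/parameter.
# Illustrative numbers only: activations, buffers, and model parallelism are ignored.
GB = 1024 ** 3

def model_state_bytes_per_gpu(num_params: float, num_gpus: int, mode: str) -> float:
    params_fp16 = 2 * num_params      # bytes for fp16 weights
    grads_fp16 = 2 * num_params       # bytes for fp16 gradients
    optim_fp32 = 12 * num_params      # fp32 weights + Adam momentum + variance
    if mode == "data parallel":       # everything replicated on every GPU
        return params_fp16 + grads_fp16 + optim_fp32
    if mode == "ZeRO-1":              # optimizer states partitioned
        return params_fp16 + grads_fp16 + optim_fp32 / num_gpus
    if mode == "ZeRO-2":              # optimizer states + gradients partitioned
        return params_fp16 + (grads_fp16 + optim_fp32) / num_gpus
    raise ValueError(mode)

for mode in ("data parallel", "ZeRO-1", "ZeRO-2"):
    per_gpu = model_state_bytes_per_gpu(num_params=100e9, num_gpus=400, mode=mode)
    print(f"{mode:>13}: ~{per_gpu / GB:,.0f} GB of model states per GPU")
```

With these assumptions, ZeRO-2 roughly halves the per-GPU model-state footprint relative to ZeRO-1, which is where the doubling of trainable model size comes from; runs at the 100-billion-parameter scale typically combine ZeRO with model parallelism, which shrinks these per-GPU figures further.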