{"id":766675,"date":"2021-08-18T09:59:54","date_gmt":"2021-08-18T16:59:54","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=766675"},"modified":"2021-10-06T16:11:15","modified_gmt":"2021-10-06T23:11:15","slug":"deepspeed-powers-8x-larger-moe-model-training-with-high-performance","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/deepspeed-powers-8x-larger-moe-model-training-with-high-performance\/","title":{"rendered":"DeepSpeed powers 8x larger MoE model training with high performance"},"content":{"rendered":"\n
\"Graphs<\/figure>\n\n\n\n

Today, we are proud to announce DeepSpeed MoE, a high-performance system that supports massive-scale mixture-of-experts (MoE) models as part of the DeepSpeed optimization library. MoE models are an emerging class of sparsely activated models whose compute cost grows sublinearly with their parameter count. For example, the Switch Transformer consists of 1.6 trillion parameters, yet the compute required to train it is approximately equal to that of a 10-billion-parameter dense model. This increase in model size offers tremendous accuracy gains for a constant compute budget.
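To make the sparse-activation idea concrete, the sketch below implements a toy top-1 gated MoE layer in PyTorch. It is purely illustrative and not DeepSpeed's implementation; the class and parameter names are our own. It shows why adding experts multiplies the parameter count while per-token compute stays roughly constant: each token is routed to exactly one expert, no matter how many experts exist.

```python
# Conceptual sketch (not DeepSpeed's implementation): a top-1 gated MoE layer.
# Parameters grow linearly with num_experts, but each token is routed to a
# single expert, so per-token compute stays roughly constant -- the source of
# the sublinear compute-versus-parameters scaling described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    def __init__(self, hidden_size: int, ffn_size: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts)  # router
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, ffn_size),
                nn.GELU(),
                nn.Linear(ffn_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [tokens, hidden_size]
        scores = F.softmax(self.gate(x), dim=-1)   # [tokens, num_experts]
        top_prob, top_idx = scores.max(dim=-1)     # top-1 routing decision
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Only the tokens routed to this expert pass through its FFN.
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out

# Example: 64 experts hold roughly 64x the FFN parameters of a dense layer,
# but a forward pass still evaluates exactly one expert FFN per token.
layer = Top1MoE(hidden_size=1024, ffn_size=4096, num_experts=64)
tokens = torch.randn(8, 1024)
print(layer(tokens).shape)  # torch.Size([8, 1024])
```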

However, supporting these MoE models with trillions of parameters requires a complex combination of multiple forms of parallelism that is simply not available in current MoE systems. DeepSpeed MoE overcomes these challenges through a symphony of multidimensional parallelism and heterogeneous memory technologies, such as the Zero Redundancy Optimizer (ZeRO) and ZeRO-Offload, harmoniously coming together to support massive MoE models, even on limited GPU resources, while achieving efficiency, scalability, and ease of use. It enables 3.5 trillion-parameter models on 512 GPUs, 8x larger than existing work, while achieving 100 teraflops (TFLOPS) per GPU and attaining near-linear scalability with respect to the number of GPUs.
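For a sense of what this looks like in user code, below is a minimal, hedged sketch of replacing a Transformer feed-forward network with DeepSpeed's MoE layer and enabling ZeRO-Offload through the DeepSpeed config. The constructor arguments (hidden_size, expert, num_experts, k), the config keys, and the initialization order shown here are assumptions based on the public DeepSpeed API and may vary across releases; treat it as an outline rather than the exact recipe behind the results above.

```python
# Assumption-based sketch: combining an MoE layer with ZeRO-Offload in DeepSpeed.
# Intended to be launched with the DeepSpeed launcher, for example:
#   deepspeed --num_gpus=8 moe_example.py
import torch
import deepspeed
from deepspeed.moe.layer import MoE

deepspeed.init_distributed()  # set up torch.distributed before building MoE layers

hidden_size = 1024


class MoEBlock(torch.nn.Module):
    """A Transformer-style block whose feed-forward network is an MoE layer."""

    def __init__(self, num_experts: int = 8):
        super().__init__()
        # The expert is an ordinary FFN; DeepSpeed creates num_experts copies
        # and shards them across the expert-parallel group.
        expert = torch.nn.Sequential(
            torch.nn.Linear(hidden_size, 4 * hidden_size),
            torch.nn.GELU(),
            torch.nn.Linear(4 * hidden_size, hidden_size),
        )
        self.moe = MoE(hidden_size=hidden_size,
                       expert=expert,
                       num_experts=num_experts,
                       k=1)  # top-1 gating, Switch Transformer style

    def forward(self, x):
        # The MoE layer returns auxiliary values (such as the load-balancing
        # loss) alongside the output; only the output is used in this sketch.
        out = self.moe(x)
        return out[0] if isinstance(out, tuple) else out


# ZeRO stage 2 with optimizer states offloaded to CPU memory (ZeRO-Offload)
# frees GPU memory that can instead hold expert parameters.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
}

model = MoEBlock()
engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                               model_parameters=model.parameters(),
                                               config=ds_config)
```

The division of labor in this sketch mirrors the description above: expert parallelism spreads the experts across devices, data parallelism with ZeRO partitions the remaining model and optimizer states, and ZeRO-Offload moves optimizer memory to the CPU so that limited GPU memory goes toward the model itself.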
