{"id":783490,"date":"2021-10-11T06:00:00","date_gmt":"2021-10-11T13:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=783490"},"modified":"2021-10-11T05:52:10","modified_gmt":"2021-10-11T12:52:10","slug":"using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model\/","title":{"rendered":"Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World\u2019s Largest and Most Powerful Generative Language Model"},"content":{"rendered":"\n
We are excited to introduce the DeepSpeed- and Megatron-powered Megatron-Turing Natural Language Generation model (MT-NLG), the largest and most powerful monolithic transformer language model trained to date, with 530 billion parameters. It is the result of a research collaboration between Microsoft and NVIDIA to further parallelize and optimize the training of very large AI models.<\/p>\n\n\n\n
As the successor to Turing NLG 17B and Megatron-LM, MT-NLG has 3x as many parameters as the previously largest model of this type and demonstrates unmatched accuracy in a broad set of natural language tasks such as completion prediction, reading comprehension, commonsense reasoning, natural language inference, and word sense disambiguation.<\/p>\n\n\n\n The 105-layer, transformer-based MT-NLG improved upon the prior state-of-the-art models in zero-, one-, and few-shot settings and set the new standard for large-scale language models in both model scale and quality.<\/p>\n\n\n\n Transformer-based language models in natural language processing (NLP) have driven rapid progress in recent years, fueled by computation at scale, large datasets, and advanced algorithms and software to train these models.<\/p>\n\n\n\n Language models with large numbers of parameters, more data, and more training time acquire a richer, more nuanced understanding of language. As a result, they generalize well as effective zero- or few-shot learners, with high accuracy on many NLP tasks and datasets. Exciting downstream applications include summarization, automatic dialogue generation, translation, semantic search, and code autocompletion. It\u2019s no surprise that the number of parameters in state-of-the-art NLP models has grown at an exponential rate (Figure 1).<\/p>\n\n\n\n Training such models, however, is challenging for two main reasons: it is no longer possible to fit the parameters of these models in the memory of even the largest GPU, and the large number of compute operations required can result in unrealistically long training times without special attention to optimizing the algorithms, software, and hardware stack together.<\/p>\n\n\n\n Training MT-NLG was made feasible by numerous innovations and breakthroughs along all AI axes. For example, working closely together, NVIDIA and Microsoft achieved unprecedented training efficiency by converging a state-of-the-art GPU-accelerated training infrastructure with a cutting-edge distributed learning software stack. 
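To make the memory challenge concrete, a rough back-of-the-envelope sketch shows why even a single copy of the model's training state cannot fit on one GPU. The ~16 bytes per parameter used here (fp16 weights and gradients plus fp32 master weights and Adam moments) is a common rule of thumb for mixed-precision Adam training, not a figure from this post:

```python
# Back-of-envelope estimate of the training-state footprint of a
# 530B-parameter model, excluding activations.

PARAMS = 530e9          # MT-NLG parameter count
BYTES_PER_PARAM = 16    # 2 (fp16 weights) + 2 (fp16 grads) + 4 + 4 + 4 (fp32 states)
GPU_MEMORY_GB = 80      # NVIDIA A100 80GB

state_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"training state: ~{state_gb / 1e3:.1f} TB")
print(f"A100 80GB GPUs needed just to hold it: ~{state_gb / GPU_MEMORY_GB:.0f}")
```

Roughly 8.5 TB of state, or on the order of a hundred A100s just to hold the model, before any memory is spent on activations, which is why memory-efficient parallelism is a prerequisite at this scale.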
We built high-quality, natural language training corpora with hundreds of billions of tokens, and co-developed training recipes to improve optimization efficiency and stability.<\/p>\n\n\n\n In this post, we elaborate on each aspect of the training and describe our methods as well as our results.<\/p>\n\n\n\n Powered by NVIDIA A100 Tensor Core GPUs and HDR InfiniBand networking, state-of-the-art supercomputing clusters such as the NVIDIA Selene and Microsoft Azure NDv4 have enough compute power to train models with trillions of parameters within a reasonable timeframe. However, achieving the full potential of these supercomputers requires parallelism across thousands of GPUs that is efficient and scalable in both memory and compute.<\/p>\n\n\n\n In isolation, existing parallelism strategies such as data, pipeline, or tensor-slicing parallelism have trade-offs in memory and compute efficiency and cannot be used to train models at this scale.<\/p>\n\n\n\n Through a collaboration between NVIDIA Megatron-LM and Microsoft DeepSpeed, we created an efficient and scalable 3D parallel system that combines data, pipeline, and tensor-slicing-based parallelism to address these challenges.<\/p>\n\n\n\n By combining tensor-slicing and pipeline parallelism, we can operate each within the regime where it is most effective. More specifically, the system uses tensor-slicing from Megatron-LM to scale the model within a node and pipeline parallelism from DeepSpeed to scale the model across nodes.<\/p>\n\n\n\n For example, for the 530-billion-parameter model, each model replica spans 280 NVIDIA A100 GPUs, with 8-way tensor-slicing within a node and 35-way pipeline parallelism across nodes. 
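The replica layout just described can be sketched with a small helper that splits a GPU count into tensor-, pipeline-, and data-parallel degrees. This is illustrative only, not the actual DeepSpeed or Megatron-LM API:

```python
# Illustrative sketch of the 3D-parallel layout described above:
# 8-way tensor-slicing within a DGX A100 node, 35-way pipeline
# parallelism across nodes, and data parallelism on top.

TENSOR_PARALLEL = 8      # GPUs per node sharing tensor slices
PIPELINE_PARALLEL = 35   # pipeline stages, one node per stage

def layout(total_gpus: int) -> dict:
    """Split total_gpus into (tensor, pipeline, data) parallel degrees."""
    gpus_per_replica = TENSOR_PARALLEL * PIPELINE_PARALLEL  # 280 GPUs
    if total_gpus % gpus_per_replica:
        raise ValueError("GPU count must be a multiple of one model replica")
    return {
        "tensor": TENSOR_PARALLEL,
        "pipeline": PIPELINE_PARALLEL,
        "data": total_gpus // gpus_per_replica,  # number of model replicas
    }

# 280 DGX A100 servers x 8 GPUs each = 2240 GPUs -> 8 data-parallel replicas
print(layout(280 * 8))
```

With 8-way tensor-slicing times 35 pipeline stages, one replica occupies exactly the 280 GPUs (35 DGX A100 nodes) mentioned above, and every additional 280 GPUs adds one data-parallel replica.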
We then use data parallelism from DeepSpeed to scale out further to thousands of GPUs.<\/p>\n\n\n\n Model training is done with mixed precision on the NVIDIA DGX SuperPOD-based Selene supercomputer, powered by 560 DGX A100 servers networked with HDR InfiniBand in a full fat-tree configuration. Each DGX A100 has eight NVIDIA A100 80GB Tensor Core GPUs, fully connected to each other by NVLink and NVSwitch. A similar reference architecture is used by Microsoft for Azure NDv4 cloud supercomputers.<\/p>\n\n\n\n We measured the end-to-end throughput of our system for the 530-billion-parameter model with batch size 1920 on 280, 350, and 420 DGX A100 servers on Selene. We observed iteration times of 60.1, 50.2, and 44.4 seconds, respectively. These correspond to 126, 121, and 113 teraFLOP\/s per GPU, respectively.<\/p>\n\n\n\n We used the transformer decoder architecture, a left-to-right, generative transformer-based language model, with 530 billion parameters. The number of layers, hidden dimensions, and attention heads are 105, 20480, and 128, respectively.<\/p>\n\n\n\n We used 8-way tensor and 35-way pipeline parallelism. The sequence length is 2048 and the global batch size is 1920. Over the first 12 billion training tokens, we gradually increased the batch size in increments of 32, starting at 32, until we reached the final batch size of 1920. We used one billion tokens for the learning rate warmup in our training.<\/p>\n\n\n\n We largely built our training dataset based on prior work, The Pile. First, we selected the subset of datasets (the top 11 rows in Figure 2, below) from The Pile that we found to be of the highest relative quality. 
Then, following an approach similar to that used to generate Pile-CC, we downloaded and filtered two recent Common Crawl (CC) snapshots.<\/p>\n\n\n\n The steps we took for the CC data included text extraction from raw HTML files, scoring the extracted documents using a classifier trained on high-quality data, and filtering documents according to their scores. To diversify the training data, we also collected the RealNews and CC-Stories datasets.<\/p>\n\n\n\n Document deduplication is necessary when building training datasets because the same content can be present in multiple documents across datasets. We used a fuzzy, document-level deduplication process: min-hash LSH to compute a sparse document graph, followed by finding its connected components to identify groups of duplicate documents.<\/p>\n\n\n\n We then used a priority order based on dataset quality to select a representative document from the duplicates in each connected component. Finally, we used n<\/em>-gram-based filtering to remove downstream task data from the training datasets to avoid contamination.<\/p>\n\n\n\n We ended up with a set of 15 datasets consisting of a total of 339 billion tokens. During training, we opted to blend the datasets into heterogeneous batches according to the variable sampling weights given in Figure 2, with an emphasis on higher-quality datasets. We trained the model on 270 billion tokens.<\/p>\n\n\n\nLarge-scale language models<\/h2>\n\n\n\n
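The fuzzy deduplication step described above can be sketched as follows. This is a simplified, self-contained illustration: the shingle size, hash count, and threshold are arbitrary choices, and the real pipeline uses min-hash LSH to avoid the all-pairs comparison shown here:

```python
# Simplified sketch of fuzzy, document-level deduplication: MinHash
# signatures approximate Jaccard similarity between documents, similar
# pairs form edges of a document graph, and connected components group
# the duplicates.
import hashlib
from itertools import combinations

NUM_HASHES = 64

def shingles(text: str, n: int = 3) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash(doc_shingles: set) -> list:
    # One signature slot per seeded hash function: keep the minimum
    # hash value of any shingle under that seed.
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in doc_shingles
        )
        for seed in range(NUM_HASHES)
    ]

def dedup_components(docs: dict, threshold: float = 0.8) -> list:
    sigs = {k: minhash(shingles(v)) for k, v in docs.items()}
    # Union-find over edges whose estimated Jaccard similarity >= threshold.
    parent = {k: k for k in docs}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in combinations(docs, 2):
        sim = sum(x == y for x, y in zip(sigs[a], sigs[b])) / NUM_HASHES
        if sim >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for k in docs:
        groups.setdefault(find(k), []).append(k)
    return list(groups.values())

docs = {
    "a": "the quick brown fox jumps over the lazy dog today",
    "b": "the quick brown fox jumps over the lazy dog today",
    "c": "completely different text about training language models at scale",
}
print(dedup_components(docs))  # "a" and "b" land in the same component
```

In the full pipeline one representative per component is then kept according to the dataset-quality priority order described above.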
Large-scale training infrastructure<\/h2>\n\n\n\n
Software design<\/h3>\n\n\n\n
Hardware system<\/h3>\n\n\n\n
System throughput<\/h3>\n\n\n\n
Training dataset and model configuration<\/h2>\n\n\n\n