{"id":783490,"date":"2021-10-11T06:00:00","date_gmt":"2021-10-11T13:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=783490"},"modified":"2021-10-11T05:52:10","modified_gmt":"2021-10-11T12:52:10","slug":"using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model\/","title":{"rendered":"Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World\u2019s Largest and Most Powerful Generative Language Model"},"content":{"rendered":"\n<p>We are excited to introduce the DeepSpeed- and Megatron-powered Megatron-Turing Natural Language Generation model (MT-NLG), the largest and the most powerful monolithic transformer language model trained to date, with 530 billion parameters. It is the result of a research collaboration between Microsoft and NVIDIA to further parallelize and optimize the training of very large AI models.<\/p>\n\n\n\n<p>As the successor to <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/turing-nlg-a-17-billion-parameter-language-model-by-microsoft\/\">Turing NLG 17B<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/NVIDIA\/Megatron-LM\" target=\"_blank\" rel=\"noreferrer noopener\">Megatron-LM<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, MT-NLG has 3x the number of parameters compared to the existing largest model of this type and demonstrates unmatched accuracy in a broad set of natural language tasks such as:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Completion prediction<\/li><li>Reading comprehension<\/li><li>Commonsense reasoning<\/li><li>Natural language inferences<\/li><li>Word sense disambiguation<\/li><\/ul>\n\n\n\n<p>The 105-layer, transformer-based MT-NLG improved upon the prior state-of-the-art models in zero-, one-, and few-shot settings and set the new standard for large-scale language models in both model scale and quality.<\/p>\n\n\n\n<h2 id=\"large-scale-language-models\">Large-scale language models<\/h2>\n\n\n\n<p>Transformer-based language models in natural language processing (NLP) have driven rapid progress in recent years fueled by computation at scale, large datasets, and advanced algorithms and software to train these models.<\/p>\n\n\n\n<p>Language models with large numbers of parameters, more data, and more training time acquire a richer, more nuanced understanding of language. As a result, they generalize well as effective zero\u2013 or few-shot learners, with high accuracy on many NLP tasks and datasets. Exciting downstream applications include summarization, automatic dialogue generation, translation, semantic search, and code autocompletion. It\u2019s no surprise that the number of parameters in state-of-the-art NLP models have grown at an exponential rate (Figure 1).<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"Figure 1. Trend of sizes of state-of-the-art NLP models over time\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2021\/10\/model-size-graph.jpg\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"661\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2021\/10\/model-size-graph-1024x661.jpg\" alt=\"Figure 1. Trend of sizes of state-of-the-art NLP models over time\" class=\"wp-image-783499\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2021\/10\/model-size-graph-1024x661.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2021\/10\/model-size-graph-300x194.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2021\/10\/model-size-graph-768x496.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2021\/10\/model-size-graph-1536x991.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2021\/10\/model-size-graph-240x155.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2021\/10\/model-size-graph.jpg 1956w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption>Figure 1. Trend of sizes of state-of-the-art NLP models over time<\/figcaption><\/figure><\/div>\n\n\n\n<p>Training such models, however, is challenging for two main reasons:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>It is no longer possible to fit the parameters of these models in the memory of even the largest GPU.<\/li><li>The large number of compute operations required can result in unrealistically long training times if special attention is not paid to optimizing the algorithms, software, and hardware stack all together.<\/li><\/ul>\n\n\n\n<p>Training MT-NLG was made feasible by numerous innovations and breakthroughs along all AI axes. For example, working closely together, NVIDIA and Microsoft achieved an unprecedented training efficiency by converging a state-of-the-art GPU-accelerated training infrastructure with a cutting-edge distributed learning software stack. We built high-quality, natural language training corpora with hundreds of billions of tokens, and co-developed training recipes to improve optimization efficiency and stability.<\/p>\n\n\n\n<p>In this post, we elaborate on each aspect of the training and describe our methods as well as results.<\/p>\n\n\n\n<h2 id=\"large-scale-training-infrastructure\">Large-scale training infrastructure<\/h2>\n\n\n\n<p>Powered by NVIDIA A100 Tensor Core GPUs and HDR InfiniBand networking, state-of-the-art supercomputing clusters such as the NVIDIA <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.top500.org\/system\/179842\/\">Selene<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and Microsoft Azure <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/virtual-machines\/nda100-v4-series\">NDv4<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> have enough compute power to train models with trillions of parameters within a reasonable timeframe. However, achieving the full potential of these supercomputers requires parallelism across thousands of GPUs, efficient and scalable on both memory and compute.&nbsp;<\/p>\n\n\n\n<p>In isolation, existing parallelism strategies such as data, pipeline, or tensor-slicing have trade-offs in memory and compute efficiency and cannot be used to train models at this scale.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Data parallelism achieves good compute efficiency, but it replicates model states and cannot leverage aggregate distributed memory.<\/li><li>Tensor-slicing requires significant communication between GPUs that limits compute efficiency beyond a single node where high-bandwidth NVLink is not available.<\/li><li>Pipeline parallelism can scale efficiently across nodes. However, to be compute-efficient, it requires large batch sizes, coarse grain parallelism, and perfect load balancing, which is not possible at scale.<\/li><\/ul>\n\n\n\n<h3 id=\"software-design\">Software design<\/h3>\n\n\n\n<p>Through a collaboration between NVIDIA <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/NVIDIA\/Megatron-LM\">Megatron-LM<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and Microsoft <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/microsoft\/DeepSpeed\">DeepSpeed<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, we created an efficient and scalable 3D parallel system capable of combining data, pipeline, and tensor-slicing based parallelism together to address these challenges.<\/p>\n\n\n\n<p>By combining tensor-slicing and pipeline parallelism, we can operate them within the regime where they are most effective. More specifically, the system uses tensor-slicing from Megatron-LM to scale the model within a node and uses pipeline parallelism from DeepSpeed to scale the model across nodes.<\/p>\n\n\n\n<p>For example, for the 530 billion model, each model replica spans 280 NVIDIA A100 GPUs, with 8-way tensor-slicing within a node and 35-way pipeline parallelism across nodes. We then use data parallelism from DeepSpeed to scale out further to thousands of GPUs.<\/p>\n\n\n\n<h3 id=\"hardware-system\">Hardware system<\/h3>\n\n\n\n<p>Model training is done with mixed precision on the NVIDIA DGX SuperPOD-based <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.top500.org\/system\/179842\/\">Selene<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> supercomputer powered by 560 DGX A100 servers networked with HDR InfiniBand in a full fat tree configuration. Each DGX A100 has eight NVIDIA A100 80GB Tensor Core GPUs, fully connected to each other by<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/nvlink\/\"> NVLink and NVSwitch<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. A similar reference architecture is used by Microsoft for Azure NDv4 cloud supercomputers.<\/p>\n\n\n\n<h3 id=\"system-throughput\">System throughput<\/h3>\n\n\n\n<p>We considered the end-to-end throughput of our system for the 530 billion parameters model with batch size 1920 on 280, 350, and 420 DGX A100 servers on Selene. We observed iteration time of 60.1, 50.2, and 44.4 seconds, respectively. These correspond to 126, 121, and 113 teraFLOP\/s per GPU, respectively.<\/p>\n\n\n\n<h2 id=\"training-dataset-and-model-configuration\">Training dataset and model configuration<\/h2>\n\n\n\n<p>We used the architecture of the transformer decoder, which is a left-to-right generative transformer-based language model consisting of 530 billion parameters. The number of layers, hidden dimensions, and attention heads are 105, 20480, and 128, respectively.<\/p>\n\n\n\n<p>We used an 8-way tensor and 35-way pipeline parallelism. The sequence length is 2048 and the global batch size is 1920. Over the first 12 billion training tokens, we gradually increased the batch size by 32, starting at 32, until we reach the final batch size of 1920. We used one billion tokens for the learning rate warmup in our training.<\/p>\n\n\n\n<p>We largely built our training dataset based on prior work, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2101.00027\">The Pile<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. First, we selected the subset of datasets (the top 11 rows in Figure 2, below) from The Pile that we found to be of the highest relative quality. Then, following a similar approach as that used to generate <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2101.00027\">Pile-CC<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, we downloaded and filtered two recent <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/commoncrawl.org\/the-data\/get-started\/\">Common Crawl (CC) snapshots<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<p>The steps we took for the CC data included text extraction from raw HTML files, scoring extracted documents using a classifier trained on high-quality data, and filtering documents according to their scores. To diversify the training, we also collected the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/pdf\/1905.12616v3.pdf\">RealNews<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/1806.02847\">CC-Stories<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> datasets.<\/p>\n\n\n\n<p>Document deduplication is necessary in building training datasets because the same content can be present in multiple documents of different datasets. We used a fuzzy deduplication process at the document level using min-hash LSH to compute a sparse document graph and the connected components in it to identify duplicate documents.<\/p>\n\n\n\n<p>We then used a priority order based on the quality of the datasets when selecting a representative document from the duplicate documents in each connected component. Finally, we used <em>n<\/em>-gram based filtering to remove downstream task data from the training datasets to avoid contamination.<\/p>\n\n\n\n<p>We ended with a set of 15 datasets consisting of a total of 339 billion tokens. During training, we opted to blend the datasets into heterogeneous batches according to variable sampling weights given in Figure 2, with an emphasis on higher-quality datasets. We trained the model on 270 billion tokens.<\/p>\n\n\n\n<figure class=\"wp-block-table aligncenter is-style-stripes\"><table><thead><tr><th>Dataset<\/th><th>Dataset source<\/th><th class=\"has-text-align-center\" data-align=\"center\">Tokens (billions)<\/th><th class=\"has-text-align-center\" data-align=\"center\">Weight (%)<\/th><th class=\"has-text-align-center\" data-align=\"center\">Epochs<\/th><\/tr><\/thead><tbody><tr><td><strong>Books3<\/strong><\/td><td>Pile dataset<\/td><td class=\"has-text-align-center\" data-align=\"center\">25.7<\/td><td class=\"has-text-align-center\" data-align=\"center\">14.3<\/td><td class=\"has-text-align-center\" data-align=\"center\">1.5<\/td><\/tr><tr><td><strong>OpenWebText2<\/strong><\/td><td> Pile dataset <\/td><td class=\"has-text-align-center\" data-align=\"center\">14.8<\/td><td class=\"has-text-align-center\" data-align=\"center\">19.3<\/td><td class=\"has-text-align-center\" data-align=\"center\">3.6<\/td><\/tr><tr><td><strong>Stack Exchange<\/strong><\/td><td> Pile dataset <\/td><td class=\"has-text-align-center\" data-align=\"center\">11.6<\/td><td class=\"has-text-align-center\" data-align=\"center\">5.7<\/td><td class=\"has-text-align-center\" data-align=\"center\">1.4<\/td><\/tr><tr><td><strong>PubMed Abstracts<\/strong><\/td><td> Pile dataset <\/td><td class=\"has-text-align-center\" data-align=\"center\">4.4<\/td><td class=\"has-text-align-center\" data-align=\"center\">2.9<\/td><td class=\"has-text-align-center\" data-align=\"center\">1.8<\/td><\/tr><tr><td><strong>Wikipedia<\/strong><\/td><td> Pile dataset <\/td><td class=\"has-text-align-center\" data-align=\"center\">4.2<\/td><td class=\"has-text-align-center\" data-align=\"center\">4.8<\/td><td class=\"has-text-align-center\" data-align=\"center\">3.2<\/td><\/tr><tr><td><strong>Gutenberg (PG-19)<\/strong><\/td><td> Pile dataset <\/td><td class=\"has-text-align-center\" data-align=\"center\">2.7<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.9<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.9<\/td><\/tr><tr><td><strong>BookCorpus2<\/strong><\/td><td> Pile dataset <\/td><td class=\"has-text-align-center\" data-align=\"center\">1.5<\/td><td class=\"has-text-align-center\" data-align=\"center\">1.0<\/td><td class=\"has-text-align-center\" data-align=\"center\">1.8<\/td><\/tr><tr><td><strong>NIH ExPorter<\/strong><\/td><td> Pile dataset <\/td><td class=\"has-text-align-center\" data-align=\"center\">0.3<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.2<\/td><td class=\"has-text-align-center\" data-align=\"center\">1.8<\/td><\/tr><tr><td><strong>Pile-CC<\/strong><\/td><td> Pile dataset <\/td><td class=\"has-text-align-center\" data-align=\"center\">49.8<\/td><td class=\"has-text-align-center\" data-align=\"center\">9.4<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.5<\/td><\/tr><tr><td><strong>ArXiv<\/strong><\/td><td> Pile dataset <\/td><td class=\"has-text-align-center\" data-align=\"center\">20.8<\/td><td class=\"has-text-align-center\" data-align=\"center\">1.4<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.2<\/td><\/tr><tr><td><strong>GitHub<\/strong><\/td><td> Pile dataset <\/td><td class=\"has-text-align-center\" data-align=\"center\">24.3<\/td><td class=\"has-text-align-center\" data-align=\"center\">1.6<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.2<\/td><\/tr><tr><td><strong>CC-2020-50<\/strong><\/td><td>Common Crawl (CC) snapshot<\/td><td class=\"has-text-align-center\" data-align=\"center\">68.7<\/td><td class=\"has-text-align-center\" data-align=\"center\">13.0<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.5<\/td><\/tr><tr><td><strong>CC-2021-04<\/strong><\/td><td>Common Crawl (CC) snapshot<\/td><td class=\"has-text-align-center\" data-align=\"center\">82.6<\/td><td class=\"has-text-align-center\" data-align=\"center\">15.7<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.5<\/td><\/tr><tr><td><strong>RealNews<\/strong><\/td><td>RealNews<\/td><td class=\"has-text-align-center\" data-align=\"center\">21.9<\/td><td class=\"has-text-align-center\" data-align=\"center\">9.0<\/td><td class=\"has-text-align-center\" data-align=\"center\">1.1<\/td><\/tr><tr><td><strong>CC-Stories<\/strong><\/td><td>Common Crawl (CC) stories<\/td><td class=\"has-text-align-center\" data-align=\"center\">5.3<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.9<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.5<\/td><\/tr><\/tbody><\/table><figcaption><em>Figure 2. Datasets used to train the MT-NLG model. <\/em><\/figcaption><\/figure>\n\n\n\n<h2 id=\"results-and-achievements\">Results and achievements<\/h2>\n\n\n\n<p>Recent work in language models (LM) has demonstrated that a strong pretrained model can often perform competitively in a wide range of NLP tasks without finetuning.<\/p>\n\n\n\n<p>To understand how scaling up LMs strengthens their zero-shot or few-shot learning capabilities, we evaluated MT-NLG and demonstrate that it establishes new top results across several categories of NLP tasks. To ensure the evaluation was comprehensive, we selected eight tasks spanning five different areas:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>In the text prediction task LAMBADA, the model predicts the last word of a given paragraph.<\/li><li>In the reading comprehension tasks RACE-h and BoolQ, the model generates answers to questions based on a given paragraph.<\/li><li>In the commonsense reasoning tasks PiQA, HellaSwag, and Winogrande, each required some level of commonsense knowledge beyond statistical patterns of language to solve.<\/li><li>For natural language inference, two hard benchmarks, ANLI-R2 and HANS target the typical failure cases of past models.<\/li><li>The word sense disambiguation task WiC evaluates polysemy understanding from context.<\/li><\/ul>\n\n\n\n<p>To encourage reproducibility, we based our evaluation setting on the open-source project <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/EleutherAI\/lm-evaluation-harness\">lm-evaluation-harness<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and made task-specific changes as appropriate to align our setting more closely with prior work. We evaluated MT-NLG in zero-, one-, and few-shot settings without performing search for the optimal number of shots.<\/p>\n\n\n\n<p>Figure 3 shows the results for the accuracy metric. We ran the evaluation on the test set if it was publicly available; otherwise, we reported numbers on the dev set. This led to reporting LAMBADA, RACE-h, and ANLI-R2 on test sets and other tasks on dev sets.<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><thead><tr><th>Category<\/th><th>Tasks<\/th><th class=\"has-text-align-center\" data-align=\"center\">Zero-shot<\/th><th class=\"has-text-align-center\" data-align=\"center\">One-shot<\/th><th class=\"has-text-align-center\" data-align=\"center\">Few-shot<\/th><\/tr><\/thead><tbody><tr><td>Completion prediction<\/td><td><strong>Lambada<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\">0.766*<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.731*<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.872*<\/td><\/tr><tr><td>Reading comprehension<\/td><td><strong>BoolQ<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\">0.782<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.825<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.848<\/td><\/tr><tr><td> Reading comprehension <\/td><td><strong>RACE-h<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\">0.479<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.484<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.479<\/td><\/tr><tr><td>Commonsense reasoning<\/td><td><strong>PiQA<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\">0.820*<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.810*<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.832*<\/td><\/tr><tr><td> Commonsense reasoning<\/td><td><strong>HellaSwag<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\">0.802<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.802<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.824<\/td><\/tr><tr><td> Commonsense reasoning<\/td><td><strong>WinoGrande<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\">0.730<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.737<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.789<\/td><\/tr><tr><td>Natural language inference<\/td><td><strong>ANLI-R2<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\">0.366<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.397<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.396<\/td><\/tr><tr><td>Natural language inference<\/td><td><strong>HANS<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\">0.607<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.649<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.702<\/td><\/tr><tr><td>Word sense disambiguation<\/td><td><strong>WiC<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\">0.486<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.513<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.585<\/td><\/tr><\/tbody><\/table><figcaption>Figure 3. Accuracy results for zero-, one-, and few-shot evaluations using MT-NLG. MT-NLG established the top results on the PiQA dev set and LAMBADA test set in all three settings (denoted by *) and top results among similar monolithic models in other categories.<\/figcaption><\/figure>\n\n\n\n<p>Take few-shot performance as an example. We observed encouraging improvements particularly in tasks involving comparison or finding relations between two sentences (for example, WiC and ANLI), a task category that is challenging for prior models. We observed similar improvements for most tasks in zero-shot and one-shot evaluation as well. We should also note that this model is trained on fewer tokens than previous models, showing the ability of larger models to learn even faster.<\/p>\n\n\n\n<p>For the HANS dataset, we did not find any baseline that reports dataset-wide metrics. According to the analysis by the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/1902.01007\">HANS paper<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, BERT baselines trained on <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/cims.nyu.edu\/~sbowman\/multinli\/paper.pdf\">MNLI<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> performs near-perfect on half of its subcategories while near-zero on the other half. That indicates that they are strongly dependent on the spurious syntactic heuristics identified by the paper.<\/p>\n\n\n\n<p>While our model still struggles, it predicts more than half the cases correct in zero-shot and improves even further when we present as few as one- and four-shots. Finally, in zero-, one-, and few-shot settings, our model established top results on the PiQA dev set and LAMBADA test set.<\/p>\n\n\n\n<p>In addition to reporting aggregate metrics on benchmark tasks, we also qualitatively analyzed model outputs and have intriguing findings (Figure 4). We observed that the model can infer basic mathematical operations from context (sample 1), even when the symbols are badly obfuscated (sample 2). While far from claiming numeracy, the model seems to go beyond only memorization for arithmetic.<\/p>\n\n\n\n<p>We also show samples (the last row in Figure 4) from the HANS task where we posed the task containing simple syntactic structures as a question and prompted the model for an answer. Despite the structures being simple, existing natural language inference (NLI) models often have a hard time with such inputs. <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/pdf\/1902.01007.pdf\">Fine-tuned models<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> often pick up spurious associations between certain syntactic structures and entailment relations from systemic biases in NLI datasets. MT-NLG performs competitively in such cases without finetuning. &nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><thead><tr><th class=\"has-text-align-center\" data-align=\"center\">Tasks<\/th><th class=\"has-text-align-center\" data-align=\"center\">Sample 1<\/th><th class=\"has-text-align-center\" data-align=\"center\">Sample 2<\/th><\/tr><\/thead><tbody><tr><td class=\"has-text-align-center\" data-align=\"center\"><br><br><br><br><br><br>Inferring mathematical operations<\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Prompt<\/strong>:<br>Q: 11 # 2 = ?<br>A: 5.5<br>Q: 18 # 9 = ?<br>A: 2<br>Q: 35 # 4 = ?<br>A: 8.75<br>Q: 27 # 9 = ?<br>A: 3<br>Q: 75 # 6 = ?<br>A:<br><br><strong>MT-NLG<\/strong>:<br>12.5<br>Q: 45 # 5 = ?<br>A: 9<br>Q: 18 # 6 = ?<br>A: 3<\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Prompt<\/strong>:<br>Q: 11 $^% 2 = ?<br>A: 22<br>&nbsp;Q: 3 $^% 9 = ?<br>A: 27<br>&nbsp;Q: 12 $^% 4 = ?<br>A: 48<br>&nbsp;Q: 7 $^% 7 = ?<br>A: 49<br>&nbsp;Q: 8 $^% 5 = ?<br>A:<br><br><strong>MT-NLG<\/strong>:<br>40<br>Q: 9 $^% 6 = ?<br>A: 54<br>Q: 10 $^% 5 = ?<br>A: 50<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\"><br><br><br>Natural language inference on different syntactic structures<\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Prompt<\/strong>:<br>The banker forgot that the judge advised the professors.<br>Question: The judge advised the professors. True or False?<br>Answer:<br><br><strong>MT-NLG<\/strong>:<br>True. The banker forgot that the judge advised the professors.<\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Prompt<\/strong>:<br>Famous professors supported the secretary.<br>Question: Professors supported the secretary. True or False?<br>Answer:<br><br><strong>MT-NLG<\/strong>:<br>True. The secretary was supported by famous professors.<\/td><\/tr><\/tbody><\/table><figcaption>Figure 4. Samples from MT-NLG on inferring mathematical operations and natural language inference under different syntactic structures.<\/figcaption><\/figure>\n\n\n\n<h2 id=\"bias-in-language-models\">Bias in language models<strong><\/strong><\/h2>\n\n\n\n<p>While giant language models are advancing the state of the art on language generation, they also suffer from issues such as bias and toxicity. Understanding and removing these problems in language models is under active research by the AI community, including at Microsoft and NVIDIA.<\/p>\n\n\n\n<p>Our observations with MT-NLG are that the model picks up stereotypes and biases from the data on which it is trained. Microsoft and NVIDIA are committed to working on addressing this problem. We encourage continued research to help in quantifying the bias of the model.<\/p>\n\n\n\n<p>In addition, any use of MT-NLG in production scenarios must ensure that proper measures are put in place to mitigate and minimize potential harm to users. All work should follow such principles as those found in the <a href=\"https:\/\/www.microsoft.com\/en-us\/ai\/responsible-ai\">Microsoft Responsible AI Principles<\/a>. Those principles emphasize that fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability are considered to be the key cornerstones of a responsible and trustworthy approach to development and use of AI.<\/p>\n\n\n\n<h2 id=\"conclusion\">Conclusion<strong><\/strong><\/h2>\n\n\n\n<p>We live in a time where AI advancements are far outpacing Moore\u2019s law. We continue to see more computation power being made available with newer generations of GPUs, interconnected at lightning speeds. At the same time, we continue to see hyperscaling of AI models leading to better performance, with seemingly no end in sight.<\/p>\n\n\n\n<p>Marrying these two trends together are software innovations that push the boundaries of optimization and efficiency. MT-NLG is an example of what is possible when supercomputers like NVIDIA Selene or Microsoft Azure NDv4 are used with software breakthroughs of Megatron-LM and DeepSpeed to train large language AI models.<\/p>\n\n\n\n<p>The quality and results that we have obtained today are a big step forward in the journey towards unlocking the full promise of AI in natural language. The innovations of DeepSpeed and Megatron-LM will benefit existing and future AI model development and make large AI models cheaper and faster to train. We look forward to how MT-NLG will shape tomorrow\u2019s products and motivate the community to push the boundaries of NLP even further. The journey is long and far from complete, but we are excited by what is possible and what lies ahead.<\/p>\n\n\n\n<h2 id=\"contributors\">Contributors<\/h2>\n\n\n\n<p>This project was made possible by the contributions of the following people:<\/p>\n\n\n\n<p><strong><em>NVIDIA: <\/em><\/strong><em>Mostofa Patwary, Mohammad Shoeybi, Patrick LeGresley, Shrimai Prabhumoye, Jared Casper, Vijay Korthikanti, Vartika Singh, Julie Bernauer, Michael Houston, Bryan Catanzaro<\/em><\/p>\n\n\n\n<p><strong><em>Microsoft<\/em><\/strong><em>: <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/shsmit\/\">Shaden Smith<\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.linkedin.com\/in\/bnorick\/\" target=\"_blank\" rel=\"noreferrer noopener\">Brandon Norick<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/samyamr\/\">Samyam Rajbhandari<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/zhunliu\/\">Zhun Liu<\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.linkedin.com\/in\/georgezerveas\/\" target=\"_blank\" rel=\"noreferrer noopener\">George Zerveas<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/eltonz\/\">Elton Zheng<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/reyazda\/\">Reza Yazdani Aminabadi<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/xiaso\/\">Xia Song<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/yuxhe\/\">Yuxiong He<\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.linkedin.com\/in\/jeffrey-s-zhu\/\" target=\"_blank\" rel=\"noreferrer noopener\">Jeffrey Zhu<\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.linkedin.com\/in\/jennifer-cruzan-248865132\/\" target=\"_blank\" rel=\"noreferrer noopener\">Jennifer Cruzan<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/umeshma\/\">Umesh Madan<\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.linkedin.com\/in\/luisvargas2\/\" target=\"_blank\" rel=\"noreferrer noopener\">Luis Vargas<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/satiwary\/\">Saurabh Tiwary<\/a><\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>We are excited to introduce the DeepSpeed- and Megatron-powered Megatron-Turing Natural Language Generation model (MT-NLG), the largest and the most powerful monolithic transformer language model trained to date, with 530 billion parameters. It is the result of a research collaboration between Microsoft and NVIDIA to further parallelize and optimize the training of very large AI [&hellip;]<\/p>\n","protected":false},"author":35981,"featured_media":783661,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-783490","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[691494,678390,649749],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Ali Alvi","user_id":38919,"display_name":"Ali Alvi","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/alialvi\/\" aria-label=\"Visit the profile page for Ali Alvi\">Ali Alvi<\/a>","is_active":false,"last_first":"Alvi, Ali","people_section":0,"alias":"alialvi"},{"type":"guest","value":"paresh-kharya","user_id":"783535","display_name":"Paresh Kharya","author_link":"<a href=\"https:\/\/blogs.nvidia.com\/blog\/author\/pareshkharya\/\" aria-label=\"Visit the profile page for Paresh Kharya\">Paresh Kharya<\/a>","is_active":true,"last_first":"Kharya, Paresh","people_section":0,"alias":"paresh-kharya"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2021\/10\/model-size-graph_1400x788-960x540.jpg\" class=\"img-object-cover\" alt=\"Figure 1. Trend of sizes of state-of-the-art NLP models over time\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2021\/10\/model-size-graph_1400x788-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2021\/10\/model-size-graph_1400x788-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2021\/10\/model-size-graph_1400x788-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2021\/10\/model-size-graph_1400x788-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2021\/10\/model-size-graph_1400x788-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2021\/10\/model-size-graph_1400x788-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2021\/10\/model-size-graph_1400x788-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2021\/10\/model-size-graph_1400x788-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2021\/10\/model-size-graph_1400x788-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2021\/10\/model-size-graph_1400x788-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2021\/10\/model-size-graph_1400x788.jpg 1400w\" sizes=\"(max-width: 960px) 100vw, 960px\" \/>","byline":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/alialvi\/\" title=\"Go to researcher profile for Ali Alvi\" aria-label=\"Go to researcher profile for Ali Alvi\" data-bi-type=\"byline author\" data-bi-cN=\"Ali Alvi\">Ali Alvi<\/a> and <a href=\"https:\/\/blogs.nvidia.com\/blog\/author\/pareshkharya\/\" title=\"Go to researcher profile for Paresh Kharya\" aria-label=\"Go to researcher profile for Paresh Kharya\" data-bi-type=\"byline author\" data-bi-cN=\"Paresh Kharya\">Paresh Kharya<\/a>","formattedDate":"October 11, 2021","formattedExcerpt":"We are excited to introduce the DeepSpeed- and Megatron-powered Megatron-Turing Natural Language Generation model (MT-NLG), the largest and the most powerful monolithic transformer language model trained to date, with 530 billion parameters. It is the result of a research collaboration between Microsoft and NVIDIA to&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/783490"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/35981"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=783490"}],"version-history":[{"count":19,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/783490\/revisions"}],"predecessor-version":[{"id":783616,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/783490\/revisions\/783616"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/783661"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=783490"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=783490"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=783490"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=783490"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=783490"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=783490"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=783490"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=783490"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=783490"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=783490"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=783490"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}