{"id":635250,"date":"2020-02-13T13:14:29","date_gmt":"2020-02-10T17:04:49","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=635250"},"modified":"2020-02-13T13:14:31","modified_gmt":"2020-02-13T21:14:31","slug":"turing-nlg-a-17-billion-parameter-language-model-by-microsoft","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/turing-nlg-a-17-billion-parameter-language-model-by-microsoft\/","title":{"rendered":"Turing-NLG: A 17-billion-parameter language model by Microsoft"},"content":{"rendered":"
This figure was adapted from a similar image published in DistilBERT.

"Turing Natural Language Generation (T-NLG) is a 17 billion parameter language model by Microsoft that outperforms the state of the art on many downstream NLP tasks. We present a demo of the model, including its freeform generation, question answering, and summarization capabilities, to academics for feedback and research purposes. <|endoftext|>"

– This summary was generated by the Turing-NLG language model itself.

Massive deep learning language models (LMs), such as BERT and GPT-2, with billions of parameters learned from essentially all the text published on the internet, have improved the state of the art on nearly every downstream natural language processing (NLP) task, including question answering, conversational agents, and document understanding, among others.

Better natural language generation can be transformational for a variety of applications, such as assisting authors with composing their content, saving time by summarizing a long piece of text, or improving customer experience with digital assistants. Following the trend that larger natural language models lead to better results, Microsoft Project Turing is introducing Turing Natural Language Generation (T-NLG), the largest model ever published at 17 billion parameters, which outperforms the state of the art on a variety of language modeling benchmarks and also excels when applied to numerous practical tasks, including summarization and question answering. This work would not be possible without breakthroughs produced by the DeepSpeed library (compatible with PyTorch) and the ZeRO optimizer, which are explored further in the accompanying blog post.

We are releasing a private demo of T-NLG, including its freeform generation, question answering, and summarization capabilities, to a small set of users within the academic community for initial testing and feedback.

T-NLG: Benefits of a large generative language model

T-NLG is a Transformer-based generative language model, which means it can generate words to complete open-ended textual tasks. In addition to completing an unfinished sentence, it can generate direct answers to questions and summaries of input documents.

Generative models like T-NLG are important for NLP tasks since our goal is to respond as directly, accurately, and fluently as humans can in any situation. Previously, systems for question answering and summarization relied on extracting existing content from documents that could serve as a stand-in answer or summary, but such extracted content often appears unnatural or incoherent. With T-NLG we can naturally summarize or answer questions about a personal document or email thread.

We have observed that the bigger the model and the more diverse and comprehensive the pretraining data, the better it performs at generalizing to multiple downstream tasks even with fewer training examples. Therefore, we believe it is more efficient to train a large centralized multi-task model and share its capabilities across numerous tasks rather than train a new model for every task individually.
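T-NLG itself is available only through the private demo, so the following minimal sketch uses the public GPT-2 checkpoint from the Hugging Face transformers library purely as a stand-in to illustrate how a single generative language model can cover completion, question answering, and summarization through prompting alone. The model name and prompts are illustrative assumptions, not part of the original post.

```python
# Illustrative only: GPT-2 stands in for T-NLG, which is not publicly released.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# 1. Freeform generation: complete an unfinished sentence.
print(generator("Project Turing is introducing a new language model that",
                max_new_tokens=30)[0]["generated_text"])

# 2. Direct question answering, posed as a completion prompt.
print(generator("Question: What is the capital of France?\nAnswer:",
                max_new_tokens=10)[0]["generated_text"])

# 3. Abstractive summarization via a TL;DR-style prompt.
document = ("Turing Natural Language Generation (T-NLG) is a 17 billion parameter "
            "language model that outperforms the state of the art on many NLP tasks.")
print(generator(document + "\nTL;DR:", max_new_tokens=30)[0]["generated_text"])
```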
Pretraining T-NLG: Hardware and software breakthroughs

Any model with more than 1.3 billion parameters cannot fit into a single GPU (even one with 32GB of memory), so the model itself must be parallelized, or broken into pieces, across multiple GPUs. We took advantage of several hardware and software breakthroughs to train T-NLG:

1. We leverage an NVIDIA DGX-2 hardware setup, with InfiniBand connections so that communication between GPUs is faster than previously achieved.

2. We apply tensor slicing to shard the model across four NVIDIA V100 GPUs on the NVIDIA Megatron-LM framework.

3. DeepSpeed with ZeRO allowed us to reduce the model-parallelism degree (from 16 to 4), increase the batch size per node fourfold, and reduce training time threefold. DeepSpeed makes training very large models more efficient with fewer GPUs: it trains at a batch size of 512 with only 256 NVIDIA GPUs, compared to the 1,024 NVIDIA GPUs needed when using Megatron-LM alone. DeepSpeed is compatible with PyTorch (a minimal configuration sketch follows this list).
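As a rough illustration of the kind of setup item 3 describes, here is a minimal, hypothetical DeepSpeed + ZeRO wrapper for a PyTorch model. The batch size, FP16 flag, and peak learning rate echo values reported in this post, but the placeholder model, the ZeRO stage, and the exact keyword arguments (which have varied across DeepSpeed releases) are assumptions rather than the actual T-NLG training code.

```python
import deepspeed
import torch.nn as nn

# Placeholder model: a real setup would build the full 78-layer Transformer here.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4256, 4256)

    def forward(self, x):
        return self.proj(x)

# Config values echo the post (global batch size 512, FP16, peak LR 1.5e-4);
# ZeRO stage 1 (optimizer-state partitioning) is an assumption.
ds_config = {
    "train_batch_size": 512,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "Adam", "params": {"lr": 1.5e-4}},
}

model = ToyModel()
# Returns the DeepSpeed engine plus optimizer, dataloader, and LR scheduler handles.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```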
The resulting T-NLG model has 78 Transformer layers with a hidden size of 4256 and 28 attention heads. To make results comparable to Megatron-LM, we pretrained the model with the same hyperparameters and learning schedule as Megatron-LM, using autoregressive generation loss for 300,000 steps at a batch size of 512 on sequences of 1024 tokens. The learning schedule followed 3,200 steps of linear warmup up to a maximum learning rate of 1.5×10⁻⁴ and cosine decay over 500,000 steps, with FP16. We trained the model on the same type of data that the Megatron-LM models were trained on.
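As a sanity check on the headline parameter count, a rough estimate can be derived from the architecture above. The sketch below uses the common approximation of about 12·h² parameters per Transformer layer (attention plus feed-forward blocks, ignoring biases and layer norms) and assumes a GPT-2-style vocabulary of roughly 50,000 tokens for the embedding matrix; these assumptions are not stated in the post, so the exact total will differ.

```python
# Back-of-the-envelope parameter count for the reported T-NLG architecture.
layers = 78
hidden = 4256
vocab = 50_000  # assumed GPT-2-style vocabulary size

per_layer = 12 * hidden ** 2   # ~217M per layer: 4h^2 attention + 8h^2 feed-forward
embedding = vocab * hidden     # token embedding matrix, ~0.2B
total = layers * per_layer + embedding

print(f"{total / 1e9:.1f} billion parameters")  # prints ~17.2 billion
```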
We also compared the performance of the pretrained T-NLG model on standard language tasks such as WikiText-103 perplexity (lower is better) and LAMBADA next-word prediction accuracy (higher is better). The table below shows that we achieve the new state of the art on both LAMBADA and WikiText-103. (Megatron-LM denotes the publicly released results from the NVIDIA Megatron model.)

*OpenAI used additional processing (stopword filtering) to achieve higher numbers than the model achieved alone. Neither Megatron nor T-NLG uses this stopword filtering technique.

Figure 1 below shows how T-NLG compares with Megatron-LM on validation perplexity.

Figure 1: Comparison of the validation perplexity of the Megatron 8B-parameter model (orange line) and the T-NLG 17B model during training (blue and green lines). The dashed line represents the lowest validation loss achieved by the current public state-of-the-art model. The transition from blue to green indicates where T-NLG overtakes the public state of the art.
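For readers who want to run this style of evaluation on a publicly available model, the following minimal sketch computes perplexity for a causal language model with Hugging Face transformers, again using GPT-2 as a stand-in since T-NLG is not released. Note that reported WikiText-103 numbers are word-level perplexities with benchmark-specific preprocessing, whereas this sketch returns subword-level perplexity, so values are not directly comparable to the table.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

def perplexity(text: str, max_len: int = 1024) -> float:
    # Tokenize and truncate to the model's context window.
    ids = tokenizer(text, return_tensors="pt").input_ids[:, :max_len]
    with torch.no_grad():
        # With labels provided, the model returns the mean cross-entropy
        # (negative log-likelihood) over the predicted tokens.
        nll = model(ids, labels=ids).loss
    return math.exp(nll.item())

print(perplexity("Turing-NLG is a 17-billion-parameter language model by Microsoft."))
```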
Direct question answering and zero shot question capabilities

Many web search users are accustomed to seeing a direct answer card displayed at the top of the results page when they ask a question. Most of those cards show an answer sentence within the context of the paragraph it originated from. Our goal is to more plainly satisfy users' information needs by responding directly to their question. For instance, most search engines would highlight the name "Tristan Prettyman" below when showing the full passage (see example below).