Turing-NLG: A 17-billion-parameter language model by Microsoft
By Corby Rosset, Microsoft Applied Scientist
February 14, 2020
This figure was adapted from a similar image published in DistilBERT.
Massive deep learning language models (LM), such as BERT and GPT-2, with billions of parameters learned from essentially all the text published on the internet, have improved the state of the art on nearly every downstream natural language processing (NLP) task, including question answering, conversational agents, and document understanding among others.
Better natural language generation can be transformational for a variety of applications, such as assisting authors with composing their content, saving one time by summarizing a long piece of text, or improving customer experience with digital assistants. Following the trend that larger natural language models lead to better results, Microsoft Project Turing is introducing Turing Natural Language Generation (T-NLG), the largest model ever published at 17 billion parameters, which outperforms the state of the art on a variety of language modeling benchmarks and also excels when applied to numerous practical tasks, including summarization and question answering. This work would not be possible without breakthroughs produced by the DeepSpeed library (compatible with PyTorch) and ZeRO optimizer, which can be explored more in this accompanying blog post.
We are releasing a
private demo of T-NLG, including its freeform generation, question
answering, and summarization capabilities, to a small set of users
within the academic community for initial testing and feedback.
T-NLG: Benefits of a large generative language model
T-NLG is a Transformer-based generative language model, which means it can generate words to complete open-ended textual tasks. In addition to completing an unfinished sentence, it can generate direct answers to questions and summaries of input documents.
Generative models like T-NLG are important for NLP tasks since our goal is to respond as directly, accurately, and fluently as humans can in any situation. Previously, systems for question answering and summarization relied on extracting existing content from documents that could serve as a stand-in answer or summary, but they often appear unnatural or incoherent. With T-NLG we can naturally summarize or answer questions about a personal document or email thread.
We have observed that the bigger the model and the more diverse and comprehensive the pretraining data, the better it performs at generalizing to multiple downstream tasks even with fewer training examples. Therefore, we believe it is more efficient to train a large centralized multi-task model and share its capabilities across numerous tasks rather than train a new model for every task individually.
Pretraining T-NLG: Hardware and software breakthroughs
Any model with more than 1.3 billion parameters cannot fit into a single GPU (even one with 32GB of memory), so the model itself must be parallelized, or broken into pieces, across multiple GPUs. We took advantage of several hardware and software breakthroughs to achieve training T-NLG:
1. We leverage a NVIDIA DGX-2 hardware setup, with InfiniBand connections so that communication between GPUs is faster than previously achieved.
2. We apply tensor slicing to shard the model across four NVIDIA V100 GPUs on the NVIDIA Megatron-LM framework.
3. DeepSpeed with ZeRO allowed us to reduce the model-parallelism degree (from 16 to 4), increase batch size per node by fourfold, and reduce training time by three times. DeepSpeed makes training very large models more efficient with fewer GPUs, and it trains at batch size of 512 with only 256 NVIDIA GPUs compared to 1024 NVIDIA GPUs needed by using Megatron-LM alone. DeepSpeed is compatible with PyTorch.
The resulting T-NLG model has 78 Transformer layers with a hidden size of 4256 and 28 attention heads. To make results comparable to Megatron-LM, we pretrained the model with the same hyperparameters and learning schedule as Megatron-LM using autoregressive generation loss for 300,000 steps of batch size 512 on sequences of 1024 tokens. The learning schedule followed 3,200 steps of linear warmup up to a maximum learning rate of 1.5×10-4 and cosine decay over 500,000 steps, with FP16. We trained the model on the same type of data that Megatron-LM models were trained on.
We also compared the performance of the pretrained T-NLG model on standard language tasks such as WikiText-103 perplexity (lower is better) and LAMBADA next word prediction accuracy (higher is better). The table below shows that we achieve the new state of the art on both LAMBADA and WikiText-103. Megatron-LM is the publicly released results from the NVIDIA Megatron model.
Open AI used additional processing (stopword filtering) to achieve higher numbers than the model achieved alone. Neither Megatron nor T-NLG use this stopword filtering technique.
Figure 1 below shows how T-NLG performs when compared with Megatron-LM on validation perplexity.
Direct question answering and zero shot question capabilities
Many web search users are accustomed to seeing a direct answer card displayed at the top of the results page when they ask a question. Most of those cards show an answer sentence within the context of the paragraph it originated from. Our goal is to more plainly satisfy users’ information needs by responding directly to their question. For instance, most search engines would highlight the name “Tristan Prettyman” below when showing the full passage (see example below).
Instead, T-NLG will directly answer the question with a complete sentence. This capability is more important outside of web search—for example, this can power AI assistants to intelligently respond when a user asks a question about their personal data such as emails or Word documents.
The model is also capable of “zero shot” question answering, meaning answering without a context passage. For the examples below, there was no passage given to the model, just the question. In these cases, the model relies on knowledge gained during pretraining to generate an answer.
Since ROUGE scores with the ground truth answer don’t capture other aspects of quality, like factual correctness and grammatical correctness, we asked human annotators to evaluate those qualities for our previous baseline system—an LSTM model similar to CopyNet—and our current T-NLG model. There is still work to be done to enable automatic evaluation of factual correctness.
We also note that a larger pretrained model requires fewer instances of downstream tasks to learn them well. We only had, at most, 100,000 examples of “direct” answer question-passage-answer triples, and even after only a few thousand instances of training, we had a model that outperformed the LSTM baseline that was trained on multiple epochs of the same data. This observation has real business impact, since it is expensive to collect annotated supervised data.
Abstractive summarization with less supervision
There are two types of summarization in the NLP literature: extractive—taking a small number of sentences from the document as a surrogate of a summary—and abstractive—generating a summary with an NLG model as a human would. Rather than copying existing content, our goal for T-NLG is to write human-like abstractive summaries for a wide range of text documents: emails, blog posts, Word documents, and even Excel sheets and PowerPoint presentations. One of the main challenges is a lack of supervised training data for all these scenarios: humans don’t always explicitly summarize each of these document types. The power of T-NLG is that it is already so adept at understanding text that it doesn’t need much supervision to outperform all the techniques we’ve employed previously.
To make T-NLG as
versatile as possible for summarizing different types of text, we
finetuned the T-NLG model in a multi-task fashion on nearly all publicly
available summarization datasets, amounting to approximately four
million training instances. We report
(a proxy for how well the generated summary exactly matches the unigrams
and bigrams in a reference summary) to compare with another recent
Transformer-based language model known as
and previous state of the art models.
T-NLG future applications
T-NLG has advanced the state of the art in natural language generation, providing new opportunities for Microsoft and our customers. Beyond saving our users time by summarizing documents and emails, T-NLG can enhance experiences with the Microsoft Office suite by offering writing assistance to authors and answering questions that readers may ask about a document. Furthermore, it paves the way for more fluent chatbots and digital assistants, as natural language generation can help businesses with customer relationship management and sales by conversing with customers. We are excited by the new possibilities as we continue to advance the quality of language models.
About Project Turing: T-NLG is part of a larger initiative called Project Turing, an applied research group that works to evolve Microsoft products with the adoption of deep learning for both text and image processing. Our work is actively being integrated into multiple Microsoft products including Bing, Office, and Xbox. If you are excited about cutting-edge deep learning research and applications in NLP, or want to learn more, please see our careers page.
If you would like to nominate your organization for a private preview of Semantic Search by Project Turing, submit here.