Natural language understanding (NLU) is one of the longest-running goals in AI, and SuperGLUE is currently among the most challenging benchmarks for evaluating NLU models. The benchmark consists of a wide range of NLU tasks, including question answering, natural language inference, co-reference resolution, word sense disambiguation, and others. Take the causal reasoning task (COPA in Figure 1) as an example. Given the premise “the child became immune to the disease” and the question “what’s the cause for this?,” the model is asked to choose an answer from two plausible candidates: 1) “he avoided exposure to the disease” and 2) “he received the vaccine for the disease.” While it is easy for a human to choose the right answer, it is challenging for an AI model. To get the right answer, the model needs to understand the causal relationship between the premise and those plausible options.
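To make the task format concrete, here is a minimal sketch of how a COPA instance could be represented and turned into scorable premise/choice pairs. The field names follow SuperGLUE’s COPA conventions, but the `to_candidate_inputs` helper and the scoring setup it hints at are illustrative assumptions, not part of the official benchmark code.

```python
# A minimal sketch of the COPA example discussed above.
copa_example = {
    "premise": "The child became immune to the disease.",
    "question": "cause",  # COPA asks for either the "cause" or the "effect"
    "choice1": "He avoided exposure to the disease.",
    "choice2": "He received the vaccine for the disease.",
    "label": 1,  # 0-based index of the correct choice (here: choice2)
}

def to_candidate_inputs(example):
    """Turn one COPA instance into two premise/choice text pairs.

    A fine-tuned classifier (for example, a PLM with a multiple-choice head)
    would score each pair and pick the higher-scoring choice.
    """
    connector = "because" if example["question"] == "cause" else "so"
    pairs = []
    for choice in (example["choice1"], example["choice2"]):
        pairs.append(f"{example['premise']} {connector} {choice}")
    return pairs

print(to_candidate_inputs(copa_example))
```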
Since its release in 2019, top research teams around the world have been developing large-scale pretrained language models (PLMs) that have driven striking performance improvements on the SuperGLUE benchmark. Microsoft recently updated the DeBERTa model by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass human performance on SuperGLUE for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE benchmark rankings, outperforming the human baseline by a decent margin (90.3 versus 89.8). The model also sits at the top of the GLUE benchmark rankings with a macro-average score of 90.8.
Microsoft will release the 1.5-billion-parameter DeBERTa model and the source code to the public. In addition, DeBERTa is being integrated into the next version of the Microsoft Turing natural language representation model (Turing NLRv4). Our Turing models consolidate language innovation from across Microsoft and are then trained at large scale to support products like Bing, Office, Dynamics, and Azure Cognitive Services, powering a wide range of scenarios involving human-machine and human-human interactions via natural language (such as chatbots, recommendation, question answering, search, personal assistance, customer support automation, content generation, and others) to benefit hundreds of millions of users through the Microsoft AI at Scale initiative.
DeBERTa (Decoding-enhanced BERT with disentangled attention) is a Transformer-based neural language model pretrained on large amounts of raw text corpora using self-supervised learning. Like other PLMs, DeBERTa is intended to learn universal language representations that can be adapted to various downstream NLU tasks. DeBERTa improves on previous state-of-the-art PLMs (for example, BERT, RoBERTa, UniLM) using three novel techniques (illustrated in Figure 2): a disentangled attention mechanism, an enhanced mask decoder, and a virtual adversarial training method for fine-tuning.
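The disentangled attention mechanism and the enhanced mask decoder are described in the following paragraphs. As a rough illustration of the third technique, the sketch below shows the general idea of virtual adversarial training during fine-tuning: perturb the input embeddings in the direction that most changes the model’s predictions and penalize that change. It assumes a Hugging Face-style classification model that accepts `inputs_embeds` and returns `.logits`; DeBERTa’s actual algorithm (scale-invariant fine-tuning) normalizes the word embeddings before perturbing them and differs in details.

```python
import torch
import torch.nn.functional as F

def virtual_adversarial_loss(model, embeddings, attention_mask,
                             eps=1e-3, xi=1e-6):
    """One-step virtual adversarial perturbation on input embeddings (sketch)."""
    with torch.no_grad():
        clean_logits = model(inputs_embeds=embeddings,
                             attention_mask=attention_mask).logits

    # Start from a small random direction and refine it with one
    # power-iteration step toward the locally most damaging perturbation.
    d = xi * F.normalize(torch.randn_like(embeddings), dim=-1)
    d.requires_grad_()
    adv_logits = model(inputs_embeds=embeddings + d,
                       attention_mask=attention_mask).logits
    kl = F.kl_div(F.log_softmax(adv_logits, dim=-1),
                  F.softmax(clean_logits, dim=-1), reduction="batchmean")
    (grad,) = torch.autograd.grad(kl, d)
    r_adv = eps * F.normalize(grad.detach(), dim=-1)

    # Consistency loss: predictions should not change under the perturbation.
    adv_logits = model(inputs_embeds=embeddings + r_adv,
                       attention_mask=attention_mask).logits
    return F.kl_div(F.log_softmax(adv_logits, dim=-1),
                    F.softmax(clean_logits, dim=-1), reduction="batchmean")
```

This regularization term is added to the usual task loss during fine-tuning, encouraging the model’s outputs to be stable under small perturbations of the input representations.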
Unlike BERT, where each word in the input layer is represented using a single vector that sums its word (content) embedding and its position embedding, each word in DeBERTa is represented using two vectors that encode its content and position, respectively. The attention weights among words are then computed using disentangled matrices based on their contents and relative positions. This is motivated by the observation that the attention weight (which measures the strength of word-word dependency) of a word pair depends not only on their contents but also on their relative positions. For example, the dependency between the words “deep” and “learning” is much stronger when they occur next to each other than when they occur in different sentences.
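The sketch below illustrates this idea for a single attention head: the attention score for a word pair is the sum of content-to-content, content-to-position, and position-to-content terms, each computed from separate projections of the content vectors and the relative-position embeddings. Variable names, the bucketing of relative distances (`rel_idx`), and the single-sequence shapes are simplifying assumptions; DeBERTa’s released implementation is more elaborate.

```python
import torch

def disentangled_attention_scores(content, rel_pos_emb, rel_idx,
                                  Wq_c, Wk_c, Wq_r, Wk_r):
    """Single-head sketch of disentangled attention scores.

    content:     (seq_len, d)   content vectors for one sequence
    rel_pos_emb: (2k, d)        embeddings for bucketed relative distances
    rel_idx:     (seq_len, seq_len) long tensor; rel_idx[i, j] is the
                 relative-distance bucket of token j with respect to token i
    W*:          (d, d)         projection matrices for content (c) / position (r)
    """
    q_c, k_c = content @ Wq_c, content @ Wk_c          # content projections
    q_r, k_r = rel_pos_emb @ Wq_r, rel_pos_emb @ Wk_r  # position projections

    c2c = q_c @ k_c.T                                  # content-to-content
    c2p = torch.gather(q_c @ k_r.T, 1, rel_idx)        # content-to-position
    p2c = torch.gather(k_c @ q_r.T, 1, rel_idx).T      # position-to-content

    d = content.size(-1)
    return (c2c + c2p + p2c) / (3 * d) ** 0.5          # scaled attention scores
```

A softmax over each row of these scores would then produce the attention weights used to mix the content vectors, just as in standard self-attention.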
Like BERT, DeBERTa is pretrained using masked language modeling (MLM). MLM is a fill-in-the-blank task, where a model is taught to use the words surrounding a mask token to predict what the masked word should be. DeBERTa uses the content and position information of the context words for MLM. The disentangled attention mechanism already considers the contents and relative positions of the context words, but not the absolute positions of these words, which in many cases are crucial for the prediction.
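For readers unfamiliar with MLM, the snippet below sketches the standard masking recipe popularized by BERT: select roughly 15% of token positions, corrupt most of them, and compute the loss only at those positions. The function name and the handling of special tokens are simplifications; DeBERTa’s actual pretraining pipeline may differ in its exact corruption details.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Create masked-language-modeling inputs and labels (BERT-style recipe).

    Of the selected positions, ~80% are replaced with the mask token,
    ~10% with a random token, and ~10% are left unchanged. Special tokens
    (CLS/SEP/PAD) are not protected here for brevity.
    """
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100  # ignored by PyTorch's cross-entropy loss

    corrupted = input_ids.clone()
    masked = selected & (torch.rand(input_ids.shape) < 0.8)
    corrupted[masked] = mask_token_id

    randomized = selected & ~masked & (torch.rand(input_ids.shape) < 0.5)
    corrupted[randomized] = torch.randint(
        vocab_size, (int(randomized.sum()),), dtype=corrupted.dtype)
    return corrupted, labels
```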
Consider the sentence “a new store opened beside the new mall” with the words “store” and “mall” masked for prediction. Although the local contexts of the two words are similar, they play different syntactic roles in the sentence. (Here, the subject of the sentence is “store,” not “mall,” for example.) These syntactic nuances depend, to a large degree, upon the words’ absolute positions in the sentence, and so it is important to account for a word’s absolute position in the language modeling process. DeBERTa incorporates absolute word position embeddings right before the softmax layer, where the model decodes the masked words based on the aggregated contextual embeddings of word contents and positions.
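As a rough illustration of this idea, the sketch below adds absolute position embeddings to the encoder’s output states just before the vocabulary softmax, rather than at the input layer. The module structure, layer names, and sizes are assumptions for illustration and do not reproduce DeBERTa’s exact enhanced mask decoder.

```python
import torch
import torch.nn as nn

class EnhancedMaskDecoder(nn.Module):
    """Sketch: decode masked tokens with absolute positions injected late."""

    def __init__(self, hidden_size, vocab_size, max_positions=512):
        super().__init__()
        self.abs_pos_emb = nn.Embedding(max_positions, hidden_size)
        self.transform = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
        )
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden) from the relative-position encoder
        positions = torch.arange(hidden_states.size(1),
                                 device=hidden_states.device)
        h = hidden_states + self.abs_pos_emb(positions)  # add absolute positions
        return self.lm_head(self.transform(h))           # logits over the vocabulary
```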