{"id":688002,"date":"2020-08-31T10:00:46","date_gmt":"2020-08-31T17:00:46","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=688002"},"modified":"2020-10-06T16:11:55","modified_gmt":"2020-10-06T23:11:55","slug":"domain-specific-language-model-pretraining-for-biomedical-natural-language-processing","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/domain-specific-language-model-pretraining-for-biomedical-natural-language-processing\/","title":{"rendered":"Domain-specific language model pretraining for biomedical natural language processing"},"content":{"rendered":"\n
COVID-19 highlights a perennial problem facing scientists around the globe: how do we stay up to date with the cutting edge of scientific knowledge? In just a few months since the pandemic emerged, tens of thousands of research papers have been published concerning COVID-19 and the SARS-CoV-2 virus. This explosive growth sparked the creation of the COVID-19 Open Research Dataset (CORD-19) to facilitate research and discovery. However, a pandemic is just one salient example of a prevailing challenge for this community. PubMed, the standard repository for biomedical research articles, adds 4,000 new papers every day and over a million every year.
It is impossible to keep track of such rapid progress by manual effort alone. In the era of big data and precision medicine, the need has never been more urgent for natural language processing (NLP) methods that can help scientists stay on top of the deluge of information. NLP can help researchers quickly identify and cross-reference important findings at scale, in papers both directly and tangentially related to their own research, instead of having to sift through papers manually or recall relevant findings from memory.
In this blog post, we present our recent advances in pretraining neural language models for biomedical NLP. We question the prevailing assumption that pretraining on general-domain text is necessary and useful for specialized domains such as biomedicine. Instead, we show that biomedical text is very different from newswires and web text. By pretraining solely on biomedical text from scratch, our PubMedBERT model outperforms all prior language models and obtains new state-of-the-art results on a wide range of biomedical applications. To help accelerate progress in this vitally important area, we have created a comprehensive benchmark and released the first leaderboard for biomedical NLP. Our findings may also be applicable to other high-value domains, such as finance and law.

Pretrained neural language models are the underpinning of state-of-the-art NLP methods. Pretraining works by masking some words in the text and training a language model to predict them from the rest. The pretrained model can then be fine-tuned for various downstream tasks using task-specific training data. As in mainstream NLP, prior work on pretraining has largely focused on newswires and the web. For applications in such general domains, the topic is not known a priori, so it is advantageous to train a broad-coverage model on as much text as one can gather.
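To make the masked-prediction objective above concrete, here is a minimal sketch using the Hugging Face transformers library: it masks one word in a biomedical sentence and asks a BERT-style model to fill it in from context, which is the same prediction task used during pretraining. The checkpoint name and example sentence are assumptions for illustration; substitute whichever PubMedBERT release you are using. This is not the training code behind the results reported in this post.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed checkpoint name; substitute the PubMedBERT release you are using.
model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

# Mask one word in a biomedical sentence and ask the model to recover it,
# mirroring the pretraining objective: predict masked tokens from context.
text = f"The patient was treated with {tokenizer.mask_token} for the bacterial infection."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Locate the masked position and print the model's top-5 candidate fillers.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```

Fine-tuning then reuses the same pretrained encoder for a downstream task, for example by loading it with a task-specific head (such as AutoModelForTokenClassification for named entity recognition) and training on labeled task data.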
A new paradigm for building neural language models in biomedicine and specialized domains