{"id":688002,"date":"2020-08-31T10:00:46","date_gmt":"2020-08-31T17:00:46","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=688002"},"modified":"2020-10-06T16:11:55","modified_gmt":"2020-10-06T23:11:55","slug":"domain-specific-language-model-pretraining-for-biomedical-natural-language-processing","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/domain-specific-language-model-pretraining-for-biomedical-natural-language-processing\/","title":{"rendered":"Domain-specific language model pretraining for biomedical natural language processing"},"content":{"rendered":"\n
COVID-19 highlights a perennial problem facing scientists around the globe: how do we stay up to date with the cutting edge of scientific knowledge? In just a few months since the pandemic emerged, tens of thousands of research papers have been published concerning COVID-19 and the SARS-CoV-2 virus. This explosive growth sparked the creation of the COVID-19 Open Research Dataset (CORD-19) to facilitate research and discovery. However, a pandemic is just one salient example of a prevailing challenge for this community. PubMed, the standard repository for biomedical research articles, adds 4,000 new papers every day and over a million every year.
It is impossible to keep track of such rapid progress by manual effort alone. In the era of big data and precision medicine, the need has never been more urgent for natural language processing (NLP) methods that can help scientists stay on top of the deluge of information. NLP can help researchers quickly identify and cross-reference important findings at scale, in papers both directly and tangentially related to their own research, instead of having to sift through papers manually or recall relevant findings from memory.
In this blog post, we present our recent advances in pretraining neural language models for biomedical NLP. We question the prevailing assumption that pretraining on general-domain text is necessary and useful for specialized domains such as biomedicine. Instead, we show that biomedical text is very different from newswires and web text. By pretraining solely on biomedical text from scratch, our PubMedBERT model outperforms all prior language models and obtains new state-of-the-art results on a wide range of biomedical applications. To help accelerate progress in this vitally important area, we have created a comprehensive benchmark and released the first leaderboard for biomedical NLP. Our findings may also be applicable to other high-value domains, such as finance and law.

Pretrained neural language models are the underpinning of state-of-the-art NLP methods. Pretraining works by masking some words in the text and training a language model to predict them from the rest. The pretrained model can then be fine-tuned for various downstream tasks using task-specific training data. As in mainstream NLP, prior work on pretraining has largely focused on newswires and the web. For applications in such general domains, the topic is not known a priori, so it is advantageous to train a broad-coverage model on as much text as one can gather.
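To make the masked-prediction objective above concrete, here is a minimal sketch using the Hugging Face transformers library: it masks one word in a biomedical sentence and asks a BERT-style model to fill it in from context, which is the same prediction task used during pretraining. The checkpoint name and example sentence are assumptions for illustration; substitute whichever PubMedBERT release you are using. This is not the training code behind the results reported in this post.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed checkpoint name; substitute the PubMedBERT release you are using.
model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

# Mask one word in a biomedical sentence and ask the model to recover it,
# mirroring the pretraining objective: predict masked tokens from context.
text = f"The patient was treated with {tokenizer.mask_token} for the bacterial infection."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Locate the masked position and print the model's top-5 candidate fillers.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```

Fine-tuning then reuses the same pretrained encoder for a downstream task, for example by loading it with a task-specific head (such as AutoModelForTokenClassification for named entity recognition) and training on labeled task data.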
A new paradigm for building neural language models in biomedicine and specialized domains