{"id":638685,"date":"2020-02-26T16:24:46","date_gmt":"2020-02-22T01:06:09","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=638685"},"modified":"2020-02-26T16:24:47","modified_gmt":"2020-02-27T00:24:47","slug":"expanding-concept-understanding-in-microsoft-academic-graph","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/expanding-concept-understanding-in-microsoft-academic-graph\/","title":{"rendered":"Expanding Concept Understanding in Microsoft Academic Graph"},"content":{"rendered":"<p>With advancements in science and technology happening at a mind-boggling pace, the formal recognition and classification of new concepts and fields of study is a constant struggle. Over the past year the Microsoft Academic Graph (MAG) team has tried to tackle this problem head-on with changes to both how we find and how we categorize new fields of study.<\/p>\n<p>These efforts have resulted in MAG now understanding over 700k fields of study, with nearly 75% included in our field of study hierarchy. That marks a 3x increase from our most recent disclosure in our <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/academic.microsoft.com\/paper\/2963464979\">2018 ACL demo paper<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n<h2>New Fields of Study<\/h2>\n<p>Since the publication of our <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/academic.microsoft.com\/paper\/2963464979\">2018 ACL demo paper<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, we have released two new field of study batches. 
The first batch is the result of a one-time effort which seeded concepts using the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/en.wikipedia.org\/wiki\/Unified_Medical_Language_System\">Unified Medical Language System (UMLS)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> vocabulary, allowing MAG to dramatically increase its understanding of biomedical research. The second batch is the result of an on-going effort that allows us to automatically discover and understand new concepts directly from academic literature, without the need to use pre-existing vocabulary seeds such as Wikipedia or UMLS.<\/p>\n<p>Metrics for each release:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-638691\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/01.png\" alt=\"\" width=\"797\" height=\"202\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/01.png 797w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/01-300x76.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/01-768x195.png 768w\" sizes=\"(max-width: 797px) 100vw, 797px\" \/><\/p>\n<h3>January 2019 update \u2013 Concepts seeded from Unified Medical Language System (UMLS)<\/h3>\n<p>The <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.nlm.nih.gov\/research\/umls\/index.html\">Unified Medical Language System (UMLS)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> is a repository of biomedical vocabularies developed by the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/en.wikipedia.org\/wiki\/United_States_National_Library_of_Medicine\">US National Library of Medicine (NLM)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> with sources from <a class=\"msr-external-link 
glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.nlm.nih.gov\/research\/umls\/sourcereleasedocs\/index.html\">multiple datasets and standards<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. The latest <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.nlm.nih.gov\/research\/umls\/knowledge_sources\/metathesaurus\/release\/statistics.html\">2019AB release<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> contains more than 4 million medical concepts.<\/p>\n<p>Large, complex data sources such as this typically have numerous inherent data quality limitations. For UMLS these include structural inconsistencies such as cycles in the graph hierarchy, semantic inconsistencies between different vocabularies and missing hierarchical relationships. See <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/academic.microsoft.com\/paper\/2159583324\">this journal article<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/mor.nlm.nih.gov\/pubs\/pres\/20050828-MIE-tutorial.pdf\">this presentation<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, both from the UMLS authors at NLM, for more detailed information.<\/p>\n<p>To account for these issues, we conducted a rigorous process to determine which UMLS concepts met the bar to be included in MAG:<\/p>\n<ol>\n<li>Generated term frequency (TF) metrics for each UMLS concept in the full MAG document corpus<\/li>\n<li>Narrowed concepts to those not already in MAG, but with enough coverage in MAG documents<\/li>\n<li>Isolated paragraphs containing the concepts from documents on reputable websites (e.g. well-known publishers, academic news sources, etc.) 
with the help of the Bing index<\/li>\n<li>Applied concept modeling (as described <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.frontiersin.org\/articles\/10.3389\/fdata.2019.00045\/full\">here<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>) to generate descriptions<\/li>\n<\/ol>\n<p>This resulted in over 435k new, high-quality biomedical concepts being identified and ingested as fields of study in MAG. To get a better sense of how much this benefited MAG\u2019s understanding of Biology and Medicine, it is worth looking at the field of study distribution across top-level fields over the past few years:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-638712\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/08.png\" alt=\"\" width=\"791\" height=\"674\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/08.png 791w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/08-300x256.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/08-768x654.png 768w\" sizes=\"(max-width: 791px) 100vw, 791px\" \/><\/p>\n<p>Not surprisingly, Biology, Medicine and Chemistry are the fields that benefited most from the new concepts seeded using the UMLS vocabularies.<\/p>\n<h3>November\/December 2019 update \u2013 Detection of emerging concepts from academic documents<\/h3>\n<p>The volume of new research being published is rapidly increasing, with MAG adding over 1 million new papers every month. This creates a unique challenge, as the new research comes along with a rich, ever-evolving set of emerging concepts. 
Identifying, describing and categorizing these new concepts in a timely fashion is an incredibly difficult task, which makes timely inclusion in the data sources we traditionally use for seeds (Wikipedia, UMLS) virtually impossible.<\/p>\n<p>To tackle this challenge, we have come up with a two-stage approach that allows us to extract both known and emerging concepts directly from MAG documents:<\/p>\n<ol>\n<li>Analyze document vocabulary (words, phrases) to identify field of study mentions; this yields a binary label that does not indicate which field of study is mentioned, only that a given word\/phrase should map to one<\/li>\n<li>Run a classifier to map mentions to specific fields of study; these can be either existing or new fields<\/li>\n<\/ol>\n<p>In the first stage, we formulate concept detection as a self-supervised <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/academic.microsoft.com\/topic\/35639132\">sequence labeling<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> problem. On a sampled set of MAG documents, we do lexical matching using the synonyms of our existing fields of study, which allows us to generate a binary label for each word indicating whether we think it mentions a field of study. After generating these training labels, we fine-tune a transformer-based <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/academic.microsoft.com\/paper\/2896457183\">BERT model<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (e.g. BERT base) as a context encoder, and use a <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/academic.microsoft.com\/topic\/152565575\">Conditional Random Field (CRF)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> layer as a tag decoder to train a binary classifier on each word in a sentence to detect field of study mentions. 
We then infer field of study mentions using the trained model on a larger set of high-quality MAG documents, i.e. those published in prestigious journals\/conferences.<\/p>\n<p>During the second stage, we classify the field of study mentions detected in the first stage into three broad categories:<\/p>\n<ol>\n<li>Existing concept<\/li>\n<li>New concept<\/li>\n<li>Low-quality word\/phrase<\/li>\n<\/ol>\n<p>This is accomplished by searching for each mention using the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/bing-web-search-api\/\">Bing Web Search API<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and clustering mentions into field of study \u201cidentities\u201d based on the URL relevance\/reputation and the consistency of the mention among top search results.<\/p>\n<p>To ensure that this approach works for documents across various scientific domains, we conducted experiments training our model using documents in a single top domain (e.g. computer science) and with documents from mixed domains (e.g. computer science, biology). We observed that higher quality mentions are generated using models trained from a single domain rather than a mixture of multiple domains.<\/p>\n<p>Based on this outcome, we applied this method to MAG documents from the Computer Science domain in November\/December 2019. The result is over 45k new fields of study being identified, described and categorized in MAG.<\/p>\n<p>An example of one of these new fields of study is <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/academic.microsoft.com\/search?q=Graph%20neural%20networks&qe=And(Composite(F.FId%3D2989256011)%2CTy%3D%270%27)&f=&orderBy=0&skip=0&take=10\">Graph Neural Networks<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (GNN). 
As shown in the snapshot below, it is tagged with the most relevant and influential works in the GNN domain, such as node2vec and Graph Attention Networks (GAT); it also successfully identified the authors who pioneered the field (e.g. Jure Leskovec) together with other highly related (and newly discovered!) fields of study <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/academic.microsoft.com\/topic\/2984196740\">Network Embedding<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/academic.microsoft.com\/topic\/2988435680\">Network Representation Learning<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-638700\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/04-1024x734.png\" alt=\"\" width=\"1024\" height=\"734\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/04-1024x734.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/04-300x215.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/04-768x550.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/04.png 1319w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>We have also enabled acronym detection for these new emerging concepts, the benefit of which you can see in Microsoft Academic\u2019s query formulation experience:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-638703\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/05.png\" alt=\"\" width=\"352\" height=\"312\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/05.png 352w, 
https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/05-300x266.png 300w\" sizes=\"(max-width: 352px) 100vw, 352px\" \/><\/p>\n<h2>Field of Study Hierarchy Updates<\/h2>\n<p>Whenever we generate major new field of study updates for MAG, we also reconstruct our field of study hierarchy to include the new fields. To accomplish this, we manually curate the two top-most levels of the hierarchy to ensure accuracy, and then use the subsumption-based model described here to generate the remaining levels.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-638697\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/03.png\" alt=\"\" width=\"845\" height=\"451\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/03.png 845w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/03-300x160.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/03-768x410.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/03-710x380.png 710w\" sizes=\"(max-width: 845px) 100vw, 845px\" \/><\/p>\n<p>Unfortunately, one known limitation of the subsumption-based model is missing relationships among fields of study with sparse references in the document corpus.  In this case we treat the field of study as an orphan, meaning it does not have sufficient evidence to identify an immediate parent or child hierarchical relationship, which is a requirement for placement in <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/docs.microsoft.com\/en-us\/academic-services\/graph\/reference-data-schema#field-of-study-children\">MAG\u2019s formal hierarchy (FieldOfStudyChildren)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
As shown in the table above, this occurs for approximately 25% of all fields of study.<\/p>\n<p>When this happens we still attempt to identify an appropriate hierarchical level for each orphaned field of study:<\/p>\n<ul>\n<li>For ~90% of orphaned fields of study we lack sufficient data to associate the field of study with another field of study. These are associated with a default value of L3.<\/li>\n<li>For ~10% of orphaned fields of study we have sufficient data to associate the field of study with an L0 domain (e.g. Computer Science, Biology), but know that, based on the quantity of papers labeled with it, it does not qualify as an L1 field of study. These are associated as L2.<\/li>\n<li>For &lt;1% of orphaned fields of study, we have data to associate the field of study with an L3\/L4 field of study, but it is insufficient to form a concrete link. These are associated as L4\/L5.<\/li>\n<\/ul>\n<p>Regardless, equipped with the new 45K+ emerging concepts, our machine reading agents have been able to understand a significantly larger portion of MAG documents. Between the two MAG versions where the new emerging concepts were released (11-08-19 version vs. 01-10-20 version), we observed a 13.4% relative improvement (from 64.61% to 73.25%) in field of study coverage in the hierarchy.<\/p>\n<p>Another example that spans both the emerging field of study detection and hierarchy updates is <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/academic.microsoft.com\/topic\/2984196740\">network embedding<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, a research topic popularized in recent years among machine learning and deep learning communities. 
In our hierarchy, it is an L4 concept, with <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/academic.microsoft.com\/topic\/41608201\">embedding<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> as a parent field of study. As you can see in the screenshot below, its popularity as an emerging field of study is evident from the fast-growing counts of publications and citations since 2016.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-638706\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/06-1024x628.png\" alt=\"\" width=\"1024\" height=\"628\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/06-1024x628.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/06-300x184.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/06-768x471.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/06.png 1429w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>It is also shown as one of the top five trending topics under embedding.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-638709\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/07-1024x469.png\" alt=\"\" width=\"1024\" height=\"469\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/07-1024x469.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/07-300x137.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/07-768x351.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2020\/02\/07.png 1178w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<h3>Hierarchy data quality issue identified in December 2019\/January 2020<\/h3>\n<p><strong>Unfortunately, along with amazing progress also comes the 
occasional mistake.<\/strong><\/p>\n<p>During December 2019 and early January 2020, there were discussions on social media regarding some unexpected changes in relationships between L0 and L1 domains in our field of study hierarchy. These changes were the result of an unintended update due to backend engineering glitches and have been fixed as of our Jan-10-2020 graph release. We apologize for any issues this may have caused, and advise our customers <strong>not to use the L0 and L1 field of study hierarchy relationships<\/strong> in the following MAG versions:<\/p>\n<ul>\n<li>Nov-22-2019<\/li>\n<li>Dec-05-2019<\/li>\n<li>Dec-13-2019<\/li>\n<li>Dec-26-2019<\/li>\n<\/ul>\n<h2>We love feedback!<\/h2>\n<p>We would love to hear about your experiences in exploring the new expanded fields of study in MAG! Feel free to share your experiences using the feedback link at the bottom right of <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/academic.microsoft.com\/home\">Microsoft Academic<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, or, if so inclined, on Twitter @MSFTAcademic.<\/p>\n<p>Happy researching!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In-depth review of recent changes to the Microsoft Academic Graph (MAG) that enabled it to find and categorize 500k new fields of 
study<\/p>\n","protected":false},"author":36554,"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"msr-content-parent":170262,"footnotes":""},"research-area":[],"msr-locale":[268875],"class_list":["post-638685","msr-blog-post","type-msr-blog-post","status-publish","hentry","msr-locale-en_us"],"msr_assoc_parent":{"id":170262,"type":"project"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/638685"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/36554"}],"version-history":[{"count":7,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/638685\/revisions"}],"predecessor-version":[{"id":639645,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/638685\/revisions\/639645"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=638685"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=638685"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=638685"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}