{"id":638685,"date":"2020-02-26T16:24:46","date_gmt":"2020-02-22T01:06:09","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=638685"},"modified":"2020-02-26T16:24:47","modified_gmt":"2020-02-27T00:24:47","slug":"expanding-concept-understanding-in-microsoft-academic-graph","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/expanding-concept-understanding-in-microsoft-academic-graph\/","title":{"rendered":"Expanding Concept Understanding in Microsoft Academic Graph"},"content":{"rendered":"

With advancements in science and technology happening at a mind-boggling pace, the formal recognition and classification of new concepts and fields of study is a constant struggle. Over the past year the Microsoft Academic Graph (MAG) team has tried to tackle this problem head-on with changes to both how we find and how we categorize new fields of study.<\/p>\n

These efforts have resulted in MAG now understanding over 700k fields of study, with nearly 75% included in our field of study hierarchy. That marks a 3x increase from our most recent disclosure in our 2018 ACL demo paper (opens in new tab)<\/span><\/a>.<\/p>\n

New Fields of Study<\/h2>\n

Since the publication of our 2018 ACL demo paper (opens in new tab)<\/span><\/a>, we have released two new field of study batches. The first batch is the result of a one-time effort that seeded concepts using the Unified Medical Language System (UMLS) (opens in new tab)<\/span><\/a> vocabulary, allowing MAG to dramatically increase its understanding of biomedical research. The second batch is the result of an ongoing effort that allows us to automatically discover and understand new concepts directly from academic literature, without the need to use pre-existing vocabulary seeds such as Wikipedia or UMLS.<\/p>\n

Metrics for each release:<\/p>\n

\"\"<\/p>\n

January 2019 update \u2013 Concepts seeded from Unified Medical Language System (UMLS)<\/h3>\n

The Unified Medical Language System (UMLS) (opens in new tab)<\/span><\/a> is a repository of biomedical vocabularies developed by the US National Library of Medicine (NLM) (opens in new tab)<\/span><\/a> with sources from multiple datasets and standards (opens in new tab)<\/span><\/a>. The latest 2019AB release (opens in new tab)<\/span><\/a> contains more than 4 million medical concepts.<\/p>\n

Large, complex data sources such as this typically have inherent data quality limitations. For UMLS these include structural inconsistencies such as cycles in the graph hierarchy, semantic inconsistencies between different vocabularies, and missing hierarchical relationships. See this journal article (opens in new tab)<\/span><\/a> and this presentation (opens in new tab)<\/span><\/a>, both from the UMLS authors at NLM, for more detailed information.<\/p>\n

To account for these issues, we conducted a rigorous process to determine which UMLS concepts met the bar to be included in MAG:<\/p>\n

    \n
  1. Generated term frequency (TF) metrics for each UMLS concept in the full MAG document corpus<\/li>\n
  2. Narrowed concepts to those not already in MAG, but with enough coverage in MAG documents<\/li>\n
  3. Isolated paragraphs containing the concepts from documents on reputable websites (e.g. well-known publishers and academic news sources) with the help of the Bing index<\/li>\n
  4. Applied concept modeling (as described here (opens in new tab)<\/span><\/a>) to generate descriptions<\/li>\n<\/ol>\n
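The first two steps of this process can be illustrated with a minimal sketch. It assumes plain-text documents and simple case-insensitive phrase matching; the function names, the `min_count` threshold and the sample data are hypothetical and are not taken from the MAG pipeline.

```python
import re
from collections import Counter

def term_frequencies(concepts, documents):
    # Total number of whole-phrase, case-insensitive occurrences of each concept.
    patterns = {c: re.compile(r'\b' + re.escape(c.lower()) + r'\b') for c in concepts}
    counts = Counter()
    for doc in documents:
        text = doc.lower()
        for concept, pattern in patterns.items():
            counts[concept] += len(pattern.findall(text))
    return counts

def select_candidates(concepts, existing_fields, documents, min_count=2):
    # Keep concepts that are not yet known fields of study but occur often enough.
    # min_count is an illustrative threshold, not a value used by MAG.
    freq = term_frequencies(concepts, documents)
    known = {name.lower() for name in existing_fields}
    return sorted(c for c in concepts
                  if c.lower() not in known and freq[c] >= min_count)

# Toy data, invented for illustration only.
documents = [
    'Atrial fibrillation increases stroke risk. Atrial fibrillation is common.',
    'We study atrial fibrillation and machine learning.',
    'Machine learning for rare disease X.',
]
seed_concepts = ['Atrial fibrillation', 'Machine learning', 'Rare disease X']
existing_fields = ['Machine learning']

candidates = select_candidates(seed_concepts, existing_fields, documents)
print(candidates)
```

In this toy run only the concept that is both new and mentioned at least twice survives the filter; a real corpus would need tokenization and synonym handling that the sketch leaves out.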

    This resulted in over 435k new, high-quality biomedical concepts being identified and ingested as fields of study in MAG. To get a better sense of how much this benefited MAG's understanding of Biology and Medicine, it is worth looking at the field of study distribution across top-level fields over the past few years:<\/p>\n

    \"\"<\/p>\n

    Not surprisingly, Biology, Medicine and Chemistry are the fields that benefited most from the new concepts seeded using the UMLS vocabularies.<\/p>\n

    November\/December 2019 update \u2013 Detection of emerging concepts from academic documents<\/h3>\n

    The volume of new research being published is rapidly increasing, with MAG adding over 1 million new papers every month. This creates a unique challenge, as the new research comes along with a rich, ever-evolving set of emerging concepts. Identifying, describing and categorizing these new concepts in a timely fashion is an incredibly difficult task, which makes timely inclusion in the data sources we traditionally use for seeds (Wikipedia, UMLS) virtually impossible.<\/p>\n

    To tackle this challenge, we have come up with a two-stage approach that allows us to extract both known and emerging concepts directly from MAG documents:<\/p>\n

      \n
    1. Analyze document vocabulary (words, phrases) to identify field of study mentions; this is a binary label that indicates only that a given word\/phrase should map to some field of study, not which one<\/li>\n
    2. Run a classifier to map mentions to specific fields of study; these can be either existing or new fields<\/li>\n<\/ol>\n

      In the first stage, we formulate the concept detection as a self-supervised sequence labeling (opens in new tab)<\/span><\/a> problem. On a sampled set of MAG documents, we do lexical matching using the synonyms of our existing fields of study, which allows us to generate a binary label for each word indicating if we think it mentions a field of study or not. After generating these training labels, we fine-tune a transformer-based BERT model (opens in new tab)<\/span><\/a> (e.g. BERT base) as a context encoder, and use a Conditional Random Field (CRF) (opens in new tab)<\/span><\/a> layer as a tag decoder to train a binary classifier on each word in a sentence to detect field of study mentions. We then infer field of study mentions using the trained model on a larger set of high-quality MAG documents, i.e. those published in prestigious journals\/conferences.<\/p>\n
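To make the label-generation step concrete, here is a minimal sketch of lexical matching that produces one binary label per word. It assumes whitespace-tokenized text and a greedy longest-match rule; `weak_labels`, `max_len` and the sample synonyms are hypothetical. The resulting labels would serve as training targets for the BERT and CRF tagger, which is not shown.

```python
def weak_labels(tokens, synonyms, max_len=5):
    # Label each token 1 if it is part of a phrase that matches a known
    # field of study synonym, else 0. Longest matches are preferred.
    lowered = [t.lower() for t in tokens]
    vocab = {tuple(s.lower().split()) for s in synonyms}
    labels = [0] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in vocab:
                matched = n
                break
        if matched:
            for j in range(i, i + matched):
                labels[j] = 1
            i += matched
        else:
            i += 1
    return labels

# Toy sentence and synonym list, invented for illustration only.
tokens = 'We train a graph neural network for link prediction'.split()
synonyms = ['graph neural network', 'neural network', 'link prediction']
labels = weak_labels(tokens, synonyms)
print(list(zip(tokens, labels)))
```

Because the labels come from existing synonyms rather than human annotation, they are noisy; the trained tagger is what generalizes from context to mentions that no synonym list contains.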

      During the second stage, we classify the field of study mentions detected in the first stage into three broad categories:<\/p>\n

        \n
      1. Existing concept<\/li>\n
      2. New concept<\/li>\n
      3. Low-quality word\/phrase<\/li>\n<\/ol>\n

        This is accomplished by searching for each mention using the Bing Web Search API (opens in new tab)<\/span><\/a> and clustering mentions into field of study \u201cidentities\u201d based on the URL relevance\/reputation and the consistency of the mention among top search results.<\/p>\n
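The following toy heuristic illustrates the general idea of grouping mentions by their search results. It is only a sketch under strong assumptions: search results are taken as already fetched (no Bing API call is made), and the `REPUTABLE` domain list, the `min_consistency` threshold, the example URLs and the decision rule itself are all hypothetical rather than the method MAG uses.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical allow-list of reputable domains.
REPUTABLE = {'encyclopedia.example', 'papers.example'}

def classify_mentions(results_by_mention, known_urls, min_consistency=0.5):
    # results_by_mention maps a mention to a list of {'url', 'title'} dicts.
    # known_urls holds URLs already associated with existing fields of study.
    categories = {}
    identities = defaultdict(list)  # shared URL -> mentions grouped under it
    for mention, results in results_by_mention.items():
        reputable = [r for r in results if urlparse(r['url']).netloc in REPUTABLE]
        # Consistency: share of results whose title contains the mention.
        hits = [r for r in results if mention.lower() in r['title'].lower()]
        consistency = len(hits) / len(results) if results else 0.0
        if not reputable or consistency < min_consistency:
            categories[mention] = 'low-quality'
            continue
        key = reputable[0]['url']
        identities[key].append(mention)
        categories[mention] = 'existing' if key in known_urls else 'new'
    return categories, dict(identities)

# Toy search results, invented for illustration only.
ml_url = 'https://encyclopedia.example/machine-learning'
ne_url = 'https://encyclopedia.example/network-embedding'
results = {
    'machine learning': [{'url': ml_url, 'title': 'Machine learning'}],
    'ML': [{'url': ml_url, 'title': 'ML (machine learning)'}],
    'network embedding': [
        {'url': ne_url, 'title': 'Network embedding'},
        {'url': 'https://blog.example/post', 'title': 'Network embedding explained'},
    ],
    'proposed method': [{'url': 'https://blog.example/x', 'title': 'Our proposed method'}],
}
categories, identities = classify_mentions(results, known_urls={ml_url})
print(categories)
print(identities)
```

Mentions that resolve to the same reputable URL end up under one identity, which is how a phrase and its acronym can be merged in this sketch.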

        To ensure that this approach works for documents across various scientific domains, we conducted experiments training our model on documents from a single top-level domain (e.g. computer science) and on documents from mixed domains (e.g. computer science, biology). We observed that models trained on a single domain generate higher-quality mentions than models trained on a mixture of domains.<\/p>\n

        Based on this outcome, we applied this method to MAG documents from the Computer Science domain in November\/December 2019. The result is over 45k new fields of study being identified, described and categorized in MAG.<\/p>\n

        An example of one of these new fields of study is Graph Neural Networks (opens in new tab)<\/span><\/a> (GNN). As shown in the snapshot below, it is tagged with the most relevant and influential works in the GNN domain, such as node2vec and Graph Attention Networks (GAT); MAG also successfully identified authors who pioneered the field (e.g. Jure Leskovec), together with other highly related (and newly discovered!) fields of study such as Network Embedding (opens in new tab)<\/span><\/a> and Network Representation Learning (opens in new tab)<\/span><\/a>.<\/p>\n

        \"\"<\/p>\n

        We have also enabled acronym detection for these new emerging concepts; you can see the benefit in Microsoft Academic\u2019s query formulation experience:<\/p>\n

        \"\"<\/p>\n

        Field of Study Hierarchy Updates<\/h2>\n

        Whenever we generate major new field of study updates for MAG, we also reconstruct our field of study hierarchy to include the new fields. To accomplish this, we manually curate the two top-most levels of the hierarchy to ensure accuracy, and then use the subsumption-based model described here to generate the remaining levels.<\/p>\n

        \"\"<\/p>\n

        Unfortunately, one known limitation of the subsumption-based model is missing relationships among fields of study with sparse references in the document corpus. In this case we treat the field of study as an orphan, meaning it does not have sufficient evidence to identify an immediate parent or child hierarchical relationship, which is a requirement for placement in MAG\u2019s formal hierarchy (FieldOfStudyChildren) (opens in new tab)<\/span><\/a>. As shown in the table above, this occurs for approximately 25% of all fields of study.<\/p>\n
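To show why sparsely referenced fields end up as orphans, here is a minimal sketch of a generic co-occurrence subsumption rule: a field is treated as a parent when it appears in most documents that mention the child, but not the other way around. This is a textbook-style simplification and may differ from the model MAG uses; the `threshold` and `min_docs` values and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def subsumption_hierarchy(doc_fields, threshold=0.8, min_docs=2):
    # doc_fields is a list of field-of-study lists, one per document.
    doc_count = Counter()
    pair_count = Counter()
    for fields in doc_fields:
        unique = set(fields)
        doc_count.update(unique)
        for a, b in combinations(sorted(unique), 2):
            pair_count[(a, b)] += 1

    parents = {}
    for child in doc_count:
        if doc_count[child] < min_docs:
            continue  # too few references to support any relationship
        best = None
        for parent in doc_count:
            if parent == child:
                continue
            co = pair_count[tuple(sorted((parent, child)))]
            p_parent_given_child = co / doc_count[child]
            p_child_given_parent = co / doc_count[parent]
            if (p_parent_given_child >= threshold
                    and p_child_given_parent < p_parent_given_child):
                if best is None or p_parent_given_child > best[1]:
                    best = (parent, p_parent_given_child)
        if best is not None:
            parents[child] = best[0]

    # Orphans have neither a parent nor a child in the inferred hierarchy.
    linked = set(parents) | set(parents.values())
    orphans = sorted(f for f in doc_count if f not in linked)
    return parents, orphans

# Toy corpus, invented for illustration only.
doc_fields = [
    ['Computer science', 'Machine learning'],
    ['Computer science', 'Machine learning'],
    ['Computer science', 'Databases'],
    ['Computer science', 'Databases'],
    ['Computer science'],
    ['Quantum widget'],
]
parents, orphans = subsumption_hierarchy(doc_fields)
print(parents)
print(orphans)
```

In the toy corpus the field that appears in a single document never co-occurs with anything, so no parent or child can be inferred for it and it is reported as an orphan.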

        When this happens we still attempt to identify an appropriate hierarchical level for each orphaned field of study:<\/p>\n