{"id":643275,"date":"2020-03-16T12:13:33","date_gmt":"2020-03-16T19:13:33","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=643275"},"modified":"2020-03-20T12:20:22","modified_gmt":"2020-03-20T19:20:22","slug":"microsoft-academic-resources-and-their-application-to-covid-19-research","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/microsoft-academic-resources-and-their-application-to-covid-19-research\/","title":{"rendered":"Microsoft Academic resources and their application to COVID-19 research"},"content":{"rendered":"

This post will be updated as examples and data are added<\/em><\/p>\n

Given the circumstances surrounding the COVID-19 pandemic, we would like to give an overview of the services we provide, explain the focus of each, and offer working examples of how best to use our data to generate insight from coronavirus-related scholarly communications.<\/p>\n

We would also like to recognize that we are partnering with the Allen Institute for AI, Chan Zuckerberg Initiative (CZI), Georgetown University\u2019s Center for Security and Emerging Technology (CSET), and the National Library of Medicine (NLM) at the National Institutes of Health to produce an open research dataset of scholarly literature about COVID-19, SARS-CoV-2 and the coronavirus group.\u00a0 Please visit the link below to access the dataset that was released on Semantic Scholar today:<\/p>\n

COVID-19 Open Research Dataset (CORD-19)<\/span><\/a><\/p>\n

Over the past week our team has worked to update our graph with the most recent publications regarding COVID-19.\u00a0 With the support of Bing, we will double our MAG update frequency as well as publish a side stream linking WHO and PubMed publication IDs to MAG IDs.\u00a0\u00a0Side stream source code<\/em><\/span><\/a><\/p>\n

For anyone who would like to use our available services for further investigation, we provide the following summaries and examples:<\/p>\n

The Microsoft Academic Graph<\/span><\/a> is a heterogeneous graph of academic data.\u00a0 Its content is distributed by subscription as text files each time a new build is created.\u00a0 The graph can be used as is or merged with other public or private data sets.<\/p>\n

Primary use cases<\/em>:\u00a0 Data mining and long-running analytic processing for academic analytics or business intelligence.<\/p>\n

The Microsoft Academic Knowledge Exploration Service (MAKES)<\/span><\/a> and its predecessor, the Project Academic Knowledge API, were created to serve the need for indexing and rapid data retrieval from the MAG data set.<\/p>\n

Primary use cases:\u00a0 <\/em>Fast top-N entity retrieval for online scenarios such as dashboards or search applications.<\/p>\n

The Microsoft Academic website<\/span><\/a> is offered to the community as both a research tool and an example of what can be built with MAG data and MAKES.\u00a0 We have made our own modifications, just as consumers of MAG and MAKES may choose to do.\u00a0 In this way we use it as an experimental platform to test hypotheses and new features for both MAG and MAKES.<\/p>\n

Primary use cases: <\/em>Finding relevant research and analytics, and test-driving an end-to-end solution built on the services we provide.<\/p>\n

The Microsoft Academic Graph<\/h2>\n

Our approach to generating the data<\/em><\/strong><\/p>\n

As of the current release, the Microsoft Academic Graph (MAG) contains over 233 million publications and their related academic entities, e.g. authors, publishing venues, and associated concepts.\u00a0 Microsoft Research benefits from our partnership with Bing, which crawls the internet to discover research content from around the world so that we can update our graph regularly.\u00a0 Using the power of cloud computing, we have built a pipeline in which Bing data and other sources are filtered and analyzed.\u00a0 Using AI techniques, we disambiguate and conflate entities, then rank them and place them in a taxonomy within the graph. As stated in our recent paper<\/span><\/a>: \u201cThe project pushes the boundary of machine cognition technology by deploying software agents trained with natural language understanding capabilities to continuously scavenge the Web for research artifacts and, from them, extract up-to-date academic knowledge into a graph based representation called Microsoft Academic Graph (MAG)<\/em>\u201d.\u00a0 Optimizing this pipeline has also reduced the time needed to create and validate the graph.\u00a0 Currently, we can produce a new version of MAG every week.\u00a0 For an overview of how MAG is laid out and built, see our recently published paper<\/span><\/a>.<\/p>\n

What is contained in the dataset and what makes it unique?<\/em><\/strong><\/p>\n

Web scale data collection<\/em><\/p>\n

As stated, our Bing partnership brings the knowledge of the entire web into our graph.\u00a0 Combining this with the power of cloud computing allows for rapid iteration over immense quantities of data (tens of billions of raw data points) using intelligent agents to objectively curate the graph.<\/p>\n

Advanced entity conflation and disambiguation<\/em><\/p>\n

Our agents are trained to understand and reason over partial and noisy information from documents in diverse data sources. They recognize and assemble semantic objects in the academic domain (e.g. scholarly publications, authors, affiliations, conferences, journals, and fields of study) into the cohesive and evolving knowledge graph (MAG).<\/p>\n

Word\/Phrase embeddings<\/em><\/p>\n

Paper citation networks are well known to be sparse, have human bias, and be \u201ccliquey\u201d as researchers often cite papers by their advisors, friends and peers. It is rare to see inter-disciplinary citations even though researchers in disparate disciplines are often solving the same underlying scientific problem. MAG mitigates sparsity and clique issues in the graph by enriching paper-to-paper links across disciplines via a paper similarity system. This system not only uses the citation graph but also the content of each paper via trained language embeddings, as outlined in this paper<\/span><\/a>. The word embedding and citation-based paper recommendations can be found here<\/span><\/a>, in MAG.<\/p>\n

Trained word embeddings are also used to generate embeddings for our fields of study, allowing us to quickly tag papers with relevant concepts based on their content. We provide the ability for users to tag their own text documents with our fields of study, using our trained language embeddings, as part of the Microsoft Academic Language Similarity API<\/span><\/a>. This API is made available to anyone upon request, alongside weekly MAG updates.<\/p>\n

Field of study tagging and taxonomy learning<\/em><\/p>\n

MAG is built and organized using field of study tagging and taxonomy learning, which lets consumers of the graph subdivide the data.\u00a0 This is done through concept discovery, concept-document tagging and concept hierarchy generation.\u00a0 A detailed explanation of this process is provided in our recent paper<\/span><\/a>.<\/p>\n

In MAG, the fields of study can be found in this stream<\/span><\/a> and their parent-child relationships can be found in this one<\/span><\/a>. The corresponding UMLS IDs and source URLs are available in this stream<\/span><\/a>.<\/p>\n
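As a minimal sketch of how the parent-child stream could be used to subdivide the graph, the Python below collects every descendant of a given field of study. It assumes the stream has already been loaded as (parent ID, child ID) pairs; the IDs and the small hierarchy shown are invented for illustration and are not taken from MAG.

```python
from collections import defaultdict, deque

# Hypothetical (parent_id, child_id) rows; real rows would come from the
# parent-child stream. The IDs and the hierarchy are made up for illustration.
CHILD_ROWS = [
    (1, 2),  # e.g. coronaviridae -> betacoronavirus
    (2, 3),  # e.g. betacoronavirus -> a species-level field
    (2, 4),  # e.g. betacoronavirus -> another species-level field
    (5, 6),  # an unrelated branch
]


def descendants(root_id, child_rows):
    """Return the IDs of all fields below root_id using a breadth-first walk."""
    children = defaultdict(list)
    for parent_id, child_id in child_rows:
        children[parent_id].append(child_id)

    found = set()
    queue = deque([root_id])
    while queue:
        current = queue.popleft()
        for child_id in children.get(current, []):
            if child_id not in found:  # skip fields that were already visited
                found.add(child_id)
                queue.append(child_id)
    return found


print(sorted(descendants(1, CHILD_ROWS)))
```

The resulting set of IDs can then be used to filter papers down to one branch of the taxonomy.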

See an example of using fields of study below.<\/em><\/p>\n

Predictive static ranking: Saliency<\/em><\/p>\n

MAG computes saliency using reinforcement learning (RL) to assess the importance of each entity in the coming years.\u00a0 As MAG sources content from the Web, saliency plays a critical role in telling good content from poor content.\u00a0 The RL algorithm is programmed to predict future citations.\u00a0 Based on the publication and citation activities surrounding the novel coronavirus, MAG has learned that COVID-19 related articles are most likely to be cited in the coming years.\u00a0 See our recent blog post<\/span><\/a> for more details.<\/p>\n

Saliency is available in MAG as the \u201crank\u201d attribute<\/span><\/a>.<\/p>\n

Multi-sense similarities<\/em><\/p>\n

MAG is by nature a heterogeneous graph with different types of entities and relations, in which various structural relations correspond to different semantic similarities. For example, two fields of study can be similar in different senses: they might often be studied together (co-appearing in the same papers or venues), or they might co-occur with all types of entities in the graph. We therefore learn multi-sense network representations for entities in MAG and make the Network Similarity (NS) package<\/span><\/a> publicly available. Using the NS package, we can reveal the fields most similar to \u201cCOVID-19\u201d and \u201cSARS-CoV-2\u201d under different senses.<\/p>\n

See an example of Multi-sense similarities below.<\/em><\/p>\n

Examples<\/strong>:<\/p>\n

Impact of COVID-19 on the Computer Science Research Community<\/span><\/a> \u2013 Our team\u2019s research into the impact COVID-19 may have on conferences, authors and the Computer Science field of study.\u00a0 Source code examples: https:\/\/github.com\/microsoft\/mag-covid19-research-examples\/tree\/master\/src\/MAG-Samples\/impact-of-covid19-on-the-computer-science-research-community<\/span><\/a><\/em><\/p>\n

How I built a list of coronavirus-related research papers using the Microsoft Academic Graph<\/span><\/a> \u2013 Medium post by Adam Day outlining the process that was used to generate a list of COVID-19 related papers.<\/p>\n

Multi-sense network similarity examples:<\/strong><\/p>\n

We show the fields of study most similar to COVID-19 and SARS-CoV-2 under three different senses in MAG, powered by the NS package. Take COVID-19 as an example. Under the \u201ccopaper\u201d sense, the top entities are other fields discussed in COVID-19 publications, such as SARS-CoV-2, H1N1, Ebola, Nipah, and MERS. Under \u201ccovenue\u201d, the most similar entities are fields (e.g., Infectious Disease Epidemiology, Index case (patient zero), Middle East respiratory syndrome coronavirus) that are also studied in the journals or conferences in which COVID-19 publications appear. Finally, under the \u201cmetapath\u201d sense, two fields are similar if they co-occur with all other types of entities (papers, venues, affiliations, and fields of study); here the most similar fields include Viral phylodynamics, Middle East respiratory syndrome coronavirus, and Lassa fever. Overall, the three senses yield different sets of similar fields, revealing different perspectives on the entity in focus, i.e., COVID-19 in this case.<\/p>\n

[Figure: multi-sense network similarity results]<\/p>\n


Source code example:\u00a0 https:\/\/github.com\/microsoft\/mag-covid19-research-examples\/tree\/master\/src\/MAG-Samples\/NetworkSimilaritySample<\/span><\/a><\/em><\/p>\n
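To illustrate why the same entity can have different neighbours under different senses, here is a small Python sketch that ranks fields by cosine similarity over a separate set of vectors per sense. It does not use the NS package: the choice of cosine similarity, the two-dimensional vectors, and the function names are all assumptions made for this example, and the NS package may score entities differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_similar(target, embeddings, k=3):
    """Rank every entity except the target by similarity to the target."""
    scores = [
        (name, cosine(embeddings[target], vector))
        for name, vector in embeddings.items()
        if name != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Made-up vectors: one embedding table per sense (hypothetical values).
SENSES = {
    "copaper": {
        "covid-19": [1.0, 0.0],
        "sars-cov-2": [0.9, 0.1],
        "index case": [0.1, 0.9],
    },
    "covenue": {
        "covid-19": [0.2, 1.0],
        "sars-cov-2": [1.0, 0.1],
        "index case": [0.3, 0.9],
    },
}

for sense, table in SENSES.items():
    print(sense, top_similar("covid-19", table, k=1))
```

With these invented vectors the closest field to "covid-19" differs between the two senses, which mirrors the behaviour described above.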

Fields of study stamping examples:<\/strong><\/p>\n

Papers Similar to: The role of absolute humidity on transmission rates of the COVID-19 outbreak<\/span><\/a><\/p>\n

Microsoft Academic provides highly related papers as recommended reading to this paper even though this paper currently has no citations. It therefore uses only trained word embeddings and the content of this paper. Notice the ability of the system to pick up relevant papers based on the broad concept of \u201cclimate and how it affects transmission of viruses\u201d. Word embeddings allow the system to relate terms like \u201chumidity\u201d, \u201cclimate\u201d, \u201ctropical\u201d, and \u201cweather\u201d together as well as \u201cinfluenza\u201d, \u201ccoronavirus\u201d, and \u201cvirus\u201d.<\/p>\n

In the U-SQL sample code linked below, we show how Fields of Study are used along with publication title and abstract term matching to find papers about COVID-19.<\/p>\n

Source code for this example: https:\/\/github.com\/microsoft\/mag-covid19-research-examples\/tree\/master\/src\/MAG-Samples\/CoronavirusPapersSample<\/span><\/a><\/em><\/p>\n
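The same selection logic can be sketched in a few lines of Python. This is not a translation of the linked U-SQL sample: the record layout (`title`, `abstract`, `fields_of_study`) and the sample papers are hypothetical, while the field names and keywords are the ones used in the MAKES query later in this post.

```python
# Field-of-study names and keywords taken from the query shown in this post.
TARGET_FIELDS = {
    "coronavirus disease 2019",
    "severe acute respiratory syndrome coronavirus 2",
    "betacoronavirus",
    "coronaviridae",
    "coronavirus",
}
KEYWORDS = ("coronavirus", "coronaviridae", "betacoronavirus")


def is_coronavirus_paper(paper):
    """Match on a tagged field of study first, then on title/abstract terms."""
    fields = {name.lower() for name in paper.get("fields_of_study", [])}
    if fields & TARGET_FIELDS:
        return True
    text = (paper.get("title", "") + " " + paper.get("abstract", "")).lower()
    return any(keyword in text for keyword in KEYWORDS)


# Hypothetical records; the dictionary keys are an assumed layout.
PAPERS = [
    {"title": "Humidity and outbreak dynamics",
     "fields_of_study": ["Coronavirus disease 2019"]},
    {"title": "A novel coronavirus in bats", "fields_of_study": []},
    {"title": "Graph embeddings at scale",
     "abstract": "We study citation networks.",
     "fields_of_study": ["Computer science"]},
]

matches = [paper["title"] for paper in PAPERS if is_coronavirus_paper(paper)]
print(matches)
```

Checking the field-of-study tags first catches papers whose title and abstract never mention the keywords, which is the main benefit of combining the two signals.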


Microsoft Academic Knowledge Exploration Service (MAKES) \/ Project Academic API<\/h2>\n

MAKES was created in response to our customers\u2019 requests for a non-rate-limited version of our Project Academic Knowledge API.\u00a0 In its basic form, MAKES is a self-hosted REST API<\/span><\/a> that leverages an index of all the entities in the Microsoft Academic Graph (MAG). When you subscribe to MAKES, the required components are delivered to your Azure subscription whenever a new version of MAG is created, typically once every 1 to 2 weeks. You can then run a provided tool that automatically provisions MAKES instances in your Azure account.<\/p>\n

Examples:<\/strong><\/p>\n

MAKES Documentation<\/span><\/a> \u2013 Documentation for self-hosting a MAKES API<\/p>\n

Project Academic Knowledge Documentation<\/span><\/a> \u2013 Documentation for the Microsoft hosted API<\/p>\n

Querying MAKES \/ Project Academic API to retrieve papers about coronavirus \u2013 The following query can be given to MAKES \/ Project Academic API to produce a list of papers associated with coronavirus.\u00a0 It selects publications based on the field of study group (coronavirus), family (coronaviridae), genus (betacoronavirus), species (SARS-CoV-2\/COVID-19) or title\/abstract matches on those keywords.<\/p>\n

Or(Composite(F.FN=='coronavirus disease 2019'), Composite(F.FN=='severe acute respiratory syndrome coronavirus 2'), Composite(F.FN=='betacoronavirus'), Composite(F.FN=='coronaviridae'), Composite(F.FN=='coronavirus'), W='coronavirus', AW='coronavirus', W='coronaviridae', AW='coronaviridae', W='betacoronavirus', AW='betacoronavirus')<\/em><\/p>\n
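Because the expression is a flat `Or(...)` over field names and keywords, it can be generated rather than typed by hand. The Python sketch below rebuilds the query above from two lists and URL-encodes it. The helper name, the `count` value, and the requested `attributes` are illustrative assumptions; check the MAKES or Project Academic Knowledge documentation for the parameters your deployment accepts. No request is sent.

```python
from urllib.parse import urlencode

# Values copied from the query above.
FIELD_NAMES = [
    "coronavirus disease 2019",
    "severe acute respiratory syndrome coronavirus 2",
    "betacoronavirus",
    "coronaviridae",
    "coronavirus",
]
KEYWORDS = ["coronavirus", "coronaviridae", "betacoronavirus"]


def build_expression(field_names, keywords):
    """Combine field-of-study and title/abstract word clauses into one Or()."""
    clauses = [f"Composite(F.FN=='{name}')" for name in field_names]
    for word in keywords:
        clauses.append(f"W='{word}'")   # word in the title
        clauses.append(f"AW='{word}'")  # word in the abstract
    return "Or(" + ", ".join(clauses) + ")"


expr = build_expression(FIELD_NAMES, KEYWORDS)

# Assumed query parameters for illustration; adjust to your endpoint.
query_string = urlencode({"expr": expr, "count": 10, "attributes": "Id,Ti,Y"})
print(expr)
print(query_string)
```

Generating the expression this way makes it easy to add another keyword or field of study without risking an unbalanced parenthesis or a mistyped quote.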


Microsoft Academic website<\/h2>\n

The Microsoft Academic website is updated when new versions of the graph are released, about once a week.\u00a0 The site is designed to provide top \u2018n\u2019 results for search queries through an easy-to-use interface.\u00a0 One advantage of using our website is its improved search functionality, which allows for semantic interpretations of, and suggestions for, your queries.<\/p>\n

[Figure: Microsoft Academic website search results]<\/p>\n


In the example above, you can see results from a query for coronavirus disease 2019, the topic for the current novel coronavirus.\u00a0 We understand the topic from the query and populate a card to the right of the search results showing a description along with parent and related topics.\u00a0 Selecting a publication from the list takes you to a details page showing the information we have collected about it: publishing venue, authors, institutions, links to the document that we have found on the web, and any topics that have been tagged for the publication.\u00a0 In the bottom section of the page, we show the publications it references, the publications that we have found to cite it, and a tab for related publications.<\/p>\n

[Figure: publication details page]<\/p>\n

We also provide analytics for each entity type in our graph (publications<\/span><\/a>, authors<\/span><\/a>, topics<\/span><\/a>, conferences<\/span><\/a>, journals<\/span><\/a>, and institutions<\/span><\/a>).\u00a0 Our analytics pages allow you to search for Topics and find the top 100 entities, trends data and an overview of the distribution of entity types in the graph.<\/p>\n

In Summary<\/strong><\/p>\n

The Microsoft Academic team is committed to providing the community with any data that can help stem the COVID-19 advance.\u00a0 We hope this blog post has offered some guidance; it will be updated as details or related information change.<\/p>\n


Appendix A \u2013 Links to Microsoft Academic resources<\/strong><\/p>\n

Microsoft Academic Project<\/span><\/a><\/p>\n

MAG Documentation and example code<\/span><\/a><\/p>\n

MAKES Documentation and examples<\/span><\/a><\/p>\n

Project Academic Documentation and examples<\/span><\/a><\/p>\n

White House Office of Science and Technology Policy<\/span><\/a><\/p>\n


Appendix B – Updates to this post<\/strong><\/p>\n