Microsoft Academic Articles
http://approjects.co.za/?big=en-us/research/
Thu, 24 Jun 2021 18:45:06 +0000

Next Steps for Microsoft Academic – Expanding into New Horizons
http://approjects.co.za/?big=en-us/research/articles/microsoft-academic-to-expand-horizons-with-community-driven-approach/
Tue, 04 May 2021 16:55:10 +0000

The post Next Steps for Microsoft Academic – Expanding into New Horizons appeared first on Microsoft Research.

Editor’s note, June 4, 2021 – the post has been updated with a more extensive FAQ to provide more details on the changes announced May 4.

For over seven years, Microsoft Research has been proud to have one of its AI research projects contribute to the open exchange of knowledge within the research community. We are now evolving our focus to explore how we can advance these AI technologies in Microsoft 365 to empower every person and organization to derive valuable insights from their content.

We remain confident in open and community-driven alternatives to MAS and are pleased to see the recent momentum across the academic ecosystem. Many of our open-source machine learning algorithms and annotated data repositories are available to the community today, and we will continue to provide guidance to key partners throughout this transition.

Microsoft Academic has been on a mission to explore new ways to empower researchers and research organizations to achieve more. The research project is characterized by two sets of technologies: one that reads all the Bing-indexed web pages and organizes the most up-to-date academic knowledge into a knowledge base called Microsoft Academic Graph (MAG), and the other that performs semantic reasoning and inference to serve that knowledge through the Microsoft Academic search website and API. We are proud that these data and web services have been found useful in numerous research projects around the world, and excited to see more community-driven, public efforts emerge.

One question that we are asked frequently, though, is how the technologies powering Microsoft Academic can be used by institutions outside of academia to make organizational knowledge more discoverable and accessible. Over the years, we have openly shared some of the building blocks, such as the language and network similarity packages, and the core search engine MAKES.  With the continued progress in data access, we believe now is the right time to fully explore opportunities to extend this technology to new industries and transition to community approaches for academic research.

Microsoft Research will continue to support the automated AI agents powering Microsoft Academic services through the end of calendar year 2021. During this time, we encourage existing Microsoft Academic users to begin transitioning to other equivalent services. Below are just a few of the many great options available to the community.

Thank you very much for the years of support and encouragement. We are immensely grateful to have learned and grown from your feedback over the years. As we are passing the torch to the community-driven efforts, we invite you to join us in continuously contributing ideas and suggestions to nurture, embrace, and grow these platforms.

FAQ on Microsoft Academic

Q: What is happening to Microsoft Academic Services (MAS)?
A: Microsoft Research set out to demonstrate AI-curated knowledge can effectively assist people in making serendipitous discoveries and deriving valuable insights. After seven years of developing the machine reading technology and working with the research community, we have chosen to embrace a community-driven approach within academia and now turn our focus to exploring ways we can extend this technology to even more people and organizations. This AI research project will be supported until the end of calendar year 2021, upon which time MAS will be retired.

What this means for each service:

  • Microsoft Academic Website: No longer accessible after Dec. 31, 2021
  • Project Academic Knowledge: No longer accessible after Dec. 31, 2021
  • Microsoft Academic Graph/Microsoft Academic Knowledge Exploration Service: No longer providing updated data or access to old releases after Dec. 31, 2021; however, existing copies can still be used under license.

Q: Why is Microsoft retiring MAS?
A: Microsoft Research developed MAS in response to feedback from our colleagues that the inequality in accessing large datasets presented a significant obstacle to conducting research and cultivating academic talents in the areas of Big Data and AI. With MAS, Microsoft Research has been proud to contribute to a culture of open exchange and a growing ecosystem of collaborators. As this research project has achieved its objective to remove the data access barriers for our research colleagues, it is the right time to explore other opportunities to give back to communities outside of academia.

Q: What will happen to the customers using the research service?
A: Customers are welcome to continue their use of MAS following the same data licenses and terms of use until the end of calendar year 2021.

Q*. Can customers pay to keep MAS running indefinitely?
A. The team's decision to move on from MAS is based not on operational cost but on opportunity cost. We recognize that the core mission of MAS, to have intelligent agents gather knowledge and empower humans to gain deeper insights and make better decisions, is useful not only to academics but also to all modern workers and students. Expanding our scope to areas beyond academic content, particularly in enterprise and educational settings where our work can serve many orders of magnitude more users, is a tremendous opportunity with exciting challenges. Besides, momentum is building behind open and community-driven alternatives to MAS. We expect a few will be available by the end of the year, and the first public announcement of a MAG replacement and this announcement have bolstered our confidence in this assessment.

Q*. Can you open-source components that create MAS?
A. The portion of the software that implements machine learning algorithms has been publicly disclosed in detail. Additionally, for modern machine learning, a large amount of annotated data can arguably be more valuable than software for implementing known algorithms. MAG has published such annotated data, in some advanced cases with confidence scores describing the quality of the annotations. We have open-sourced examples of leveraging MAG annotations in modern Python-based machine learning frameworks, for example, a graph representation learning approach called HGT at this GitHub repository and an advanced topic recognition algorithm called MATCH here.

Q*. Can you provide a high-level sketch of the software architecture for MAS?
A. We have shared with our partners this deck that visualizes the architecture described in our published articles (links in the deck). We hope you find the illustrations helpful as well.

Visualizing the Topic hierarchy on Microsoft Academic
http://approjects.co.za/?big=en-us/research/articles/visualizing-the-topic-hierarchy-on-microsoft-academic/
Fri, 20 Nov 2020 19:11:21 +0000

The post Visualizing the Topic hierarchy on Microsoft Academic appeared first on Microsoft Research.

Today we have released a novel way to visualize and explore the topic hierarchy on the Microsoft Academic website.  The Microsoft Academic Graph (MAG) uses fields of study, exposed as Topics on the Microsoft Academic website, to categorize entities.  These fields of study are hierarchical in nature, grouping specific fields of study under larger, more generic fields of study.  Over the past year we have thought a lot about how to help our users visualize our topic hierarchy and the connections within it.

Entity Analytics is a big part of the Microsoft Academic website.  On our analytics pages, we show statistics of each entity in our graph as well as rankings and trends.  However, navigating through the 700K+ topics can be a difficult task.  In previous releases we created a Topic browser control that allows you to search for topics you are interested in.  Once a desired topic is selected, parent and child topics are displayed to help users understand the hierarchical nature of topics in our graph and navigate between them.  This control fits its basic purpose but does not give the user perspective on the scale of topics available and how they are connected.

We feel that a visual representation of our topic hierarchy can give our users better context. Given that topics in Microsoft Academic can have many parents and children, seeing these relationships in a directed graph brings perspective to their structure. It also brings a bit of fun to exploring the topic graph. In its default state, the topic graph explorer shows all the top-level topics and allows you to expand down the graph. Nodes are color coded by the level of the hierarchy in which they appear and sized based on the number of publications they contain.

As you navigate around the graph, child and parent nodes are drawn and connected. Depending on which entity analytics page you are viewing, information about entities within that topic will be shown. Below is a zoomed-out view of the Optics topic and its relationships on the author analytics page. On the left, you can see its parent relationship with Physics and its shared child topics with both Computer Vision and Algorithm. On the right, you can see detailed information about the topic and the authors within.

Expanded 'Optics' node

When expanding nodes you can see their relationships and detailed information.

The topic graph explorer control also allows you to quickly find the topic you are interested in through the search feature. The search feature not only searches through all the topics in MAG, but also shows you matches to topics that exist in the visual graph you have created. Once a topic is selected, it is added to the visual graph where you can explore related topics by expanding nodes up and down the hierarchy.

Topic search

Search for topics in either the visual graph or all of MA

There are two ways to engage with this new feature. By navigating to any of the entity analytics pages you can then select the “Topic Graph Explorer” tab. The entity analytics pages can be accessed by clicking on an entity type in the entity statistics bar on the home page.

Entity statistics

Click on entity statistics to see analytics

Or, if you are viewing a topic from its details page, you can click on the ‘Explore’ button and see entity analytics based on that topic.

Explore a topic

Explore a topic from its details page

We believe this feature brings a new perspective to our topic graph while adding a bit of fun as well.  As always, if you have feedback on this feature or any other feature on the Microsoft Academic website, please reach out through the feedback control on the lower right on our website.

Expanding Semantic Search into Biomed with Medical Subject Headings (MeSH)
http://approjects.co.za/?big=en-us/research/articles/expanding-semantic-search-into-biomed-with-medical-subject-headings-mesh/
Wed, 21 Oct 2020 17:01:55 +0000
Microsoft Academic users can now explore biomedical publications using Medical Subject Heading (MeSH) terms in semantic search.

The post Expanding Semantic Search into Biomed with Medical Subject Headings (MeSH) appeared first on Microsoft Research.

We’re excited to announce that Microsoft Academic (MA) users can now explore biomedical publications using Medical Subject Heading (MeSH) terms in semantic search.

MeSH is a controlled and hierarchically organized vocabulary that the National Institutes of Health (NIH) maintains for indexing, cataloging, and facilitating search in biomedical databases such as PubMed. Since releasing the new version of MA nearly 5 years ago, we have increasingly observed that many user queries are phrased using MeSH terminology. That observation, coupled with the prevalence of biomedical literature in the Microsoft Academic Graph (MAG), led us to pursue the integration of MeSH into MA’s unique semantic search capabilities.

Users on MA can now access MeSH using two new semantic attributes: MeSH descriptors and MeSH qualifiers, each signified by its own icon.

Query suggestions for the query

Revisiting semantic search

One of the core differentiating behaviors of MA has always been its emphasis on semantic search. In contrast to keyword search where a search engine performs best when users select the “right” keywords that match how the contents are indexed, semantic search is designed for the cases when it is not clear what the “right” keywords should be. For example, suppose you want to find the most influential publications in artificial intelligence (AI). Using the query “artificial intelligence” with a keyword-based search engine, you will get results where the query terms explicitly appear in the paper title/body, which misses the influential publications on AI that do not contain those specific terms. A semantic search engine like MA, on the other hand, will be able to overcome this limitation.

As of the time of writing, the top results for the query “artificial intelligence” on MA are articles that demonstrate the efficacy of deep convolutional neural networks for computer vision. These trend setting articles do not include “artificial intelligence” anywhere in their titles, abstracts, or even in the full text body and hence will not be retrieved by keyword search unless additional field of study annotations are also indexed as keywords.
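The contrast described above can be illustrated with a minimal sketch. It assumes a toy in-memory collection; the `Paper` class, the two search functions, and the sample records are hypothetical and are not MA's implementation or real MAG data.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    title: str
    # Field-of-study annotations attached to the paper (hypothetical values).
    fields_of_study: set = field(default_factory=set)


def keyword_search(papers, query):
    """Return papers whose title literally contains the query string."""
    q = query.lower()
    return [p for p in papers if q in p.title.lower()]


def semantic_search(papers, query):
    """Return papers annotated with the query as a field of study."""
    q = query.lower()
    return [p for p in papers if q in p.fields_of_study]


# Hypothetical records for illustration only.
PAPERS = [
    Paper("Deep convolutional networks for image recognition",
          {"artificial intelligence", "computer vision"}),
    Paper("A survey of artificial intelligence in education",
          {"artificial intelligence", "education"}),
]

print(len(keyword_search(PAPERS, "artificial intelligence")))   # 1
print(len(semantic_search(PAPERS, "artificial intelligence")))  # 2
```

The keyword search misses the first paper because the phrase never appears in its title, while the annotation-based lookup retrieves both.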

However, there are scenarios where a more intelligent search behavior cannot be so easily addressed, which is where our semantic search truly shines.

What are composite attributes?

Composite data relationships are one such example. In a world where talent can move from one institution to another, it is common to see authors with publications affiliated with different institutions. Meanwhile, authors can also collaborate with others from their previous affiliations. A query consisting of an author and an institution can therefore be interpreted as seeking either the work of the author while affiliated with the institution, or the collaborative work this author has with that institution. We can distinguish these two meanings by modeling the author-affiliation relationship as a composite attribute of a publication. Our API users have always been able to express this nuanced intent using the composite query function, and we are now making the same capability available to our website users.
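As a rough illustration of the two interpretations, the sketch below stores each paper's authorships as (author, affiliation) pairs and contrasts matching within one pair against matching across pairs. The data, field names, and the functions `while_at` and `lumped` are hypothetical; this is not the MA API or its query syntax.

```python
# Each paper carries a list of (author, affiliation) pairs: a composite attribute.
# All records are hypothetical.
PAPERS = [
    {"title": "Paper 1", "authorships": [("author a", "university x")]},
    {"title": "Paper 2", "authorships": [("author a", "lab y"),
                                         ("author b", "university x")]},
    {"title": "Paper 3", "authorships": [("author b", "lab y")]},
]


def while_at(papers, author, affiliation):
    """Author and affiliation must match within the same pair."""
    return [p["title"] for p in papers
            if any(a == author and f == affiliation
                   for a, f in p["authorships"])]


def lumped(papers, author, affiliation):
    """Author and affiliation may match in different pairs of the same paper."""
    return [p["title"] for p in papers
            if any(a == author for a, _ in p["authorships"])
            and any(f == affiliation for _, f in p["authorships"])]


print(while_at(PAPERS, "author a", "university x"))  # ['Paper 1']
print(lumped(PAPERS, "author a", "university x"))    # ['Paper 1', 'Paper 2']
```

Paper 2 appears only in the lumped result: the author wrote it while at another institution, with a coauthor from the queried one.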

Take the Turing Award winner Yann LeCun as an example. As a renowned computer scientist, he has had a productive career through AT&T Bell Labs, the Courant Institute at New York University and, most recently, Facebook. Previously, MA treated the query “Yann LeCun New York University” by lumping the search results of both interpretations together. MA users can now use “Yann LeCun while at New York University” to more narrowly scope search to only include papers written while the author was affiliated with New York University. As the goal of semantic search is to zoom in on the most relevant result, being able to express more precise intent can help quickly filter the massive result sets that a keyword search engine would produce. For example, MA will serve up only one result for the query “Yann LeCun while at New York University Bell Labs”: a paper that another Bell Labs researcher coauthored with Yann LeCun. None of the papers Yann published while he worked at Bell Labs are included in the search results, as shown below (Note: be sure to engage with the query suggestion as explained in the MA FAQ):

Search results for query

Similarly, the query “Yann LeCun Bell Labs” is now treated as an ambiguous query and will prompt MA to help the user clarify their intent with disambiguating query suggestions:

Query suggestions for

MeSH as a composite attribute

Composite attributes provide a powerful mechanism to group concepts that should be processed together, and one area that can further demonstrate its efficacy is in handling Medical Subject Headings (MeSH).

In the MeSH implementation now available on MA, two basic types of MeSH records are included: the descriptor (aka main heading) and the qualifier (aka subheading). Descriptors characterize the subject matter or content of an article, while qualifiers are used in connection with descriptors to define a particular aspect of a subject.

A good way to understand the differences between descriptors and qualifiers and our rationale to keep them as distinct fields in a composite attribute is through terms that can play either role. Take “mortality” as an example. MA can now differentiate the dual roles this term can play directly in the query suggestion dropdown where a darker/lighter icon is used for a descriptor/qualifier, respectively:

Query suggestions for

Clicking on the fourth suggestion to instruct MA to interpret “mortality” as a descriptor, one can see (from the “Top Topics” on the left rail of the search result page) that research on this subject commonly co-occurs with topics in “demography”, “population” and “public health”.

Search results for query

Further down the search result page are new sections for top co-occurring MeSH descriptors, where we can see that mortality is typically studied with other subjects like sex (male vs female), age, and geography. Similarly, by looking into top related MeSH qualifiers, MA shows the research articles addressing the subject of mortality are commonly from the areas of epidemiology or etiology, and the top topics include mortality trends and prevention control:

Search result filters for query

In contrast, when asking MA to interpret “mortality” as a qualifier, we can see “mortality” is often an aspect in “internal medicine”, “surgery”, “cardiology” or “cancer” research. Take heart attack (MeSH descriptor “myocardial infarction”) as an example. As MA can now show, this area of research can be studied through many aspects, including “mortality” but also others ranging from “drug treatment” to “complications”:

Query suggestions for

In this example, if you want to focus on articles about the mortality rate of heart attacks, you can select the first query suggestion “myocardial infarction in relation to mortality”. On the subsequent search result page all the top-most results will match the “myocardial infarction/mortality” descriptor/qualifier pair, indicated by the highlighted tag as

or

One important item to note here is the presence of the “*”, which is a MeSH convention to annotate the “major topic” for an article. This major topic flag is used in MA as one of the many signals in determining search result rankings. However, because search rankings are influenced by many factors, it is possible that an article whose major topic matches the query perfectly is ranked lower than others whose major topics are not as tightly matched.

Moving back to query formulation, similar to the author/affiliation example showcased above, when encountering the ambiguous query “heart attack mortality” MA will now generate two suggestions that reflect distinct interpretations:

Partial query suggestions for

The first interpretation generates results explicitly about the mortality of heart attacks. The second query suggestion, however, reflects a larger set of results with articles about the mortality rate for diseases (not specifically heart attacks) but also mentioning heart attacks (e.g. as a preexisting condition). To put it another way, the first interpretation is more specific and the second less specific.

As with author/affiliation metadata, modeling MeSH concepts with composite attributes enables this behavior in semantic search. It also enables descriptor/qualifier values to be queried independent of each other.

As MeSH concepts overlap significantly with MA’s existing topics, we’ve also provided new scoping triggers for MeSH so that queries can be more precisely specified:

Scope | Description | Example
mesh: | Match MeSH descriptor and/or qualifier | mesh: heart attack; mesh: mortality; mesh: heart attack mortality; mesh: heart attack in relation to mortality
mesh descriptor | Match MeSH descriptor | mesh descriptor heart attack
mesh qualifier | Match MeSH qualifier | mesh qualifier diagnosis
abstract: | Match term or quoted value from the paper abstract | abstract: “heterogeneous entity graph comprised of six types of entities”
affiliation: | Match affiliation (institution) name | affiliation: “microsoft research”
author: | Match author name | author: “darrin eide”
conference: | Match conference series name | conference: www
doi: | Match paper Document Object Identifier (DOI) | doi: 10.1037/0033-2909.105.1.156
journal: | Match journal name | journal: nature
title: | Match term or quoted value from the paper title | title: “an overview of microsoft academic service mas and applications”
topic: | Match paper topic (field of study) | topic: “knowledge base”
year: | Match paper publication year | year: 2015
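
For readers who want to pre-process such queries in their own tools, the scope triggers listed above could be split from their values along these lines. This is a minimal sketch under stated assumptions: `parse_scoped_query` is a hypothetical helper, not part of MA, and it only recognizes a scope at the very start of the query.

```python
# Scope triggers copied from the table above. Longer triggers are tried first
# so "mesh descriptor" is not mistaken for some shorter trigger.
SCOPES = [
    "mesh descriptor", "mesh qualifier", "mesh:", "abstract:", "affiliation:",
    "author:", "conference:", "doi:", "journal:", "title:", "topic:", "year:",
]


def parse_scoped_query(query):
    """Split a query into (scope, value); scope is None for unscoped queries."""
    text = query.strip()
    lowered = text.lower()
    for scope in sorted(SCOPES, key=len, reverse=True):
        if lowered.startswith(scope):
            # Drop the trigger, surrounding whitespace, and any wrapping quotes.
            value = text[len(scope):].strip().strip('"“”')
            return scope.rstrip(":"), value
    return None, text


print(parse_scoped_query("mesh: heart attack"))
print(parse_scoped_query("mesh descriptor heart attack"))
print(parse_scoped_query("author: “darrin eide”"))
print(parse_scoped_query("deep learning"))
```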

 

In closing, we are excited about the addition of MeSH to MA, and the opportunities it enables with the research community. As always, we love getting feedback and try to respond to as much of it as possible. To provide feedback, navigate to Microsoft Academic and click the “feedback” icon in the lower right-hand corner.

Happy researching!

Visualizing academic impact
http://approjects.co.za/?big=en-us/research/articles/visualizing-academic-impact/
Wed, 01 Jul 2020 21:15:41 +0000
We're excited to announce that starting today Microsoft Academic users have a new way of visualizing academic impact. This feature, available on author, conference, journal, institution and topic detail pages, provides a visualization of the impact an entity has in relation to other types of entities. For example, this allows you to see which topics an author has most impacted with their publications, or which institutions have published the most impactful work in a specific journal.

The post Visualizing academic impact appeared first on Microsoft Research.

We’re excited to announce that starting today Microsoft Academic users have a new way of visualizing academic impact.   This feature, available on author, conference, journal, institution and topic detail pages, provides a visualization of the impact an entity has in relation to other types of entities. For example, this allows you to see which journals or conferences an author has most impacted with their publications, or which institutions have published the most impactful work in a specific journal.

Before we jump in, it’s important to define Impact and how it’s measured.

When our graph is built, each entity is evaluated along with its connected entities and an individual rank is assigned. This is done based on a few factors, including the rank of its connected entities through references or citations. In this way, rank is not solely driven by citation count, but also by the rank of an entity’s connections in the graph. Additionally, it is known that citations are a lagging indicator of impact because it takes time for research to be duly recognized and its impacts to be fully appreciated, leading to an age bias favoring older work. To adjust for this bias, we have employed a reinforcement learning algorithm that utilizes the massive historical data we have to train a ranker that recognizes the momentum of new publications and projects their future impacts. This way, newer work is not at a disadvantage when compared to older work. We refer to this rank as an entity’s ‘Saliency’ (for a more detailed description of Saliency, see our recent paper). The “top” entity relationships we previously showed on entity pages already reflected impact through saliency.

It is common for an author or an institution to achieve higher impact by being prolific. Saliency ranks therefore often conflate productivity and the impact of individual publications. While aggregate rank of impact is useful, it is also interesting to take productivity into consideration and ask: “What volume of work was done to achieve a given rank?” and “What is the average per-article impact?” To show this, we show the publication normalized saliency by using a feature in MAG called paper families to properly count the number of articles that should be regarded as a single publication. Paper families are a grouping of papers that we have found to be identical, or nearly identical, which have been published in different venues. Take for instance a paper an author has written that has been published in a pre-print repository, a conference and a journal. We record each of these publications as separate entities in the graph, but these publications represent the same work and are thus grouped into a paper family in MAG. Using this value for an entity’s publication count, we normalize the saliency and determine an entity’s productivity.
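To make the normalization idea concrete, here is a minimal sketch that counts each paper family once before averaging. The exact formula MAG uses is not given here, so the choices below (keeping the highest-scoring version per family and dividing the sum by the family count) are assumptions, as are the record layout and the function name `normalized_saliency`.

```python
from collections import defaultdict

# Hypothetical records: (paper_id, family_id, author, saliency).
# Papers 1-3 stand for the preprint, conference and journal versions of one work.
PAPERS = [
    (1, 100, "author a", 3.0),
    (2, 100, "author a", 3.0),
    (3, 100, "author a", 3.0),
    (4, 200, "author a", 1.0),
    (5, 300, "author b", 4.0),
]


def normalized_saliency(papers):
    """Aggregate saliency per author divided by the number of paper families."""
    family_saliency = defaultdict(dict)
    for _, family_id, author, saliency in papers:
        # Count each family once, keeping its highest-scoring version
        # (an assumption made for this sketch).
        best = family_saliency[author].get(family_id, 0.0)
        family_saliency[author][family_id] = max(best, saliency)
    return {
        author: sum(families.values()) / len(families)
        for author, families in family_saliency.items()
    }


print(normalized_saliency(PAPERS))  # {'author a': 2.0, 'author b': 4.0}
```

Without grouping by family, author a would be credited with four publications instead of two, which would distort the per-publication figure.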

Author impact chart for the University of Washington

The chart above shows the most impactful authors at the University of Washington. As you can see, Christopher J. Murray has the highest saliency rank. However, the author with the second highest saliency rank, Mohsen Naghavi, has a higher productivity (publication normalized rank), 2nd overall. Mohammad H. Forouzanfar, with only a few publications relative to peers, has really made an impact with the work they have published, ranked 1st among the top 20 at this institution.

By default, the charts display overall rankings. If you would like to dig further, we also provide contextual year and topic filters which allow you to drill down and build a custom view. You may also be curious what publications these authors have written to achieve these rankings. If you would like to see them, simply click on the author’s name and you will be taken to a search results page to view them.

Presenting real-time analytics has been a goal of our team this year. This feature is another great demonstration of how the Microsoft Academic Graph (MAG) paired with the Microsoft Academic Knowledge Exploration Service (MAKES) can be used to gather bibliometric data and tell the story behind research in real-time.

If you have any questions or comments on this feature or other features on our website, please reach out by using the feedback tab at the bottom right of our site.

Rationalizing Semantic and Keyword Search on Microsoft Academic
http://approjects.co.za/?big=en-us/research/articles/rationalizing-semantic-and-keyword-search-on-microsoft-academic-2/
Fri, 22 May 2020 01:07:35 +0000
Discussion of new changes to Microsoft Academic search, including expanded keyword search, phrase support, abstract search and more

The post Rationalizing Semantic and Keyword Search on Microsoft Academic appeared first on Microsoft Research.

Over the past 6 months we’ve been experimenting with a host of changes to Microsoft Academic’s search experience, and now that the last of those experiments has shipped we’re excited to finally discuss them.

Before we jump in, if you’re interested in a deeper technical analysis of the new capabilities please review the following resources:

No room for interpretation?

From the initial release of Microsoft Academic in 2016, up until 6 months ago, our semantic search algorithm focused on generating results that best matched semantically coherent interpretations of user queries, informed by the Microsoft Academic Graph (MAG).

To better explain, let’s examine the query “covid-19 science”. Traditional search engines based on keyword search (e.g. Google Scholar, Semantic Scholar, Lens.org, etc.) do an excellent job of retrieving relevant results that have keyword matches for “covid-19” and variations of “science” (science, sciences, scientific, etc.). Our system, however, prefers to interpret “covid-19” as a shorthand reference (synonym) of the topic “Coronavirus disease 2019 (COVID-19)” and “science” as the journal “Science” because MAG suggests this interpretation will turn up more highly cited and relevant papers than treating the query as simple paper full-text (title/abstract/body) keywords. This distinction is important, as it allows our semantic search algorithm to leverage semantic inference to retrieve seminal publications that do not strictly contain “covid-19” as keywords, yet are nevertheless relevant and important.

Regardless, we still previously allowed for rudimentary keyword matching, namely, prefix and literal unigram matching of publication titles (with no support for stemming or spelling corrections). Unfortunately, the outcome of this limited keyword matching was frequent encounters with the dreaded “no results” page.

For example, assume you were looking for a paper that you thought was named “heterogeneous network embeddings via deep architectures”. Entering this phrase as a query would result in no suggestions and an error page if executed on the site:

No search results

This is a classic case of users knowing what they want but having difficulty getting an algorithm to understand. A common problem with keyword search is that it puts the burden of choosing the “right” keywords for a query squarely on the shoulders of the user.

Now with our newest search implementation this same query will work exactly as intended:

Paper search result with dropped term

To understand why this now works we first need to explain how our semantic search implementation works.

Ok, maybe a little room for interpretation

To put it simply, we’ve changed our semantic search implementation from a strict form where all terms must be understood to a looser form where as many terms as possible are understood.

The formulation of semantic interpretations (as explained above) remains unchanged, in that the knowledge in MAG still plays the central role in guiding how a query should be interpreted. What has changed is that when a portion of a query is thought to refer to full-text properties (i.e. title, abstract), the algorithm can now dynamically switch to a new scoring function that is more appropriate than literal unigram matching and hence less brittle as the example above shows.

Going a bit deeper, let’s define what “as many terms as possible are understood” means. By its nature, loose semantic query interpretation will produce interpretations with the highest coverage first and fastest, and as interpretations with less coverage (i.e. terms are dropped from consideration) are generated the relevance and speed decrease. The reasons for this are technical and have to do with the search space growing exponentially as the query considered becomes less specific. So in practice “as many as possible” is better defined as “as many as possible in a fixed amount of time”.

This means that, depending on variables such as query complexity and service load, the more loosely matched results generated before a fixed timeout (aka the result “tail”) could vary between sessions. However, because the interpretations with highest coverage are generated first, the results they cover (aka the “head”) are very stable.
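The “highest coverage first, within a fixed amount of time” behavior can be pictured with a small sketch. This is only an illustration under stated assumptions, not the production algorithm: the function name `loose_interpretations` and the idea of enumerating term subsets directly are hypothetical, and a real interpreter would score each candidate against the knowledge in MAG.

```python
import itertools
import time


def loose_interpretations(terms, budget_seconds):
    """Yield subsets of query terms, highest coverage first, until time runs out.

    Hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + budget_seconds
    # Start with every term kept, then drop one term, then two, and so on.
    for kept in range(len(terms), 0, -1):
        for combo in itertools.combinations(terms, kept):
            if time.monotonic() > deadline:
                return  # budget exhausted: the "tail" is cut off here
            yield combo


query = ["heterogeneous", "network", "embeddings", "deep"]
results = list(loose_interpretations(query, budget_seconds=5.0))
print(results[0])    # the full-coverage interpretation comes first
print(len(results))  # how many interpretations fit in the budget
```

Because full-coverage candidates are produced before any terms are dropped, a shorter budget only trims the end of the sequence, which is why the “head” stays stable while the “tail” can vary.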

While this change is a great remedy for queries with full-text matching intent, the loosened interpretation does also impact semantic search results as they are no longer as concise as before due to a longer result “tail” that includes full-text matches.

As always, an example speaks a thousand words:

Query formulation

BEFORE

Show results matching top interpretations where all query terms are understood, ranked only by paper salience (static rank, aka importance)

AFTER

Show results matching top interpretations where as many query terms as possible are understood, ranked first by number of terms matched then by paper salience
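As a rough sketch of the two ranking rules, assuming each result carries a count of matched query terms and a static rank where a higher (less negative) value means more salient, as in the examples later in this post. The `Result` class and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Result:
    title: str
    terms_matched: int  # number of query terms the interpretation understood
    static_rank: float  # salience; higher (closer to zero) is more important


def rank_before(results, total_terms):
    """Strict form: keep only full matches, order by salience alone."""
    full = [r for r in results if r.terms_matched == total_terms]
    return sorted(full, key=lambda r: -r.static_rank)


def rank_after(results):
    """Loose form: order by terms matched first, then by salience."""
    return sorted(results, key=lambda r: (-r.terms_matched, -r.static_rank))


# Made-up results for an 8-term query.
sample = [
    Result("B", 6, -18.5),
    Result("A", 8, -20.1),
    Result("D", 4, -17.9),
    Result("C", 6, -19.0),
]
print([r.title for r in rank_before(sample, total_terms=8)])
print([r.title for r in rank_after(sample)])
```

Under the strict rule only the full match survives; under the loose rule it still ranks first, and partially matched results follow in order of coverage and then salience.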

Let’s take a closer look at the new “loose” semantic search algorithm, as it comes with a new user interface that illustrates how each search result is understood in the context of the user query:

As mentioned earlier, results are first ranked based on the number of query terms matched. In this case the first result matched all query terms and takes the top spot even though it has a lower static rank (and citation count) than the following two results. Another important item to call out is that when query terms are matched using synonyms, the synonymous terms are shown in parentheses next to the canonical form, e.g. the user typed “z shen” but it was matched to “zhihong shen”.

 

Here we can see the new semantic search results are based on “loose” interpretations. In both cases, the query terms “acl 2018” were not understood in the context of the result, and were shown as crossed out while the other terms maintain the same semantic understanding as the first result. Additionally, both results have a higher static rank than the first result but are ranked lower because they match less of the query.

 

As we look farther into the tail of results we can see how much of the query can be dropped (in this case 4 of the 8 query terms).

 

Matching phrases

Historically, Microsoft Academic has supported matching queries to values in a few different ways:

  • Matching exact values, e.g.
    “a web scale system for scientific knowledge exploration” => “a web scale system for scientific knowledge exploration”
  • Matching the beginning of values (aka prefix completions, only available as query suggestions), e.g.
    “a web scale system for scientific” => “a web scale system for scientific knowledge exploration”
  • Literally matching words from the value, e.g.
    “microsoft academic overview” => “an overview of microsoft academic service mas and applications”

In addition we now support a new form of partial value matching based on phrases. This is a common feature frequently seen in keyword search, where query interpretation prefers interpretations with closer term proximity. For example, comparing results for the query “deep learning brain images” based on simple word matching and phrase matching:

Top 5 papers using word matching, where results are based on matching words and ranking based on paper static rank:

  • Classification of CT brain images based on deep learning networks
    (Static rank = -18.994, Distance = 4)
  • Unsupervised Deep Feature Learning for Deformable Registration of MR Brain Images
    (Static rank = -19.036, Distance = 8)
  • Application of deep transfer learning for automated brain abnormality classification using MR images
    (Static rank = -19.305, Distance = 10)
  • Age estimation from brain MRI images using deep learning
    (Static rank = -19.727, Distance = 6)
  • Exploring deep features from brain tumor magnetic resonance images via transfer learning
    (Static rank = -20.06, Distance = 13)

Top 5 papers using phrase matching, where results are based on first matching words and then re-ranking based on edit distance between query and value (ignoring stop words):

  • Deep Learning on Brain Images in Autism: What Do Large Samples Reveal of Its Complexity?
    (Static rank = -20.372, Distance = 0)
  • Deep learning of brain images and its application to multiple sclerosis
    (Static rank = -20.534, Distance = 0)
  • Classification of CT brain images based on deep learning networks
    (Static rank = -18.994, Distance = 4)
  • Unsupervised Deep Feature Learning for Deformable Registration of MR Brain Images
    (Static rank = -19.036, Distance = 8)
  • A deep learning-based segmentation method for brain tumor in MR images
    (Static rank = -20.171, Distance = 6)

This new ability to re-rank based on query-value edit distance also allows us to support quoted phrases in queries.

The rules for quoted values are:

  • A quoted value can only be matched to a single field, e.g. title, author name, journal name, etc.:
    Works: “deep learning” (matches field of study)
    Works: “microsoft research” (matches affiliation)
    Doesn’t work: “deep learning microsoft research”
  • For attributes that support partial matching (title, abstract), all quoted words must have a term-based edit distance of zero, ignoring stop words:
    Works: “deep learning brain images”
    Doesn’t work: “brain deep images learning”
  • Queries can contain multiple quoted values, each being evaluated using the rules defined above:
    Works: “deep learning” “microsoft research”
  • A quoted value is treated as a single query term and can be dropped accordingly based on the new search algorithm:
    Doesn’t work: “deep learning at microsoft research rocks!”
    Works: deep learning “at microsoft research rocks!”
  • All terms in a quoted value are normalized in exactly the same fashion as non-quoted terms
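The zero-distance rule for quoted values can be sketched as follows. This is a minimal illustration that assumes “distance of zero” means the quoted terms appear contiguously and in order once stop words are removed; the stop-word list and the normalization here are simplified stand-ins, not the ones the service uses.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the real list is not given here.
STOP_WORDS = {"a", "an", "and", "for", "in", "its", "of", "on", "the", "to"}


def normalize(text):
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def quoted_phrase_matches(phrase, value):
    """True when the phrase's terms appear contiguously and in order in value."""
    needle, haystack = normalize(phrase), normalize(value)
    if not needle:
        return False
    return any(
        haystack[i:i + len(needle)] == needle
        for i in range(len(haystack) - len(needle) + 1)
    )


title = "Deep Learning on Brain Images in Autism: What Do Large Samples Reveal of Its Complexity?"
print(quoted_phrase_matches("deep learning brain images", title))
print(quoted_phrase_matches("brain deep images learning", title))
```

The first query matches because the stop word “on” is ignored; the second fails because the terms are out of order.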

Support for searching paper abstract

We have finally added support for a long-requested feature: searching paper abstracts! This is an important addition that significantly expands the reach of our partial-term matching for papers.

Abstracts are treated like all other semantic values, meaning they can be matched implicitly or explicitly using the “abstract:” scope, e.g.:

  • title: “microsoft academic” abstract: “heterogeneous entity graph”
  • “microsoft academic” “heterogeneous entity graph”

Scoped queries

Microsoft Academic has always supported query “hints” that require subsequent terms to match a specific attribute, e.g. the classic “papers about ”, but with our most recent release we now also support colon-delimited scopes.

The rules for scopes are simple: the query term immediately after the scope must be matched with that scope’s attribute type. A query “term” is defined as a single word or a quoted phrase. For example, if you wanted to match papers with “heterogeneous”, “entity” and “graph” in their abstracts but didn’t care about them being part of a sequence you would issue the query “abstract: heterogeneous abstract: entity abstract: graph”.

Supported scopes and their corresponding triggers:

Scope | Description | Example
abstract: | Match term or quoted value from the paper abstract | abstract: “heterogeneous entity graph comprised of six types of entities”
affiliation: | Match affiliation (institution) name | affiliation: “microsoft research”
author: | Match author name | author: “darrin eide”
conference: | Match conference series name | conference: www
doi: | Match paper Document Object Identifier (DOI) | doi: 10.1037/0033-2909.105.1.156
journal: | Match journal name | journal: nature
title: | Match term or quoted value from the paper title | title: “an overview of microsoft academic service mas and applications”
topic: | Match paper topic (field of study) | topic: “knowledge base”
year: | Match paper publication year | year: 2015
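To make the scope rule concrete, here is a small parsing sketch. It assumes a scope is written as its name followed by a colon and a space, and that a term is either one word or a quoted phrase; the function `parse_scoped_query` is hypothetical and does not reflect how the service tokenizes queries internally.

```python
import re

SCOPES = {"abstract", "affiliation", "author", "conference", "doi",
          "journal", "title", "topic", "year"}

# A term is a quoted phrase (straight or curly quotes) or one run of non-space characters.
TOKEN = re.compile(r'"([^"]*)"|“([^”]*)”|(\S+)')


def parse_scoped_query(query):
    """Return (scope, term) pairs; scope is None for unscoped terms."""
    pairs, pending_scope = [], None
    for match in TOKEN.finditer(query):
        bare = match.group(3)
        if bare is not None and bare.endswith(":") and bare[:-1].lower() in SCOPES:
            pending_scope = bare[:-1].lower()
            continue
        term = next(g for g in match.groups() if g is not None)
        pairs.append((pending_scope, term))
        pending_scope = None  # a scope applies only to the term right after it
    return pairs


print(parse_scoped_query("abstract: heterogeneous abstract: entity abstract: graph"))
print(parse_scoped_query("title: “microsoft academic” abstract: “heterogeneous entity graph”"))
```

Resetting the pending scope after each term mirrors the rule that only the term immediately following a scope is bound to it, which is why the three-word abstract example repeats the scope before every word.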

 

Feedback welcome

These changes have been in the works for over 6 months, and as always we’d love to hear your feedback, be it suggestions, critiques, bug reports or kudos. To provide feedback, navigate to Microsoft Academic and click the “feedback” icon in the lower right-hand corner.

Stay tuned in the coming weeks for another search-oriented post about how you can accomplish reference string parsing using Microsoft Academic Services!

Introducing the Microsoft Academic Knowledge Exploration Service (MAKES) V2 http://approjects.co.za/?big=en-us/research/articles/introducing-the-microsoft-academic-knowledge-exploration-service-makes-v2/ Tue, 21 Apr 2020 20:57:59 +0000

Today we are happy to announce the availability of MAKES version 2, a private, self-hosted version of the popular Project Academic Knowledge (PAK) API.  According to feedback from our customers, the most requested feature for our free-to-use service, the Project Academic Knowledge (PAK) API, is the ability to increase the threshold for monthly requests.  MAKES V2 allows users to self-host PAK API instances on their own Azure subscription, removing usage limitations.  With this release, data schema and implementation are nearly interchangeable between PAK and MAKES.  Porting an existing application using PAK is straightforward; just deploy MAKES V2 to your Azure subscription and update the PAK endpoints in your application to your MAKES V2 deployment.  MAKES V2 is built on a flexible architecture using standard Azure technologies, allowing you to choose the size of your instance(s) and throughput based on your needs.  MAKES V2 subscriptions are aligned with the most recent data from the Microsoft Academic Graph (MAG) and are provisioned at the same cadence as new versions of MAG, generally once a week.

Beyond parity with PAK, MAKES V2 allows you to build custom indexes containing only the information you want from MAG, empowering your solution to surface only the data that you require.  And soon, MAKES V2 will allow you to combine the data in MAG with data that you provide, as well as allow you to create custom entity schemas and customize the language grammar powering the interpret API to create truly custom solutions.

What is the Microsoft Academic Knowledge Exploration Service (MAKES)?

MAKES APIs are designed to deliver top-N results from MAG, giving you the ability to create dynamic real-time knowledge applications.  For example, you can create interactive websites like our Microsoft Academic website, real-time analytics applications like VOSviewer, interactive dashboards in Power BI or federate your existing search capabilities with publications, authors, institutions, journals and conferences in MAG.  For more information, see our introduction documentation.

Aligned with the Microsoft Academic Graph (MAG) and Project Academic Knowledge (PAK)

The MAKES APIs have always closely mirrored our PAK APIs in their interface, but up until this point the schemas and implementations have differed slightly.  MAKES V2 brings these two offerings to parity in interface and data schema to make the transition from our free service to our MAKES V2 Azure self-hosted service as seamless as possible.  Going forward, a central tenet of the MAKES APIs will be to maintain this parity to ensure an easy transition.  In this way, customers of PAK can use MAKES to scale up and out to meet their needs.
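Because the interfaces are meant to match, porting can in principle come down to swapping a base URL. The sketch below illustrates that idea only: both base URLs are placeholders, and the `evaluate` path and the `expr`, `count` and `attributes` parameter names are assumptions modeled on the PAK Evaluate method that should be checked against the documentation for your deployment.

```python
from urllib.parse import urlencode

# Both base URLs are placeholders for illustration, not real endpoints.
PAK_BASE_URL = "https://pak.example.invalid/academic/v1.0"
MAKES_BASE_URL = "https://my-makes-instance.example.invalid"


def evaluate_url(base_url, expr, count=10, attributes="Id,Ti,Y"):
    """Build an Evaluate-style request URL against whichever service base_url names.

    The path and parameter names are assumptions, not a confirmed contract.
    """
    query = urlencode({"expr": expr, "count": count, "attributes": attributes})
    return f"{base_url.rstrip('/')}/evaluate?{query}"


# The application code stays the same; only the configured base URL changes.
print(evaluate_url(PAK_BASE_URL, "Y=2015"))
print(evaluate_url(MAKES_BASE_URL, "Y=2015"))
```

Keeping the base URL in configuration rather than scattered through the code is what makes the switch a one-line change.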

Flexible architecture

When designing MAKES V2 we focused on two areas: simplicity and scalability.  MAKES V2 is built on top of proven Azure technologies which make it easy to deploy, maintain and scale.  Using a simple management tool supplied with your subscription, you can deploy a single instance or scale to multiple instances in multiple regions.  In this release we have also added the ability to create and deploy custom indexes when paired with a MAG subscription, allowing you to scale to only the data your applications and users require.

MAKES Architecture example

An example architecture for a MAKES API deployment

Index only what you need and add what you want

MAKES V2 moves beyond the PAK APIs in one important way: the ability to change what is indexed and delivered through the APIs.  With our tutorials as a starting point, it’s easy to create and index custom subgraphs of MAG. The custom index tutorial shows how to generate a subgraph that only contains entities related to a given institution, a common scenario for customers.  Recently, we have partnered with AI2 and others to generate the CORD-19 dataset for COVID-19 research.  What if you could build a powerful semantic search engine over not just those documents, but also over the prior research and fields of study referenced in those publications?  We set out to do just that and will be sharing this with the community soon.

Over the next few months, we will be adding functionality and tutorials to MAKES that will allow you to bring private data into the index.  Many institutions and organizations have libraries of private publications; in the coming months we will be opening the platform and providing tutorials that will show you how to combine your private library data with the MAG graph to create solutions for your organization.

Advanced features and product Road Map

Private data

We will be publishing tutorials and features to MAKES that will allow you to combine MAG data and private data to create a custom index to power MAKES APIs.  As an example, you will be able to combine private data such as a library of publications or patents owned or created by your organization with MAG to surface them from MAKES API calls.

Custom schema

We will be introducing tutorials and features that will allow you to create custom schemas for the data returned from the MAKES V2 APIs.  As an example, research institutions would be able to add a new property on a paper entity to show the amount of funding associated with each publication.  Using private data to populate new properties can power interesting real-time applications to determine ROI on research projects.

Custom grammar

The Interpret API in both PAK and MAKES V2 enables semantic interpretation of natural language queries using an SRGS grammar. We’re excited to announce that over the next few months we will be creating tutorials that teach you how to create and use custom grammars that enable a variety of different NLP scenarios.  As an example of modifying the grammar to enhance the interpret experience, you might want to support the ability to use key terms to narrow results.  To do this, the grammar could be modified to recognize the term ‘published after’, for instance, as a constraint that limits results returned from the Interpret API to publications created after the year specified in the query.  Here is an interpret query example that would be supported by this grammar change: “AI papers published after 2015”.  Grammars can also be created to support non-English language queries, specific terms, and so on.
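To show the intent of the ‘published after’ constraint, here is a plain Python sketch. It is not an SRGS grammar and does not use the Interpret API; it only shows how such a phrase could be separated from the rest of the query and turned into a year constraint. The function name and return shape are hypothetical.

```python
import re

PUBLISHED_AFTER = re.compile(r"\bpublished after (\d{4})\b", re.IGNORECASE)


def extract_year_constraint(query):
    """Split a query into its remaining text and an optional 'after year' value."""
    match = PUBLISHED_AFTER.search(query)
    if match is None:
        return query.strip(), None
    remaining = query[:match.start()] + query[match.end():]
    remaining = re.sub(r"\s+", " ", remaining).strip()
    return remaining, int(match.group(1))


print(extract_year_constraint("AI papers published after 2015"))
print(extract_year_constraint("deep learning"))
```

A grammar rule would play the same role declaratively: recognize the key phrase, capture the year, and pass the remaining terms on for normal interpretation.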

Our team is excited about this new offering and we are looking forward to sharing with you the new features coming out over the next few months.  As always, please contact us if you have any questions or feature requests.

Changes to Microsoft Academic Services (MAS) During COVID-19 http://approjects.co.za/?big=en-us/research/articles/changes-to-microsoft-academic-services-mas-during-covid-19/ Tue, 07 Apr 2020 21:22:22 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=648381 In response to the rapid publications on COVID-19, Microsoft Academic Services have made several changes in the way scholarly articles are processed.

Update: An article on the CORD-19 dataset is now available on arXiv.

Since we answered the call from the White House and teamed up with our partners to release CORD-19 and MAS resources a few weeks ago, the scientific output on COVID-19 has continued to grow at an amazing pace. In contrast to what we reported in our previous blog, the COVID-19/SARS-CoV-2 specific publications now look like the following based on the March-27 snapshot of the Microsoft Academic Graph (MAG):

COVID-19 papers in MAG

 

In the meantime, we continue to be in an “ultimate open science” era where all publishers have dropped their paywalls and expedited peer review and publication, and some have even waived the article processing charges (APCs) on publications related to the pandemic. Most impressively, major publishers have granted text and data mining (TDM) rights on coronavirus-related publications and agreed to their distribution in CORD-19 (albeit temporarily). As of April 6, 2020, there are more than 47K articles in the CORD-19 dataset mirrored at Kaggle and MIT, among others, a rapid growth compared with the outset, when only 13K out of 29K CORD-19 articles had full-text content.

These developments have necessitated a few new changes beyond what was described in our previous blog. First, starting this week, we have doubled our data update frequency from every other week to once a week. This has become necessary given that the research community is publishing more than 3500 articles a week on COVID-19 alone, as shown in the figure above. The faster pace of data updates can be seen in MAG, MAKES (including the public REST API), and the Microsoft Academic website.

Secondly, the figure above is currently not reproducible using the publication dates reported by publishers. Instead, to understand when the contents became available for the research community to consume, we have found it necessary to use the online dates rather than the publication dates publishers prefer. For instance, there are papers reported as published in January 2020 that contain references to “COVID-19”, a term that was not decided by the World Health Organization (WHO) until February 11, 2020. On the other hand, some journals have scheduled many COVID-19 articles well into their September issues, even though those articles have already been cited by articles published in March 2020. Such forward references should be rare but have become more common in recent months; they are a legacy of the publication industry that could use an update in the online era. Accordingly, we will add an “online publication date” to every article alongside the existing publication date reported by the publisher as soon as the new property passes our quality control evaluations.
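A forward reference can be detected mechanically by comparing the dates of the citing and cited papers. The sketch below is a minimal illustration with made-up records; the field names `published` and `online` are hypothetical stand-ins for the publisher-reported date and the online date, not MAG column names.

```python
from datetime import date

# Made-up records for illustration: paper id -> publisher date and online date.
papers = {
    "p1": {"published": date(2020, 3, 10), "online": date(2020, 3, 10)},
    "p2": {"published": date(2020, 9, 1), "online": date(2020, 2, 20)},
}
citations = [("p1", "p2")]  # (citing paper, cited paper)


def forward_references(papers, citations, date_field):
    """Return citations whose cited paper is dated after the citing paper."""
    return [
        (citing, cited)
        for citing, cited in citations
        if papers[cited][date_field] > papers[citing][date_field]
    ]


print(forward_references(papers, citations, "published"))  # uses publisher dates
print(forward_references(papers, citations, "online"))     # uses online dates
```

In this toy case the citation looks like a forward reference under publisher dates but not under online dates, which is the kind of discrepancy an online publication date is meant to resolve.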

Thirdly, as proud as we are of our concept recognition capability, we have to recognize the technology is not yet perfect. This sample code on our GitHub page illustrates a way to conduct semantic search and keyword matching into MAG, and for the past several snapshots, the concept-only retrieval consistently covers only about 85% of the results. Additionally, MAG has yet to recognize all chemical compounds and pharmaceutical products, such as many drugs that were designed as treatments for other diseases but are being considered for COVID-19 clinical trials. To compensate for the 15% shortfall in semantic search and the missing concepts, we have quickly added rudimentary keyword search capabilities to the Microsoft Academic website. Effective immediately, website users can search phrases in quotes (e.g., “novel coronavirus” “china”, very useful in finding COVID-19 papers before official terminologies were widely adopted) and expect such queries to retrieve articles with literal matches in the title or the abstract. Harmonizing the semantic and keyword search experience is not trivial, and we will have a separate blog on this subject in the coming week.

Finally, as a requirement for the CORD-19 dataset, we have taken as a credible source the WHO’s paper collection that, in addition to research articles written in English, includes news, commentaries and, most importantly, non-English publications that would have otherwise been excluded from MAG (see our recent article on this subject). A sizeable number of non-English articles included in WHO’s collection are from Chinese journals that provide high-quality English translations of the title and the abstract. We have started working with our colleagues in China to develop scalable means to include these journals and their publications in MAG.

As for news and other non-scholarly articles, they are slipping through the principal component analysis because of their strong connections to the two fields of study COVID-19 and SARS-CoV-2. Based on this observation, we are excluding the relations to fields of study from consideration in the principal component analysis starting this week. A preliminary analysis indicates this algorithmic change can filter out more than half a million articles previously included in MAG, mostly from university websites, that are not published in other peer-reviewed venues. We do not expect the removal of this type of article to have a dramatic impact on analytics based on MAG, but in the coming weeks, we will continue to monitor the effect of this new tweak and run additional experiments.

Suffice it to say this pandemic has profoundly impacted our lives in many ways. We hope this blog finds you safe and healthy, and happy researching!

Microsoft Academic resources and their application to COVID-19 research http://approjects.co.za/?big=en-us/research/articles/microsoft-academic-resources-and-their-application-to-covid-19-research/ Mon, 16 Mar 2020 19:13:33 +0000

This post will be updated as examples and data are added

Given the circumstances surrounding the COVID-19 pandemic, we would like to provide an overview of the services that we provide, explain the focus of each, and provide working examples of how best to use our data to help generate insight from coronavirus-related scholarly communications.

We would also like to recognize that we are partnering with the Allen Institute for AI, Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for Security and Emerging Technology (CSET), and the National Library of Medicine (NLM) at the National Institutes of Health to produce an open research dataset of scholarly literature about COVID-19, SARS-CoV-2 and the coronavirus group.  Please visit the link below to access the dataset that was released on Semantic Scholar today:

COVID-19 Open Research Dataset (CORD-19)

Over the past week our team has worked to update our graph with the most recent publications regarding COVID-19.  With the support of Bing, we will double our MAG update frequency as well as publish a side stream linking WHO and PubMed publication IDs to MAG IDs.  Side stream source code

For anyone who would like to use the releases of our available services for further investigation, we provide the following summaries and examples:

The Microsoft Academic Graph is a heterogeneous graph of academic data.  The content of the graph is distributed in text files when new builds are created via a subscription.  This graph can be used as is or merged with other public or private data sets.

Primary use cases:  Data mining, long running analytic processing for academic analytics or Business Intelligence.

The Microsoft Academic Knowledge Exploration Service (MAKES) and its predecessor the Project Academic Knowledge API were created to serve a need for indexing and rapid data retrieval from the MAG data set.

Primary use cases:  Fast, top N entity retrieval, on-line scenarios such as dashboards or search applications.

The Microsoft Academic website is for the community to use as both a research tool and an example of what can be built with MAG data and MAKES.  We have made our own modifications, just as consumers of MAG and MAKES may choose to do.  In this way we use it as an experimental platform to test hypotheses and new features for both MAG and MAKES.

Primary use cases: Finding relevant research and analytics, test driving an end-to-end solution of the services we provide.

The Microsoft Academic Graph

Our approach to generating the data

At current release, the Microsoft Academic Graph (MAG) contains over 233 million publications and their related academic entities, e.g. authors, publishing venues, associated concepts, etc.  Microsoft Research has the benefit of our partnership with Bing to crawl the internet to discover research content from around the world and update our graph regularly.  Using the power of cloud computing, we have built a pipeline where Bing data and other sources are filtered and analyzed.  Using AI techniques, we disambiguate, conflate, and apply rank and taxonomy to entities in the graph. As stated in our recent paper: “The project pushes the boundary of machine cognition technology by deploying software agents trained with natural language understanding capabilities to continuously scavenge the Web for research artifacts and, from them, extract up-to-date academic knowledge into a graph based representation called Microsoft Academic Graph (MAG)”.  Optimization of this pipeline has also reduced the time needed to create and validate our graph.  Currently, we can produce a new version of MAG every week.  For a great overview of how MAG is laid out and built, see our recently published paper.

What is contained in the dataset and what makes it unique?

Web scale data collection

As stated, our Bing partnership brings the knowledge of the entire web into our graph.  Combining this with the power of cloud computing allows for rapid iteration over immense quantities of data (tens of billions of raw data points) using intelligent agents to objectively curate the graph.

Advanced entity conflation and disambiguation

Our agents are trained to understand and reason over partial and noisy information from documents in diverse data sources. They recognize and assemble semantic objects in the academic domain (e.g. scholarly publications, authors, affiliations, conferences, journals, and fields of study) into the cohesive and evolving knowledge graph (MAG).

Word/Phrase embeddings

Paper citation networks are well known to be sparse, have human bias, and be “cliquey” as researchers often cite papers by their advisors, friends and peers. It is rare to see inter-disciplinary citations even though researchers in disparate disciplines are often solving the same underlying scientific problem. MAG mitigates sparsity and clique issues in the graph by enriching paper-to-paper links across disciplines via a paper similarity system. This system not only uses the citation graph but also the content of each paper via trained language embeddings, as outlined in this paper. The word embedding and citation-based paper recommendations can be found here, in MAG.

Trained word embeddings are also used to generate embeddings for our fields of study, allowing us to quickly tag papers with relevant concepts based on their content. We provide the ability for users to tag their own text documents with our fields of study, using our trained language embeddings, as part of the Microsoft Academic Language Similarity API. This API is made available to anyone upon request, alongside weekly MAG updates.

Field of study tagging and taxonomy learning

MAG is built and organized using field of study tagging and taxonomy learning, giving consumers of the graph the ability to sub-divide the data.  This is done through concept discovery, concept-document tagging and concept hierarchy generation.  A detailed explanation of this process is provided in our recent paper.

In MAG, the fields of study can be found in this stream and their parent-child relationship can be found in this one. The corresponding UMLS Ids and source URLs are available in this stream.

See an example of using fields of study below.

Predictive static ranking: Saliency

MAG computes saliency using reinforcement learning (RL) to assess the importance of each entity in the coming years.  As MAG sources content from the Web, saliency plays a critical role in telling the difference between good and poor content.  The RL algorithm is programmed to predict future citations.  Based on the publication and citation activities surrounding the novel coronavirus, MAG has learned COVID-19 related articles are most likely to be cited in the coming years.  See our recent blog post for more details.

Saliency is available in MAG as the “rank” attribute.

Multi-sense similarities

MAG is, by nature, a heterogeneous graph with different types of entities and relations, in which various structural relations correspond to different semantic similarities. For example, two fields of study can be similar in different senses: they might often be studied together (co-appear in the same papers or venues) or co-occur with all types of entities in the graph. Therefore, we learn multi-sense network representations for entities in MAG and make the Network Similarity (NS) package publicly available. By using the NS package, we can reveal the most similar fields to “COVID-19” and “SARS-CoV-2” under different senses.

See an example of Multi-sense similarities below.

Examples:

Impact of COVID-19 on the Computer Science Research Community – Our team’s research into the impact COVID-19 may have on conferences, authors, and the Computer Science field of study. Source code examples: https://github.com/microsoft/mag-covid19-research-examples/tree/master/src/MAG-Samples/impact-of-covid19-on-the-computer-science-research-community

How I built a list of coronavirus-related research papers using the Microsoft Academic Graph (opens in new tab) – Medium post by Adam Day outlining the process that was used to generate a list of COVID-19 related papers.

Multi-sense network similarity examples:

We show the fields of study most similar to COVID-19 and SARS-COV-2 under three different senses in MAG, powered by the NS package. Take COVID-19 as an example. Under the “copaper” sense, the top entities indicate the other fields that are discussed in COVID-19 publications, such as SARS-COV-2, H1N1, Ebola, Nipah, and MERS. Under “covenue”, the most similar entities are fields that are also studied in the journals or conferences in which COVID-19 publications appear, e.g., Infectious Disease Epidemiology, Index case (patient zero), and Middle East respiratory syndrome coronavirus. Finally, under the “metapath” sense, two fields are similar if they co-occur with all other types of entities (papers, venues, affiliations, and fields of study); here the most similar fields include Viral phylodynamics, Middle East respiratory syndrome coronavirus, and Lassa fever. Overall, the three senses yield different sets of similar fields, revealing different perspectives on the entity in focus, i.e., COVID-19 in this case.
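
To make the idea of sense-dependent similarity concrete, here is a small Python sketch. It does not use the NS package: it assumes each sense provides its own embedding per field of study and that cosine similarity is the comparison measure, neither of which is specified in this post. The two-dimensional vectors are made up purely for illustration.

```python
import math

# Hypothetical 2-d embeddings, one table per sense. Real embeddings are
# learned from MAG and have many more dimensions.
EMBEDDINGS = {
    "copaper": {
        "covid-19": [1.0, 0.0],
        "sars-cov-2": [0.9, 0.1],
        "lassa fever": [0.1, 1.0],
    },
    "metapath": {
        "covid-19": [0.0, 1.0],
        "sars-cov-2": [1.0, 0.2],
        "lassa fever": [0.1, 0.9],
    },
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(entity, sense, top_n=1):
    """Rank the other entities by similarity to `entity` under one sense."""
    vectors = EMBEDDINGS[sense]
    target = vectors[entity]
    scored = [
        (cosine(target, vec), name)
        for name, vec in vectors.items()
        if name != entity
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_n]]


for sense in EMBEDDINGS:
    print(sense, most_similar("covid-19", sense))
```

With these toy vectors the nearest neighbor of “covid-19” changes from one sense to the other, which is the behavior the paragraph above describes.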

Multi-Sense COVID-19 and SARS-COV2 results

 

Source code example:  https://github.com/microsoft/mag-covid19-research-examples/tree/master/src/MAG-Samples/NetworkSimilaritySample (opens in new tab)

Fields of study stamping examples:

Papers Similar to: The role of absolute humidity on transmission rates of the COVID-19 outbreak (opens in new tab)

Microsoft Academic provides highly related papers as recommended reading to this paper even though this paper currently has no citations. It therefore uses only trained word embeddings and the content of this paper. Notice the ability of the system to pick up relevant papers based on the broad concept of “climate and how it affects transmission of viruses”. Word embeddings allow the system to relate terms like “humidity”, “climate”, “tropical”, and “weather” together as well as “influenza”, “coronavirus”, and “virus”.

In the USQL sample code below, we show how Fields of Study are used along with publication title and abstract term matching to find papers about COVID-19.

Source code for this example: https://github.com/microsoft/mag-covid19-research-examples/tree/master/src/MAG-Samples/CoronavirusPapersSample (opens in new tab)
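
The linked sample is written in USQL. As a rough illustration of the same idea in Python, the sketch below flags a paper as COVID-19 related if it is tagged with a relevant field of study or if its title or abstract mentions a keyword. The record layout, the keyword list, and the function name are assumptions made for this sketch and are not taken from the sample; the two field Ids are the ones Microsoft Academic reports for the covid-19 and sars-cov-2 concepts.

```python
# Field of study Ids for the covid-19 and sars-cov-2 concepts.
COVID_FIELD_IDS = {3008058167, 3007834351}

# Hypothetical keyword list for title/abstract matching.
KEYWORDS = ("coronavirus", "covid-19", "sars-cov-2")


def is_covid_paper(paper):
    """True if the paper matches by field of study or by title/abstract terms."""
    if COVID_FIELD_IDS & set(paper.get("field_ids", ())):
        return True
    text = (paper.get("title", "") + " " + paper.get("abstract", "")).lower()
    return any(keyword in text for keyword in KEYWORDS)


papers = [
    {"title": "The role of absolute humidity on transmission rates of the COVID-19 outbreak"},
    {"title": "An untitled preprint", "field_ids": [3007834351]},
    {"title": "A survey of graph databases", "abstract": "No virology here."},
]
print([is_covid_paper(p) for p in papers])
```

Combining both signals helps catch early papers that predate the field of study tags as well as tagged papers whose titles never use the keywords.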

 

Microsoft Academic Knowledge Exploration Service (MAKES) / Project Academic API

MAKES was created in response to our customers’ requests for a non-rate-limited version of our Project Academic Knowledge API. In its basic form, MAKES is a self-hosted REST API leveraging an index of all the entities in the Microsoft Academic Graph (MAG). When you subscribe to MAKES, the required components are delivered to your Azure subscription whenever a new version of MAG is created, typically once every 1 to 2 weeks. A provided tool can then be run to automatically provision MAKES instances in your Azure account.

Examples:

MAKES Documentation (opens in new tab) – Documentation for self-hosting a MAKES API

Project Academic Knowledge Documentation (opens in new tab) – Documentation for the Microsoft hosted API

Querying MAKES / Project Academic API to retrieve papers about coronavirus – The following query can be given to MAKES / Project Academic API to produce a list of papers associated with coronavirus.  This selects publications based on the field of study group (coronavirus), family (coronaviridae), genus (betacoronavirus), species (SARS-COV-2/COVID-19) or title/abstract matches on those keywords.

Or(Composite(F.FN=='coronavirus disease 2019'), Composite(F.FN=='severe acute respiratory syndrome coronavirus 2'), Composite(F.FN=='betacoronavirus'), Composite(F.FN=='coronaviridae'), Composite(F.FN=='coronavirus'), W='coronavirus', AW='coronavirus', W='coronaviridae', AW='coronaviridae', W='betacoronavirus', AW='betacoronavirus')
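
Writing such an expression by hand is error-prone, so it can help to generate it. The Python sketch below rebuilds the expression above from two lists and URL-encodes it as an `expr` query parameter. The helper name is hypothetical, and the endpoint to send the query to is deliberately left out; please consult the MAKES or Project Academic Knowledge documentation for the actual request format.

```python
from urllib.parse import urlencode

# Field of study names and keywords taken from the query above.
FIELDS = [
    "coronavirus disease 2019",
    "severe acute respiratory syndrome coronavirus 2",
    "betacoronavirus",
    "coronaviridae",
    "coronavirus",
]
KEYWORDS = ["coronavirus", "coronaviridae", "betacoronavirus"]


def build_expr(fields, keywords):
    """Combine field-of-study, title (W) and abstract (AW) clauses with Or()."""
    clauses = [f"Composite(F.FN=='{name}')" for name in fields]
    for word in keywords:
        clauses.append(f"W='{word}'")
        clauses.append(f"AW='{word}'")
    return "Or(" + ", ".join(clauses) + ")"


expr = build_expr(FIELDS, KEYWORDS)
query_string = urlencode({"expr": expr})  # safe to append to a request URL
print(expr)
```

Adding a new keyword or field of study then only requires extending one of the lists.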

 

Microsoft Academic website

The Microsoft Academic website is updated when new versions of the graph are released, about once a week. The site is designed to provide the top ‘n’ results of search queries through an easy-to-use interface. One advantage of using our website is its improved search functionality, which allows for semantic interpretations of, and suggestions for, your queries.

Microsoft Academic - COVID-19 results

 

In the example above, you can see results from a query for coronavirus disease 2019, the topic for the current novel coronavirus. We understand the topic from the query and populate a card to the right of the search results showing a description along with parent and related topics. Selecting a publication from the list takes you to a details page showing the information we have collected about the publication (publishing venue, authors, institutions, links that we have found to the document on the web, and any topics that have been tagged for the publication). In the bottom section of the page, we show the publications that are referenced, the publications that we have found to cite the publication, and a tab for related publications.

Publication details page example

We also provide analytics for each entity type in our graph (publications, authors, topics, conferences, journals, and institutions). Our analytics pages allow you to search for topics and find the top 100 entities, trend data, and an overview of the distribution of entity types in the graph.

In Summary

The Microsoft Academic team is committed to providing the community with any data that can help stem the COVID-19 advance.  We hope that this blog post has offered some guidance and it will be updated as appropriate when details or related information changes.

 

Appendix A – Links to Microsoft Academic resources

Microsoft Academic Project (opens in new tab)

MAG Documentation and example code (opens in new tab)

MAKES Documentation and examples (opens in new tab)

Project Academic Documentation and examples (opens in new tab)

White House Office of Science and Technology Policy (opens in new tab)

 

Appendix B – Updates to this post

  • 3/16 – Original publication
  • 3/18 – Added links to source code examples and data files
  • 3/20 – Added link to sample code to search MAG for COVID-19 papers using Fields of Study along with publication title and abstract term matching

The post Microsoft Academic resources and their application to COVID-19 research appeared first on Microsoft Research.

]]>
COVID-19 Highlights the Wisdom of the Academic Crowd http://approjects.co.za/?big=en-us/research/articles/covid-19-highlights-the-wisdom-of-the-academic-crowd/ Fri, 13 Mar 2020 17:22:06 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=642972 For more than a quarter century, we have lived a life where our online activities are closely monitored, collected, and exploited by internet giants to offer useful services without charging fees. Today, it is unfathomable to have to pay for Apple’s Facetime, Google’s Hangout, Facebook’s Messenger or What’s App, etc., the same way that we have to pay heftily to our wireless providers for the similar services in phone calls and text messages. These internet businesses prosper because the user data are worth a lot in our connected lives. Aside from creating detailed user profiles to place more targeted advertisements, the data holds the keys to identifying trends and project needs for new products and services.

The post COVID-19 Highlights the Wisdom of the Academic Crowd appeared first on Microsoft Research.

]]>
For more than a quarter century, we have lived a life where our online activities are closely monitored, collected, and exploited by internet giants to offer useful services without charging fees. Today, it is unfathomable to have to pay for Apple’s FaceTime, Google Hangouts, Facebook’s Messenger or WhatsApp, etc., the same way that we pay our wireless providers heftily for similar services in phone calls and text messages. These internet businesses prosper because user data are worth a lot in our connected lives. Aside from creating detailed user profiles to place more targeted advertisements, the data hold the keys to identifying trends and projecting needs for new products and services.

One example is Google Flu Trends (GFT), which made headlines around the world in February 2013. It was reported that web search queries, with their timestamps and locations, could serve as a good indicator of flu epidemics. Unfortunately, it was later found that GFT’s predictions often deviated widely from the Centers for Disease Control and Prevention (CDC) data, even though GFT was specifically trained on CDC reports. Nevertheless, the “wisdom of the crowd” hidden in search queries is certainly valuable, and a recent approach finds it beneficial to combine search data with electronic health records to improve prediction accuracy.

To be sure, web search data are tricky to use correctly, especially for reliably tracking epidemics. At the time of writing, COVID-19 is rampaging around the world, with Italy and China both having to take drastic measures, such as locking down cities and closing schools, to slow down the infections. In Microsoft’s backyard, the city of Kirkland, made famous by Costco as its store brand, has seen more than 25 deaths with worrisome evidence of community spread since the first confirmed case was reported on January 21st, 3 weeks after the outbreak was noticed in Wuhan, China, and one day after China’s CDC declared an emergency.

According to Google Trends, however, search volume about coronavirus was insignificant until the day China declared the emergency, and the interest subsided within 10 days. For the first three weeks of February, the query volume continued to drop. It was not until February 21st, 10 days after the official name COVID-19 was announced by the World Health Organization (WHO), that the search volume started to increase again, as can be seen in the figure below. To ensure the trend plot is interpreted correctly, we contrast the search for “coronavirus” with “google.” The latter query is likely the result of a portion of internet users typing “google” into their web browser address bar, indicating a search intent. As such, it tracks daily search activity and shows the cyclic nature of search queries. The activity-normalized curve for “coronavirus” is shown as the dotted line against the secondary axis in the figure.
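
The post does not spell out how the activity-normalized curve is computed. One plausible reading, sketched below in Python, is to divide the daily volume for “coronavirus” by the daily volume for the baseline query “google”, so that weekly cycles in overall search activity cancel out. The numbers are invented for illustration and are not Google Trends data.

```python
def activity_normalize(topic_volume, baseline_volume):
    """Divide each day's topic volume by that day's baseline volume."""
    return [
        topic / baseline if baseline else 0.0
        for topic, baseline in zip(topic_volume, baseline_volume)
    ]


# Hypothetical daily volumes: the third day is a weekend with half the activity.
coronavirus = [10, 20, 5]
google = [100, 100, 50]

print(activity_normalize(coronavirus, google))
```

In this toy series the raw volume on the third day drops by half, but the normalized value matches the first day, showing that the drop reflects lower overall activity rather than lower interest.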

Google Trends on Coronavirus

 

In the meantime, the research community had no illusions about the danger this novel coronavirus could pose to the world. Articles sounding the alarm began to appear in journals in the second week of January, one full week before China’s emergency declaration. Scientific activities through January 20th include the events highlighted below:

 Date  Events
 12/27/2019  As recounted in this JAMA paper, China CDC was first alerted to 4 cases of unusual pneumonia in Wuhan, China.
 12/31/2019  Wuhan’s health commission disclosed that 27 people, all having visited a local seafood market, had developed symptoms of a viral pneumonia. The public was advised to avoid crowded areas with poor air circulation and to wear masks when going out, but was told there was no reason to be alarmed. WHO was notified of the outbreak on the same day and the seafood market was closed the next day.
 1/7/2020  Chinese authorities conducted genome sequencing and identified that the disease was caused by a novel coronavirus after ruling out previously known coronaviruses that caused SARS, MERS, and many others.
 1/9/2020  WHO announced the discovery of a novel coronavirus by Chinese authorities, but stated the virus “does not transmit readily between people”.
 1/12/2020  China reported the first death out of 41 confirmed cases and shared the genetic sequences with WHO so that other countries could develop diagnostic kits.
 1/14/2020  A paper published in the International Journal of Infectious Diseases uses the term 2019-nCoV to refer to the novel coronavirus. 7 of the 12 authors are members of the Pan-African Network on Emerging and Re-emerging Infections funded by European Horizon 2020.
 1/15/2020  WHO reported the first confirmed case in Japan and suggested the possibility of human-to-human transmission after a few cases with no link to the seafood market were found in China.
 1/17/2020  An article in Science cautioned about the spread of the virus after a tourist in Thailand was confirmed to be infected. Family Practice News reported that 3 US airports started screening travelers from Wuhan, China, as soon as the US CDC introduced the measure.
 1/18/2020  The Journal of Hospital Infection made available online this paper outlining measures to prevent hospital outbreaks of the novel coronavirus. The article is officially scheduled to appear in the journal’s March 2020 issue.
 1/19/2020  A paper proposing a mathematical model of the novel coronavirus is published on bioRxiv. The model assumes animal-to-human transmission.
 1/20/2020  Both BMJ and Science reported surging cases of the virus infection in this and this article. Web searches on this topic started to appear according to Google Trends.

 

With editors prioritizing the review and publication of COVID-19 papers, the volume of publications on this topic rose quickly. Furthermore, almost all academic publishers have dropped their paywalls on COVID-19 papers, accelerating the dissemination of scientific discoveries and the citations to these articles. In total, more than 920 articles were published in the crucial month between January 14 and February 14, one week ahead of the turning point in Google Trends. The collective responses of the scholarly community to COVID-19 are captured in the remarkable publication node growth and the citation edges in Microsoft Academic Graph (MAG), as can be seen below:

 

COVID-19 Publication counts

These behavioral changes have significant impacts on MAG. First, the fast-appearing publications have led our machine readers to conclude that not one but two new concepts have emerged, under the existing concepts “coronavirus” (MAG Id=2777648638) and “infectious diseases” (MAG Id=524204448). Based on the operations described in our recent paper, the new concept under coronavirus is called “Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)”, or simply “sars-cov-2” (MAG Id=3007834351), and is mapped to this Wikipedia page. The other new concept, an infectious disease, is unsurprisingly called “Coronavirus disease 2019 (COVID-19)”, or simply “covid-19” (MAG Id=3008058167), and is mapped to the corresponding Wikipedia page. Note that the subtle distinction between a disease and its cause is an ongoing issue for our machine learning algorithm. As described in the paper, the natural outcome based on the language and network similarities is to follow the classifications from the authoritative webpages which, in this case, are the two Wikipedia articles that make the distinction. However, as described in our recent post, we also program our system to be as consistent as possible with the Unified Medical Language System (UMLS), where such distinctions are sometimes not made (e.g., the “HIV/AIDS” concept originating from UMLS lumps the disease AIDS and its cause HIV together). The differences among the authoritative sources we use lead to inconsistencies in our taxonomy. We are eagerly waiting to see how our system will resolve this issue by itself after more publications and citations are observed in the future.

Secondly, the rapid rise of citation activities among these papers has also boosted their saliency scores, which are an indicator that predicts the likelihood of papers receiving citations in the next 5 years. As explained in this and this paper, saliency uses the same reinforcement learning (RF) algorithm as in AlphaGo and video game playing systems to anticipate the next moves from humans. Instead of sending the RF learning agents into parallel universes to play thousands of games simultaneously, here we send the RF agents to travel back and forth in time to acquire the best strategy for predicting future citations. In this sense, saliency is a leading indicator of the research impact of any given paper rather than a lagging one such as the citation count. Due to the flurry of citations among them, a mere month is all it takes for the COVID-19 papers to dominate the search results for the query “coronavirus china”, even though they have yet to receive their full recognition and their citation counts are much lower than those of other coronavirus papers published years ago. To see the differences between ranking by saliency and by citation counts, snapshots of the search results ranked by saliency (relevance) and by citation counts, based on data up until February 14, 2020, are shown below. Note that the term “COVID-19” was adopted by WHO only on February 11, the same day the Coronavirus Study Group published this naming paper on bioRxiv proposing to use SARS-CoV-2 instead of 2019-nCoV. Papers accepted for publication prior to this date cannot be retrieved with these keywords. However, these early publications are detected as discussing topics about “coronavirus” and “china” (MAG Id = 191935318). The query is thus useful in finding these early publications.
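
The contrast between the two orderings can be sketched in a few lines of Python. The records below are invented, and the sketch assumes that saliency is exposed as a “rank” value where a smaller number means higher saliency; please check the MAG documentation for the exact definition before relying on this.

```python
# Hypothetical paper records; titles, ranks and citation counts are made up.
PAPERS = [
    {"title": "2020 outbreak report", "rank": 15000, "citations": 12},
    {"title": "2005 coronavirus review", "rank": 19000, "citations": 900},
    {"title": "2012 MERS case study", "rank": 18000, "citations": 300},
]

# Assumption: a lower rank value means higher saliency (leading indicator).
by_saliency = sorted(PAPERS, key=lambda paper: paper["rank"])

# Citation count is a lagging indicator: older papers have had time to accrue.
by_citations = sorted(PAPERS, key=lambda paper: paper["citations"], reverse=True)

print([paper["title"] for paper in by_saliency])
print([paper["title"] for paper in by_citations])
```

In this toy data the newest paper leads the saliency ordering while sitting last by citation count, mirroring the two snapshots described above.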

Saliency rank of coronavirus publications

Citation count of coronavirus publications

So how effective are the search and click behaviors underlying Google’s and Bing’s rankings? Our experience shows that search behaviors are useful in capturing general consumer interests. However, for cases such as pandemics that require expert knowledge, consumer behaviors are in fact a lagging indicator. For this reason, and to respect the privacy of our users, Microsoft Academic is a search engine that does not include private user data such as browsing and clicking activities in the search ranking. We can, however, compare the rankings in the equivalent Google Scholar to understand the effectiveness of using search behaviors. Indeed, when the search volume on the topic of “coronavirus” was not yet sufficient, older papers that had accumulated larger citation counts were ranked higher. By February 18, 2020, however, some papers about COVID-19 had started to make it into the top 10 search results. To be fair, by this time most papers had included the terminology “2019-nCoV” to clarify which novel coronavirus their contents are about. As the screenshots below show, Google Scholar did an excellent job for this query.

The field of study known as infodemiology, established in 2002 based on the insight of Gunther Eysenbach’s groundbreaking paper, has been studying the utility of search queries and social media activities in capturing public health issues in a more accurate and timely manner. To the best of our knowledge (based on this query), we are the first to observe scholarly communication activities as a potential resource in tracking epidemics. More research is needed, but one thing is for sure: without scientists actively submitting new findings and editors prioritizing their publication, we wouldn’t have been able to report this useful dataset. The credit should really go to the collaborative instincts and the scholarship of the research community that we are very proud to be a member of.

Happy researching!

Google Scholar papers on Coronavirus

Google Scholar papers on Coronavirus


]]>
Impact of COVID-19 on the Computer Science Research Community http://approjects.co.za/?big=en-us/research/articles/impact-of-covid-19-on-computer-science-research-community/ Tue, 10 Mar 2020 18:07:55 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=642132 Following the Microsoft Academic team's goal to “help researchers stay on top of their game”, we are providing an analysis of the COVID-19 impact on the computer science (CS) research community to help enable conference organizers and institutions to respond accordingly and manage the impact.

The post Impact of COVID-19 on the Computer Science Research Community appeared first on Microsoft Research.

]]>
As of March 10th, 2020, the novel coronavirus (COVID-19) has infected ~117k people and been responsible for the deaths of over 4k worldwide. The World Health Organization (WHO) has not yet classified the COVID-19 outbreak as a pandemic; however, COVID-19 has had a significant impact on individual lives and the global economy. Following the Microsoft Academic team’s goal to “help researchers stay on top of their game”, we are providing an analysis of the COVID-19 impact on the computer science (CS) research community to help conference organizers and institutions respond accordingly and manage the impact.

Our analysis shows that:

  • Author participation in both CS and AI conferences from the COVID-19 impacted regions has increased over the past ten years, from about 6% to over 20% in 2019.
  • The disease could impact one-fifth of conference attendees in both AI and CS (21.3%).
  • Most CS conferences occur between May and early September, so controlling COVID-19 by May would help keep the impact on CS research to a minimum.

Background

  • As of this article’s date, the US Centers for Disease Control and Prevention (CDC) has issued a travel warning and alert recommending that travelers avoid or postpone nonessential travel to areas including China, South Korea, Iran, Italy, and Japan. A handful of airlines have reduced or suspended flights to these areas. Also, some cities have imposed lockdowns to contain the virus.
  • On the other hand, CS conference publications rely heavily on authors’ physical presentations during the conference. Since the travel interruption could prevent researchers from attending conferences and thus affect the publications, this analysis considers the areas above as COVID-19 impacted.
  • In computer science, conference publication is often preferred over other forms of publication for its higher visibility, greater impact, and faster turnaround time. COVID-19 is most likely to affect the CS research community through conference publications, hence our focus.

Method

  • We have picked the 105 most impactful CS conferences for this analysis. Among these 105 CS conferences, 32 are related to artificial intelligence (AI) and are used to analyze the COVID-19 impact on AI research.
  • Two scenarios are evaluated for impact:
    • 2020 conferences being hosted in the impacted areas
    • Authors located in the impacted areas. The headquarters location of each author’s last known affiliation is used to determine the author’s location
  • Two indicators are used to estimate the impact:
    • The number of publications for each CS conference in 2019 (2018 if it’s biennial)
    • The share of publications from COVID-19 affected areas for the past 10 years
  • A publication is considered to be from the COVID-19 affected areas if the headquarters of the first author’s affiliation is located in those areas. Please find the discussion of this approach in the section “How to Determine if a Publication is Impacted by COVID-19” below
  • All publication data is sourced from the Microsoft Academic Graph
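
The second indicator, the share of publications from affected areas per year, can be sketched in Python as follows. The input format (one (year, region of the first author’s affiliation headquarters) pair per publication) and the function name are assumptions made for this sketch, and the sample records are invented.

```python
from collections import Counter

# Areas named in the CDC travel warning and alert cited in the Background section.
AFFECTED_AREAS = {"China", "South Korea", "Iran", "Italy", "Japan"}


def affected_share_by_year(publications):
    """Map each year to the fraction of publications from affected areas.

    `publications` is an iterable of (year, first_author_region) pairs.
    """
    totals = Counter()
    affected = Counter()
    for year, region in publications:
        totals[year] += 1
        if region in AFFECTED_AREAS:
            affected[year] += 1
    return {year: affected[year] / totals[year] for year in sorted(totals)}


# Hypothetical records, for illustration only.
sample = [
    (2010, "US"),
    (2019, "China"),
    (2019, "US"),
    (2019, "Japan"),
    (2019, "Canada"),
]
print(affected_share_by_year(sample))
```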

Analysis and Discussion

1. CS and AI Conference Publication Statistics

The graph below (Figure 1a) shows the total number of CS and AI conference publications in 2019. Only the top 20 regions are shown here. The US followed by the EU, China, Japan and Canada had the highest volume of published papers among the 105 selected CS conferences as well as in the 32 AI conferences.

The table below shows the number of 2019 publications and percentages for the CDC warning/alert areas. Regions are categorized according to the US CDC travel risk assessment; please refer to the CDC for the description of each level. China, Iran, South Korea, Italy, Japan and Hong Kong together contributed 21.25% and 21.33% of CS and AI publications in 2019, respectively. This could be the share of authors who cannot attend conferences in 2020 due to the COVID-19 travel interruption in these areas.

To further confirm the rate of publications from impacted areas, we gathered data between 2000 and 2019. As shown in Figure 1b, there is a clear increase in publications from the COVID-19 impacted areas. The impacted publication rate for both CS and AI conferences is above 21% in 2019 and possibly higher in 2020.

2. 2020 CS Conference Publications Impact – by Conference Location

The graph below (Figure 2) shows the estimated impact on 2020 CS conferences hosted in COVID-19 impacted areas. The number of publications from 2019 is used to estimate the number of publications in 2020. The solid blue line shows the accumulated number of impacted publications over time.

3. 2020 CS and AI Conference Publications Impact – by Author Location

The graph below (Figure 3) shows 2020 CS and AI conferences that are hosted outside COVID-19 impacted areas. The number of impacted publications for each conference is estimated by the number of its publications in 2019 (2018 if it’s biennial) from COVID-19 impacted areas. According to our analysis, among the CS conferences scheduled in the coming four months, ICC, IMTC, ICDE and ISCAS have the largest share of publications contributed from the COVID-19 impacted areas (each above 30%). We list the conferences in the next four months with their 2019 publication statistics in Appendix 1 at the end. A majority of the AI conferences in the next four months have a 10% to 20% impact rate based on 2019 data (Appendix 2).

   

4. 2020 CS and AI Conference Publications Impact – Total

The graph below (Figure 4) shows the COVID-19 impact estimates for CS and AI conferences, combining the impact from both conference location and author location. The impact has a similar pattern in CS and AI conferences. Starting from May 2020, the impact increases considerably as many conferences occur during the northern meteorological summer months. If COVID-19 can be contained and the travel interruption is lifted before May 2020, the impact on CS conferences should be minimal. Conversely, if the outbreak situation does not improve by September, there could be a significant impact on the CS research community.
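
One way to combine the two scenarios is sketched below in Python: a conference hosted in an impacted area counts all of its previous-edition publications as impacted, while a conference hosted elsewhere counts only the publications whose first author is in an impacted area. The per-month totals are then accumulated over time. The record fields, function name, and numbers are hypothetical, and the actual analysis may differ in detail.

```python
from collections import defaultdict
from itertools import accumulate


def monthly_cumulative_impact(conferences):
    """Return (month, cumulative impacted publications) pairs in month order."""
    per_month = defaultdict(int)
    for conf in conferences:
        if conf["hosted_in_impacted_area"]:
            # Location scenario: the whole conference is affected.
            impacted = conf["pubs_prev_edition"]
        else:
            # Author scenario: only papers from impacted areas are affected.
            impacted = conf["pubs_from_impacted_authors"]
        per_month[conf["month"]] += impacted

    months = sorted(per_month)
    running_totals = accumulate(per_month[month] for month in months)
    return list(zip(months, running_totals))


# Invented example conferences; these are not figures from the analysis.
conferences = [
    {"month": 4, "hosted_in_impacted_area": True,
     "pubs_prev_edition": 100, "pubs_from_impacted_authors": 30},
    {"month": 5, "hosted_in_impacted_area": False,
     "pubs_prev_edition": 200, "pubs_from_impacted_authors": 40},
    {"month": 5, "hosted_in_impacted_area": False,
     "pubs_prev_edition": 50, "pubs_from_impacted_authors": 10},
]
print(monthly_cumulative_impact(conferences))
```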

5. How to Determine if a Publication is Impacted by COVID-19

As mentioned earlier, we consider a publication to be impacted by COVID-19 if the headquarters of the first author’s affiliation is in one of the affected areas.

We choose the first author’s affiliation location instead of all authors’ because 1) the first author is normally the presenter of the paper and 2) it simplifies our analysis while not significantly impacting the result. A previous paper pointed out a 25-fold increase in international collaborations for scientific development. For the CS publications we analyzed, cross-region collaboration increased from 7.8% to 23.9% over the past 20 years. Although the cross-region rate is high, only 4% of publications have authors from non-impacted regions while the first author is located in an impacted region. Therefore, we believe the first author’s location is a good representation of the publication’s location.

In the case that the first author is associated with multiple affiliations and one affiliation is in affected areas, we count the publication as affected. Only 0.17% of CS publications have first authors associated with multiple affiliations.

Some affiliations, such as Microsoft, have multiple locations. The headquarters location is used in this scenario. We estimate that fewer than 2% of cases fall into this category.
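
The classification rule described in this section can be expressed as a short Python predicate. The input is assumed to be the list of headquarters regions of the first author’s affiliations; the function and variable names are hypothetical.

```python
# Areas named in the CDC travel warning and alert cited in the Background section.
AFFECTED_AREAS = {"China", "South Korea", "Iran", "Italy", "Japan"}


def is_impacted(first_author_hq_regions):
    """True if any of the first author's affiliations is headquartered
    in an affected area; co-authors are deliberately ignored."""
    return any(region in AFFECTED_AREAS for region in first_author_hq_regions)


print(is_impacted(["US", "China"]))  # multiple affiliations, one affected
print(is_impacted(["US"]))           # single unaffected affiliation
```

Using `any` implements the rule that a single affected affiliation is enough to count the publication as affected.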

Conclusion

All the above estimates are based on the most current information we could obtain using MAG. If the current situation continues, the data shows the potential for significant impact on CS conferences unless conference organizers take actions to mitigate the impact.

Some conference organizers have already taken actions, such as:

  • Postpone and change location. INFOCOM 2020, which was originally planned to be in Beijing China in late April, is moving to Toronto Canada in July.
  • Create backup plans. The ACM SIGIR Executive Committee is preparing a backup plan for SIGIR 2020 for potential worst-case scenarios, e.g. moving from Xi’an, China to Toronto, Canada if the WHO extends the “Public Health Emergency of International Concern” by the end of April.
  • Extend deadlines. The Web Conference 2020 is extending the early bird deadline by four weeks to give attendees more flexibility to plan their trip.
  • Enable remote and video presentations. AAAI 2020 enabled authors to present remotely using teleconferencing or by submitting a video presentation.

Additional CS Conference Updates regarding COVID-19

In an effort to help the CS community we will continue to monitor CS conference announcements regarding COVID-19 and provide updates below:

 

Stay healthy and research on!

 

 

Appendix 1

2020 March to June, Non-AI CS Conferences.

Appendix 2

2020 March to June, AI Conferences.


]]>