{"id":661779,"date":"2020-05-21T18:07:35","date_gmt":"2020-05-22T01:07:35","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=661779"},"modified":"2020-05-21T18:07:35","modified_gmt":"2020-05-22T01:07:35","slug":"rationalizing-semantic-and-keyword-search-on-microsoft-academic-2","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/rationalizing-semantic-and-keyword-search-on-microsoft-academic-2\/","title":{"rendered":"Rationalizing Semantic and Keyword Search on Microsoft Academic"},"content":{"rendered":"
Over the past 6 months we’ve been experimenting with a host of changes to Microsoft Academic’s search experience, and now that the last of those experiments has shipped we’re excited to finally discuss them.<\/p>\n
Before we jump in, if you’re interested in a deeper technical analysis of the new capabilities please review the following resources:<\/p>\n
From the initial release of Microsoft Academic in 2016 until 6 months ago, our semantic search algorithm focused on generating results that best matched semantically coherent interpretations of user queries, informed by the Microsoft Academic Graph (MAG)<\/span><\/a>.<\/p>\n To better explain, let’s examine the query \u201ccovid-19 science\u201d. Traditional search engines based on keyword search (e.g., Google Scholar, Semantic Scholar, Lens.org) do an excellent job of retrieving relevant results that have keyword matches for “covid-19” and variations of “science” (science, sciences, scientific, etc.). Our system, however, prefers to interpret \u201ccovid-19\u201d as a shorthand reference (synonym) for the topic “Coronavirus disease 2019 (COVID-19)”<\/span><\/a> and \u201cscience\u201d as the journal “Science”<\/span><\/a>, because MAG suggests this interpretation will turn up more highly cited and relevant papers than treating the query terms as simple full-text (title\/abstract\/body) keywords. This distinction is important, as it allows our semantic search algorithm to leverage semantic inference to retrieve seminal publications that do not strictly contain “covid-19” as a keyword, yet are nevertheless relevant and important.<\/p>\n Even so, we previously allowed for rudimentary keyword matching, namely prefix and literal unigram matching of publication titles (with no support for stemming or spelling corrections). Unfortunately, this limited keyword matching frequently led to the dreaded “no results” page.<\/p>\n For example, assume you were looking for a paper that you thought<\/em> was named “heterogeneous network embeddings via deep architectures”. 
Entering this phrase as a query would result in no suggestions and, if executed on the site, an error page:<\/p>\n <\/p>\n This is a classic case of users knowing what they want but having difficulty getting an algorithm to understand. A common problem with keyword search is that it puts the burden of choosing the \u201cright\u201d keywords for a query squarely on the shoulders of the user.<\/p>\n Now, with our newest search implementation, this same query works exactly as intended:<\/p>\n <\/p>\n To understand why this now works, we first need to explain how our semantic search implementation works.<\/p>\n To put it simply, we’ve changed our semantic search implementation from a strict form where all terms must be understood<\/span> to a looser form where as many terms as possible are understood<\/span>.<\/p>\n The formulation of semantic interpretations (as explained above) remains unchanged, in that the knowledge in MAG still plays the central role in guiding how a query should be interpreted. What has<\/em> changed is that when a portion of a query is thought to refer to full-text properties (e.g., title, abstract), the algorithm can now dynamically switch to a new scoring function that is more appropriate than literal unigram matching and hence less brittle, as the example above shows.<\/p>\n Going a bit deeper, let’s define what “as many terms as possible are understood” means. By its nature, loose semantic query interpretation will produce interpretations with the highest coverage first and fastest; as interpretations with less coverage (i.e., with terms dropped from consideration) are generated, both relevance and speed decrease. The reasons for this are technical and have to do with the search space growing exponentially as the query considered becomes less specific. 
So, in practice, “as many as possible” is better defined as “as many as possible in a fixed amount of time”.<\/p>\n This means that, depending on variables such as query complexity and service load, the more loosely matched results generated within the fixed timeout (aka the result \u201ctail\u201d) can vary between sessions. However, because the interpretations with the highest coverage are generated first, the results they cover (aka the “head”) are very stable.<\/p>\n While this change is a great remedy for queries with full-text matching intent, the loosened interpretation also affects semantic search results: they are no longer as concise as before, because the result “tail” is now longer and includes full-text matches.<\/p>\n As always, an example speaks a thousand words:<\/p>\nOk, maybe a little<\/em> room for interpretation<\/h2>\n