{"id":597991,"date":"2019-07-21T18:06:43","date_gmt":"2019-07-22T01:06:43","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=597991"},"modified":"2019-07-22T10:36:34","modified_gmt":"2019-07-22T17:36:34","slug":"learning-web-search-intent-representations-from-massive-web-search-logs","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/learning-web-search-intent-representations-from-massive-web-search-logs\/","title":{"rendered":"Learning web search intent representations from massive web search logs"},"content":{"rendered":"

\"\" (opens in new tab)<\/span><\/a><\/p>\n

Have you ever wondered what happens when you ask a search engine to search for something as seemingly simple as "how do you grill salmon"? Have you found yourself entering multiple searches before arriving at a webpage with a satisfying answer? Perhaps it was only after finally entering "how to cook salmon on a grill" that you found the webpage you wanted in the first place, leaving you wishing search engines simply had the intelligence to understand that when you entered your initial search, your intent was to cook the salmon on a grill.

Microsoft has taken a step toward providing a deeper understanding of web search queries with Microsoft Generic Intent Encoder, or MS GEN Encoder for short. The neural network maps queries with similar click results to similar representations, enabling it to capture what people expect to see and want to click as a result of a specific search, as opposed to just a query's semantic meaning. With this technology, search engines won't only recognize that "how do you grill salmon" and "how to cook salmon on a grill" are the same, but will also understand that while you may enter "miller brain disease," results for "miller syndrome lissencephaly" would be equally relevant.

MS GEN Encoder, which was trained on hundreds of millions of Bing web searches, is currently being used in the Microsoft search engine, and we're thrilled to announce that we're making the functionality of the technology available to academic researchers as an Azure service. We hope such access, which is being overseen by program manager Maria Kang and software engineer Zhengzhu Feng, will help accelerate research in the academic community by allowing researchers to tap into the power of users' behavioral data provided by the large-scale search logs MS GEN Encoder leverages.

\"In

(opens in new tab)<\/span><\/a> In weak supervision, if two queries in the logs result in the same document, they\u2019re considered to have the same or similar intent. This may not always be true, which is why it\u2019s a \u201cweak\u201d label.<\/p><\/div>\n

MS GEN Encoder and the challenge of intent

Understanding web search intent, that is, what people want to see and will click on, requires a deep knowledge of both web content and a person's informational needs based on their search. For instance, a person may choose a phrase like the earlier "miller brain disease" to find information on the condition, while the author of a webpage may use the expression "miller syndrome lissencephaly." This "vocabulary mismatch" problem, first identified and described in the 1980s by Microsoft Technical Fellow and Deputy Managing Director Susan Dumais and her colleagues, before Dumais joined Microsoft, is due primarily to the high variability and flexibility language gives us to say the same thing in many different ways.

To help overcome this and other semantic challenges, we and our co-authors Hongfei Zhang, Xia Song, Nick Craswell, and Saurabh Tiwary turned to deep learning methods to train MS GEN Encoder to identify the intent behind the language used in search queries and to learn a representation of each query such that similar intents are mapped to similar embeddings.

A two-phase training strategy

To appropriately model search intent, we deploy a two-phase training strategy. In the first phase, weak supervision, we leverage large-scale click signals in Bing search logs as an approximation of a person's search intent and train a recurrent neural network model to map search queries that lead to clicks on the same URLs close together in the embedding space. In other words, different search queries that result in the selection of the same URLs are interpreted as likely looking for the same results despite differences in word choice, and are mapped closer together. This type of weak supervision reduces the need for manual labeling and is useful for learning a very rich model from data available at scale, the type of interaction data we have in abundance from Bing search logs and which doesn't contain people's personal data.
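To make the co-click idea concrete, here is a minimal sketch of this weak-supervision phase, not Microsoft's actual implementation: queries that led to clicks on the same URL form positive pairs, other queries in the batch serve as negatives, and a small GRU encoder is trained so that co-clicked queries end up close in the embedding space. The vocabulary size, dimensions, temperature, and loss are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryEncoder(nn.Module):
    """Encodes a tokenized query into a unit-length intent vector."""
    def __init__(self, vocab_size=10000, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):                    # token_ids: (batch, seq_len)
        _, last_hidden = self.gru(self.embed(token_ids))
        return F.normalize(last_hidden[-1], dim=-1)  # cosine-ready embeddings

def coclick_loss(encoder, queries_a, queries_b, temperature=0.05):
    """queries_a[i] and queries_b[i] clicked the same URL (weak positive);
    all other pairings in the batch act as in-batch negatives."""
    za, zb = encoder(queries_a), encoder(queries_b)
    logits = za @ zb.t() / temperature               # diagonal = positive pairs
    targets = torch.arange(za.size(0))
    return F.cross_entropy(logits, targets)

encoder = QueryEncoder()
qa = torch.randint(1, 10000, (8, 12))  # e.g. "how do you grill salmon"
qb = torch.randint(1, 10000, (8, 12))  # e.g. "how to cook salmon on a grill"
loss = coclick_loss(encoder, qa, qb)
loss.backward()
```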

In the second phase of training, manually labeled data is introduced in a multi-task learning setting to extend the generalization ability of MS GEN Encoder to unseen search queries. In this phase, the encoder is trained on datasets of query or question pairs manually labeled by human annotators as having or not having similar search intents. These additional tasks both steer the model toward greater generalization and provide human oversight of the semantics encoded by the neural network.
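As a hedged sketch of this second phase, reusing the QueryEncoder from the block above: the same encoder can be fine-tuned on human-labeled pairs alongside the weak co-click signal. The binary "same intent" head, the logit scaling, and the simple loss summation below are illustrative assumptions, not the paper's exact multi-task configuration.

```python
import torch
import torch.nn.functional as F

def labeled_pair_loss(encoder, queries_a, queries_b, labels):
    """labels[i] = 1.0 if annotators judged the pair to share an intent."""
    sim = (encoder(queries_a) * encoder(queries_b)).sum(-1)  # cosine of unit vectors
    return F.binary_cross_entropy_with_logits(sim * 5.0, labels)  # scaled into a logit

# One combined step: the human-supervised task steers the same embedding
# space that the weak co-click signal shaped in the first phase.
labels = torch.randint(0, 2, (8,)).float()
loss = coclick_loss(encoder, qa, qb) + labeled_pair_loss(encoder, qa, qb, labels)
loss.backward()
```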

\"MS

(opens in new tab)<\/span><\/a> MS GEN Encoder\u2019s recurrent neural network architecture uses a hybrid character and word embedding to address never-before-seen search terms, whether they be a result of a misspelling, new concepts, or words borrowed from a different language.<\/p><\/div>\n

Alleviating tail sparsity

The power of the model lies in how it handles very rare, or tail, search queries. Search engines encounter a large number of queries that either are searched infrequently or have never been searched at all because of the variety in language: misspelled words, rare concept names, product IDs, new trending topics, and ever-evolving words borrowed from different languages. This phenomenon is known as a long tail distribution, and it can lead to poor search results.

To handle terms that haven't been seen before in the model's training data, we designed a new recurrent neural network architecture that uses a hybrid character and word embedding as the first layer within a more common multilayer sequential modeling architecture. This hybrid embedding gives the model the flexibility to manage language variations and unseen terms. For example, the misspelled word "restarant" and the word "restaurant" are mapped to similar embeddings by MS GEN Encoder, as they share similar character sequences and also lead to clicks on similar web content.
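Below is a minimal sketch of what such a hybrid character/word embedding layer could look like, under the assumption that each token's vector combines a word-level lookup with a character-level composition, so a misspelling like "restarant" can still land near "restaurant." The dimensions, the character-level GRU, and the mixing layer are illustrative choices, not the exact GEN Encoder first layer.

```python
import torch
import torch.nn as nn

class HybridEmbedding(nn.Module):
    """Per-token embedding built from both word identity and character sequence."""
    def __init__(self, word_vocab=10000, char_vocab=128, dim=128):
        super().__init__()
        self.word_embed = nn.Embedding(word_vocab, dim, padding_idx=0)
        self.char_embed = nn.Embedding(char_vocab, dim, padding_idx=0)
        self.char_gru = nn.GRU(dim, dim, batch_first=True)
        self.mix = nn.Linear(2 * dim, dim)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, chars_per_word)
        b, s, c = char_ids.shape
        _, h = self.char_gru(self.char_embed(char_ids.view(b * s, c)))
        char_vec = h[-1].view(b, s, -1)           # one vector per token from its characters
        word_vec = self.word_embed(word_ids)      # uninformative for out-of-vocabulary ids
        return self.mix(torch.cat([word_vec, char_vec], dim=-1))

layer = HybridEmbedding()
tokens = torch.randint(1, 10000, (2, 5))
chars = torch.randint(1, 128, (2, 5, 10))
per_token = layer(tokens, chars)                  # feeds the sequential encoder above
```

Because the character path depends only on spelling, an unseen or misspelled word still receives a meaningful vector even when its word-level lookup carries no signal.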

MS GEN Encoder proved capable of addressing the long tail sparsity challenge with high precision. In our study, we first removed navigational queries, adult queries, and very common queries from all of the following analysis. From a six-month period, we collected a uniform sample of 700 million queries, then collected a set of 1 million queries sampled immediately after. In the sample of 1 million queries, we defined those with fewer than 16 occurrences in the larger set to be tail-ish. In fact, 39 percent of the 1 million queries were so tail-ish that they had never been seen at all in the historical set. However, "expanding" the unseen queries with their approximate nearest neighbors, matching unseen searches to historical searches with very similar MS GEN encodings, reduced the share of unseen searches to only 20 percent.
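Here is a rough sketch of that expansion step, assuming unseen queries are matched to historical ones by cosine similarity between their encodings. Exact nearest-neighbor search is shown for clarity; at the scale of the study, an approximate index would be used instead, and the similarity cutoff below is a hypothetical illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
historical = rng.standard_normal((1000, 256))     # stand-in encodings of seen queries
historical /= np.linalg.norm(historical, axis=1, keepdims=True)

def expand_unseen(query_vec, index=historical, k=5, min_sim=0.9):
    """Return indices of historical queries close enough to stand in
    for an unseen query; empty if nothing is sufficiently similar."""
    q = query_vec / np.linalg.norm(query_vec)
    sims = index @ q                              # cosine similarity to every historical query
    top = np.argsort(-sims)[:k]
    return top[sims[top] >= min_sim]

unseen = rng.standard_normal(256)
neighbors = expand_unseen(unseen)                 # historical stand-ins, if any
```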

Bonus capability: identifying higher-level search goals

While MS GEN Encoder was trained to map search queries with the same intent into similar representations, an unplanned and interesting capability arose: MS GEN Encoder naturally reflects different categories of search behaviors based on how similar the embeddings of two queries are.

In the table below, each row includes two search queries from the same person; the query in the second column was entered shortly after the first. In the second pair of queries, the individual clearly entered a different but related term, both falling under the larger topic of Revolutionary War battles; this relationship is identified despite the typo "2776" instead of "1776." In this case, the person was likely in a "learning mode," seeking to gain a broad understanding of the topic. The third pair of queries indicates the individual was looking for more specific information, while the fourth pair demonstrates a reformulation with the same intent. MS GEN Encoder is able to quantify these relationships to get a sense of people's search behavior, as in the sketch following the table, and such insights can help improve downstream tasks such as ranking and query suggestion.

\"MS

(opens in new tab)<\/span><\/a> MS GEN Encoder can help categorize search query pairs to see where people are going in their search sessions. For instance, if a person is asking about different battles in the Revolutionary War, then his or her trajectory is lateral within that topic.<\/p><\/div>\n
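As a hedged sketch of how encoding similarity could bucket consecutive in-session queries: the cutoffs below are hypothetical illustrations chosen for readability; the paper quantifies these behaviors rather than fixing thresholds like these.

```python
import numpy as np

def categorize_pair(vec_a, vec_b):
    """Bucket a pair of session queries by the similarity of their encodings."""
    sim = float(vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
    if sim > 0.9:
        return "reformulation (same intent)"      # e.g. a typo fix or rewording
    if sim > 0.6:
        return "specification or lateral move"    # related battles, added detail
    return "topic change"

a, b = np.random.default_rng(1).standard_normal((2, 256))
print(categorize_pair(a, b))
```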

The work behind MS GEN Encoder is further detailed in our paper "Generic Intent Representation in Web Search," which we're presenting at the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. We encourage readers to check it out, and if you're an academic researcher, we invite you to take the next steps in trying out MS GEN Encoder, which you can do using the Microsoft Machine Reading Comprehension, or MS MARCO, dataset. We've already started onboarding a few universities from the United States and Australia and are excited to see what findings come out of these research studies.
