{"id":1017150,"date":"2024-03-27T15:22:24","date_gmt":"2024-03-27T22:22:24","guid":{"rendered":""},"modified":"2024-04-12T14:39:48","modified_gmt":"2024-04-12T21:39:48","slug":"learning-from-interaction-with-microsoft-copilot-web","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/learning-from-interaction-with-microsoft-copilot-web\/","title":{"rendered":"Learning from interaction with Microsoft Copilot (web)"},"content":{"rendered":"\n
\"flowchart (opens in new tab)<\/span><\/a><\/figure>\n\n\n\n

AI systems like Bing and Microsoft Copilot (web) are as good as they are because they continuously learn and improve from people\u2019s interactions. Since the early 2000s, user clicks on search result pages have fueled the continuous improvements of search engines. Recently, reinforcement learning from human feedback (RLHF) brought step-function improvements to response quality of generative AI models. Bing has a rich history of success in improving its AI offerings by learning from user interactions. For example, Bing pioneered the idea of improving search ranking (opens in new tab)<\/span><\/a> and personalizing search (opens in new tab)<\/span><\/a> using short- and long-term user behavior data (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n

With the introduction of Microsoft Copilot (web), the way that people interact with AI systems has fundamentally changed from searching to conversing and from simple actions to complex workflows. Today, we are excited to share three technical reports on how we are starting<\/em> to leverage new types of user interactions to understand and improve Copilot (web) for our consumer customers. [1]<\/a><\/p>\n\n\n\n

How are people using Copilot (web)?<\/h2>\n\n\n\n

One of the first questions we asked about user interactions with Copilot (web) was, \u201cHow are people using Copilot (web)?\u201d Generative AI can perform many tasks that were not possible in the past, and it\u2019s important to understand people\u2019s expectations and needs so that we can continuously improve Copilot (web) in the ways that will help users the most.<\/p>\n\n\n\n

A key challenge of understanding user tasks at scale is to transform unstructured interaction data (e.g., Copilot logs) into a meaningful task taxonomy<\/a>. Existing methods heavily rely on manual effort, which is not scalable in novel and under-specified domains like generative AI. To address this challenge, we introduce TnT-LLM (T<\/strong>axonomy Generation and T<\/strong>ext Prediction with LLM<\/strong>s)<\/a>, a two-phase LLM-powered framework that generates and predicts task labels end-to-end with minimal human involvement (Figure 1).<\/p>\n\n\n\n

\"The
Figure 1. Comparing our TnT-LLM framework against existing methods in terms of interpretability and scalability.<\/figcaption><\/figure>\n\n\n\n

We conducted extensive human evaluation to understand how TnT-LLM performs. In discovering user intent and domain from Copilot (web) conversations, taxonomies generated by TnT-LLM are significantly more accurate than existing baselines (Figure 2).<\/p>\n\n\n\n

\"The
Figure 2. Evaluating the performance of TnT-LLM on user intent taxonomy generation. Error bars indicate 95% confidence intervals.<\/figcaption><\/figure>\n\n\n\n

We applied TnT-LLM to a large-scale number of fully de-identified Copilot (web) conversations and traditional Bing Search sessions. The results<\/a> (Figure 3) suggest that people use Copilot (web) for knowledge work tasks in domains such as writing and editing, data analysis, programming, science, and business. Further, tasks done in Copilot (web) generally are of higher complexity and more knowledge work-oriented compared to tasks done in traditional search engines. Generative AI’s emerging capabilities have evolved the tasks that machines can perform, to include some that humans have traditionally had to do without assistance. Results demonstrate that people are doing more complex tasks, frequently in the context of knowledge work, and show that this type of work is being newly assisted by Copilot (web).<\/p>\n\n\n\n

\"The<\/a>
Figure 3. Comparing the distribution of topical domains and task complexity between Bing search (left) and Copilot (web) (right).<\/figcaption><\/figure>\n\n\n\n

Estimating and interpreting user satisfaction<\/h2>\n\n\n\n

To effectively learn from user interactions, it is equally important to classify user satisfaction and to understand why people are satisfied or dissatisfied while trying to complete a given task. Most important, this will allow system developers to identify areas of improvement and to amplify and suggest successful use cases for broader groups of users.<\/p>\n\n\n\n

People give explicit and implicit feedback when interacting with AI systems. In the past, user feedback was in the form of clicks, ratings, or survey verbatims. When it comes to conversational systems like Copilot (web), people also give feedback in the messages they send during the conversations (Figure 4).<\/p>\n\n\n\n

\"The
Figure 4. Illustrations of how people may give feedback to a chatbot in their messages.<\/figcaption><\/figure>\n\n\n\n

To capture this new category of feedback signals, we propose our Supervised Prompting for User Satisfaction Rubrics (SPUR) (opens in new tab)<\/span><\/a> framework (Figure 5). It\u2019s a three-phase prompting framework for estimating user satisfaction with LLMs:<\/p>\n\n\n\n

    \n
  1. The supervised extraction prompt<\/strong> extracts diverse in situ<\/em> textual feedback from users interacting with Copilot (web).<\/li>\n\n\n\n
  2. The summarization rubric prompt<\/strong> identifies prominent textual feedback patterns and summarizes them into rubrics for estimating user satisfaction.<\/li>\n\n\n\n
  3. Based on the summarized rubrics, the final scoring prompt<\/strong> takes a conversation between a user and the AI agent and rates how satisfied the user was.<\/li>\n<\/ol>\n\n\n\n
    \"The (opens in new tab)<\/span><\/a>
    Figure 5. Framework of Supervised Prompting for User Satisfaction Rubrics.<\/figcaption><\/figure>\n\n\n\n

    We evaluated our framework on fully de-identified conversations with explicit user thumbs up\/down in Copilot (web) (Table 1). We find that SPUR outperforms other LLM-based and embedding-based methods, especially only limited human annotations of user satisfaction are available. Open-source reward models used for RLHF cannot be a proxy for user satisfaction, because reward models are usually trained with auxiliary human feedback that may differ from the feedback from the user who was involved in the conversation with the AI agent.<\/p>\n\n\n\n

    Method<\/th>Weighted F1-score<\/th><\/tr><\/thead>
    Reward (RLHF)<\/td>17.8<\/td><\/tr>
    ASAP (SOTA of embedding)<\/td>57.0<\/td><\/tr>
    Zero-Shot (GPT4)<\/td>74.1<\/td><\/tr>
    SESRP (GPT4)<\/td>77.4<\/strong><\/strong><\/td><\/tr><\/tbody><\/table>
    Table 1. Performance comparison between models for user satisfaction estimation.<\/center><\/figcaption><\/figure>\n\n\n\n

    Another critical feature of SPUR is its interpretability. It shows how people express satisfaction or dissatisfaction (Figure 6). For example, we see that users often give explicit positive feedback by clearly praising the response from Copilot (web). Conversely, they express explicit frustration or switch topics when encountering mistakes in the response from Copilot (web). This presents opportunities for providing customized user experience at critical moments of user satisfaction and dissatisfaction, such as context and memory reset after switching topics.<\/p>\n\n\n\n

    \"The
    Figure 6. SPUR reveals the distribution of satisfaction and dissatisfaction patterns among conversations with explicit user upvotes or downvotes.<\/figcaption><\/figure>\n\n\n\n

    In the user task classification discussed earlier, we know that people are using Copilot (web) for knowledge work and more complex tasks. As we further apply SPUR for user satisfaction estimation, we find that people are also more satisfied when they complete or partially complete cognitively complex tasks. Specifically, when regressing task complexity on the SPUR-derived summary user-satisfaction score, we find generally increasing coefficients on increasing levels of task complexity when using the lowest level of task complexity (i.e. Remember) as a baseline, provided the task was at least partially completed (see Table 2). For instance, partially completing a Create-level task, which is the highest level of task complexity, leads to an increase in user satisfaction that is more than double the increase when partially completing an Understand-level task. Fully completing a Create-level task leads to the largest increase in user satisfaction.<\/p>\n\n\n\n

    \"The
    Table 2. Regression results where the dependent variable is user satisfaction. In general, the more complex the task, the more satisfied the user whether it was partially or totally completed.<\/figcaption><\/figure>\n\n\n\n

    These three reports present a comprehensive and multi-faceted approach to dynamically learning from conversation logs in Copilot (web) at scale. As AI\u2019s generative capabilities increase, users are finding new ways to use the system to help them do more and shift from traditional click reactions to more nuanced, continuous dialogue-oriented feedback. To navigate this evolving user-AI interaction landscape, it is crucial to shift from established task frameworks and relevance evaluations to a more dynamic, bottom-up approach to task identification and user satisfaction evaluation.<\/p>\n\n\n\n

    Key Contributors <\/h5>\n\n\n\n

    Reid Andersen<\/a>, Georg Buscher, Scott Counts<\/a>, Deepak Gupta, Brent Hecht<\/a>, Dhruv Joshi, Sujay Kumar Jauhar<\/a>, Ying-Chun Lin, Sathish Manivannan, Jennifer Neville<\/a>, Nagu Rangan, Chirag Shah, Dolly Sobhani, Siddharth Suri<\/a>, Tara Safavi<\/a>, Jaime Teevan<\/a>, Saurabh Tiwary<\/a>, Mengting Wan<\/a>, Ryen W. White<\/a>, Xia Song<\/a>, Jack W. Stokes, Xiaofeng Xu, and Longqi Yang<\/a>.<\/p>\n\n\n\n


    \n\n\n\n

    [1]<\/a> The research was performed only on fully de-identified interaction data from Copilot (web) consumers. No enterprise data was used per our commitment to enterprise customers. We have taken careful steps to protect user privacy and adhere to strict ethical and responsible AI standards. All personal, private or sensitive information was scrubbed and masked before conversations were used for the research. The access to the dataset is strictly limited to approved researchers. The study was reviewed and approved by our institutional review board (IRB).<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"

    Microsoft researchers are taking a comprehensive and dynamic approach to help Copilot (web) continuously learn from interaction and feedback, improving the AI system and making it increasingly useful for consumers. Learn more.<\/p>\n","protected":false},"author":37583,"featured_media":1017312,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"categories":[1],"tags":[],"research-area":[13556,13545,13555,13559],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[243984],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-1017150","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-research-area-human-language-technologies","msr-research-area-search-information-retrieval","msr-research-area-social-sciences","msr-locale-en_us","msr-post-option-blog-homepage-featured"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199565],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[702211,722851,901101,144672,643845],"related-projects":[],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Scott Counts","user_id":31471,"display_name":"Scott Counts","author_link":"Scott Counts<\/a>","is_active":false,"last_first":"Counts, Scott","people_section":0,"alias":"counts"},{"type":"user_nicename","value":"Jennifer Neville","user_id":40946,"display_name":"Jennifer Neville","author_link":"Jennifer Neville<\/a>","is_active":false,"last_first":"Neville, Jennifer","people_section":0,"alias":"jenneville"},{"type":"user_nicename","value":"Mengting Wan","user_id":39510,"display_name":"Mengting Wan","author_link":"Mengting Wan<\/a>","is_active":false,"last_first":"Wan, Mengting","people_section":0,"alias":"mengtwan"},{"type":"user_nicename","value":"Ryen W. White","user_id":33481,"display_name":"Ryen W. White","author_link":"Ryen W. White<\/a>","is_active":false,"last_first":"White, Ryen W.","people_section":0,"alias":"ryenw"},{"type":"user_nicename","value":"Longqi Yang","user_id":38790,"display_name":"Longqi Yang","author_link":"Longqi Yang<\/a>","is_active":false,"last_first":"Yang, Longqi","people_section":0,"alias":"loy"}],"msr_type":"Post","featured_image_thumbnail":"\"flowchart","byline":"","formattedDate":"March 27, 2024","formattedExcerpt":"Microsoft researchers are taking a comprehensive and dynamic approach to help Copilot (web) continuously learn from interaction and feedback, improving the AI system and making it increasingly useful for consumers. Learn more.","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1017150"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/37583"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=1017150"}],"version-history":[{"count":45,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1017150\/revisions"}],"predecessor-version":[{"id":1024983,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1017150\/revisions\/1024983"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1017312"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1017150"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=1017150"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=1017150"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1017150"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=1017150"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=1017150"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1017150"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1017150"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1017150"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=1017150"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=1017150"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}