{"id":1048506,"date":"2024-07-15T09:00:00","date_gmt":"2024-07-15T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=1048506"},"modified":"2024-07-08T07:37:39","modified_gmt":"2024-07-08T14:37:39","slug":"rubicon-evaluating-conversations-between-humans-and-ai-systems","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/rubicon-evaluating-conversations-between-humans-and-ai-systems\/","title":{"rendered":"RUBICON: Evaluating conversations between humans and AI systems"},"content":{"rendered":"\n

This paper has been accepted at the <\/em><\/strong>1st<\/sup> ACM International Conference on AI-powered Software<\/em><\/strong> (opens in new tab)<\/span><\/a> (AIware 2024), co-located with <\/em><\/strong>FSE 2024<\/em><\/strong> (opens in new tab)<\/span><\/a>. AIware is the premier international forum on AI-powered software.<\/em><\/strong><\/p>\n\n\n\n

\"Rubicon<\/figure>\n\n\n\n

Generative AI has redefined the landscape of AI assistants in software development, with innovations like GitHub Copilot providing real-time, chat-based programming support. As these tools increase in sophistication and domain specialization, assessing their impact on user interactions becomes more challenging. Developers frequently question whether modifications to their AI assistants genuinely improve the user experience, as indicated in a recent paper<\/a>.<\/p>\n\n\n\n

\n\t
\n\t\t
\n\t\t\t\t\t\tPublication<\/span>\n\t\t\tRUBICON: Rubric-based Evaluation of Domain Specific Human-AI Conversations<\/span> <\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n

Traditional feedback mechanisms, such as simple thumbs-up or thumbs-down ratings, fall short of capturing the complexities of interactions in specialized settings, where nuanced data is often sparse. To address this issue, we introduce RUBICON: Rubric-based Evaluation of Domain Specific Human-AI Conversations<\/a>, presented at AIware 2024. RUBICON is an automated assessment technique that transforms a minimal dataset into an extensive array of domain-specific rubrics, helping ensure that updates not only modify but meaningfully improve user interactions.<\/p>\n\n\n\n

Foundational communication principles<\/h2>\n\n\n\n

Effective conversation, whether human-to-human or human-to-AI, adheres to four maxims (opens in new tab)<\/span><\/a> outlined by philosopher Paul Grice: quantity, quality, relation, and manner, which together ensure that communication is concise, truthful, pertinent, and clear. In AI applications, these maxims help create interactions that feel natural and engaging, fostering trust and empathy. Within domain-specific settings, RUBICON adapts these principles to ensure they are context-aware, improving the utility and clarity of interactions. For example, in Visual Studio, the AI helps the developer debug a program by providing detailed explanations and relevant code examples, as shown in Figure 1. In Figure 2, its responses reflect that it\u2019s guided by context.<\/p>\n\n\n\n

\"In
Figure 1. Contrasting interactions with two versions of the Visual Studio Debugging Assistant for the same task. On the left, the assistant makes assumptions without seeking clarification. On the right, the assistant proactively investigates the error, collaborates with the developer to gather essential information, and achieves a practical solution.<\/figcaption><\/figure>\n\n\n\n
\"In
Figure 2. Context awareness significantly improves the AI assistant\u2019s efficacy. The response on the left is generic, superficially referring to the developer\u2019s code and restating the obvious, providing little value. The reply on the right directs the developer toward a specific solution, the toJSON method.<\/figcaption><\/figure>\n\n\n\n

In task-oriented environments, it\u2019s important to assess how well a conversation aligns with user expectations and assists in achieving their goals. Conversations are only useful if they advance the user’s interests, and challenges can arise when users have misaligned expectations of the AI\u2019s capabilities or when the AI directs the conversation too forcefully, prioritizing its methods over the user\u2019s preferences. RUBICON balances the interaction dynamics between the AI and developer, promoting constructive exchanges without overwhelming or under-engaging. It calibrates the extent to which the AI should hypothesize and resolve issues versus how much it should leave to the developer.<\/p>\n\n\n\n\t


RUBICON\u2019s rubric-based method and evaluation<\/h2>\n\n\n\n

RUBICON builds on the foundational work of SPUR<\/a>\u2014the recently introduced Supervised Prompting for User Satisfaction Rubrics framework\u2014extending its scope and crafting a broad spectrum of candidate rubrics from each batch of data. It uses a language model to create concise summaries that assess the quality of conversations, emphasizing communication principles, task orientation, and domain specificity. It identifies signals of user satisfaction and outlines the shared responsibilities of the user and the AI in achieving task objectives. These summaries are then refined into rubrics.<\/p>\n\n\n\n
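The summarize-then-refine step described above can be pictured in a few lines of code. This is a minimal sketch under stated assumptions, not RUBICON's actual implementation: the `SUMMARY_PROMPT` text, the `llm` callable, and the one-rubric-per-line parsing convention are all hypothetical.

```python
from typing import Callable, List

# Hypothetical prompt asking a language model to distill a conversation into
# candidate rubrics: short, checkable statements about user satisfaction,
# task progress, and the division of work between the user and the AI.
SUMMARY_PROMPT = (
    "Summarize the following human-AI debugging conversation as short rubric "
    "statements, one per line, e.g. "
    "'The AI asked a clarifying question before proposing a fix.'\n\n{conversation}"
)

def generate_candidate_rubrics(
    conversations: List[str],
    llm: Callable[[str], str],  # any text-completion function: prompt -> text
) -> List[str]:
    """Turn each conversation into candidate rubrics, one per output line."""
    candidates: List[str] = []
    for conv in conversations:
        summary = llm(SUMMARY_PROMPT.format(conversation=conv))
        candidates.extend(
            line.strip() for line in summary.splitlines() if line.strip()
        )
    # Deduplicate while preserving order of first appearance.
    seen = set()
    unique: List[str] = []
    for rubric in candidates:
        if rubric not in seen:
            seen.add(rubric)
            unique.append(rubric)
    return unique
```

Because `llm` is just a callable, any completion backend can be plugged in, and the step is easy to exercise with a stub during testing.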

RUBICON\u2019s novel selection algorithm sifts through numerous candidates to identify a select group of high-quality rubrics, enhancing their predictive accuracy in practical applications, as illustrated in Figure 3. The technique doesn\u2019t require human intervention and can be trained directly on anonymized conversational data, helping to ensure customer data privacy while still extracting the important features for analysis.<\/p>\n\n\n\n
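The selection step can be illustrated with a simple greedy search over candidate rubrics. The majority-vote scoring and the stop-when-accuracy-plateaus rule below are assumptions made for this sketch; the paper's actual selection objective and algorithm differ in detail. Here `matches[i][r]` is assumed to record whether conversation `i` satisfies rubric `r`, and `labels[i]` is the human positive/negative judgment.

```python
from typing import Dict, List

def accuracy(selected: List[str],
             matches: List[Dict[str, bool]],
             labels: List[bool]) -> float:
    """Predict 'positive' for a conversation when a majority of the selected
    rubrics match it, then measure agreement with the human labels."""
    correct = 0
    for match, label in zip(matches, labels):
        score = sum(match.get(rubric, False) for rubric in selected)
        predicted = score * 2 >= len(selected)  # majority of rubrics matched
        correct += predicted == label
    return correct / len(labels)

def select_rubrics(candidates: List[str],
                   matches: List[Dict[str, bool]],
                   labels: List[bool],
                   k: int) -> List[str]:
    """Greedy forward selection: repeatedly add the candidate rubric that
    most improves accuracy on the labeled set, keeping at most k rubrics."""
    selected: List[str] = []
    pool = list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda r: accuracy(selected + [r], matches, labels))
        if selected and accuracy(selected + [best], matches, labels) <= accuracy(selected, matches, labels):
            break  # no remaining candidate improves accuracy
        selected.append(best)
        pool.remove(best)
    return selected
```

The greedy loop trades optimality for speed; an exhaustive search over rubric subsets would be exponential in the candidate count.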

\"The
Figure 3. Overview of RUBICON\u2019s framework and the various steps involved.<\/figcaption><\/figure>\n\n\n\n

The effectiveness of RUBICON\u2019s method is evidenced by its rubrics, which show an 18% increase in accuracy over SPUR in classifying conversations as positive or negative, as shown in Figure 4. Additionally, RUBICON achieves near-perfect precision in predicting conversation labels in 84% of cases involving unlabeled data.<\/p>\n\n\n\n

\"The
Figure 4. Two analogous conversations facilitated by the Debugger AI assistant are evaluated against representative rubrics. Software engineers who evaluated the conversations found the one on the left less effective and the one on the right more so. RUBICON’s rubric also gave a higher score to the conversation on the right, demonstrating that RUBICON’s method of evaluation is consistent with that of the software engineers.<\/figcaption><\/figure>\n\n\n\n

RUBICON-generated rubrics <\/h2>\n\n\n\n

RUBICON-generated rubrics serve as a framework for understanding user needs, expectations, and conversational norms. These rubrics have been successfully implemented in the Visual Studio IDE, where they have guided the analysis of over 12,000 debugging conversations, offering valuable insights into the effectiveness of modifications made to the assistant and facilitating rapid iteration and improvement. For example, the rubrics \u201c<\/em>The AI gave a solution too quickly, rather than asking the user for more information and trying to find the root cause of the issue\u201d and \u201cThe AI gave a mostly surface-level solution to the problem\u201d have flagged cases in which the assistant prematurely offered solutions without gathering sufficient information. These findings led to adjustments in the AI\u2019s behavior, making it more investigative and collaborative.<\/p>\n\n\n\n

Beyond conversational dynamics, the rubrics also identify systemic design flaws not directly tied to the conversational assistant. These include user interface issues that impede the integration of new code, as well as gaps in user education regarding the assistant\u2019s capabilities. To use RUBICON, developers need a small set of labeled conversations from their AI assistant and specifically designed prompts that reflect the criteria for task progression and completion. The methodology and examples of these rubrics are detailed in the paper<\/a>.<\/p>\n\n\n\n
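The inputs the paragraph above calls for can be pictured as a small data structure. The names below (`LabeledConversation`, `RubiconInputs`, and the two criteria fields) are illustrative only, not RUBICON's real API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LabeledConversation:
    """One training example: a transcript plus a human judgment."""
    transcript: str   # anonymized conversation text
    satisfied: bool   # thumbs-up / thumbs-down style label

@dataclass
class RubiconInputs:
    """What a developer supplies to bootstrap rubric learning (hypothetical)."""
    labeled: List[LabeledConversation]
    # Domain-specific prompt fragments describing what task progression
    # and task completion mean for this particular assistant.
    progression_criteria: str
    completion_criteria: str
```

Keeping the progression and completion criteria as explicit prompt fragments is what makes the technique adaptable across domains: only these strings and the small labeled set change per assistant.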

Implications and looking ahead<\/h2>\n\n\n\n

Developers of AI assistants value clear insights into the performance of their interfaces. RUBICON represents a valuable step toward a refined evaluation system that is sensitive to domain-specific tasks, adaptable to changing usage patterns, efficient, easy to implement, and privacy-conscious. A robust evaluation system like RUBICON can help improve the quality of these tools without compromising user privacy or data security. Looking ahead, our goal is to broaden RUBICON\u2019s applicability beyond debugging in AI assistants like GitHub Copilot. We aim to support additional tasks, such as migration and scaffolding, within IDEs, extending its utility to other chat-based Copilot experiences across various products.<\/p>\n","protected":false},"excerpt":{"rendered":"

RUBICON evaluates AI-driven conversations and improves their quality by learning detailed domain-specific rubrics from minimal data. It gathers insights on AI assistant performance while maintaining user privacy and data security.<\/p>\n","protected":false},"author":42735,"featured_media":1048530,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556,13560],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[243984],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-1048506","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-research-area-programming-languages-software-engineering","msr-locale-en_us","msr-post-option-blog-homepage-featured"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[663303],"related-projects":[],"related-events":[],"related-researchers":[{"type":"guest","value":"param-biyani","user_id":"1048509","display_name":"Param Biyani","author_link":"Param Biyani<\/a>","is_active":true,"last_first":"Biyani, Param","people_section":0,"alias":"param-biyani"},{"type":"user_nicename","value":"Yasharth Bajpai","user_id":42228,"display_name":"Yasharth Bajpai","author_link":"Yasharth Bajpai<\/a>","is_active":false,"last_first":"Bajpai, Yasharth","people_section":0,"alias":"ybajpai"},{"type":"user_nicename","value":"Arjun Radhakrishna","user_id":39405,"display_name":"Arjun 
Radhakrishna","author_link":"Arjun Radhakrishna<\/a>","is_active":false,"last_first":"Radhakrishna, Arjun","people_section":0,"alias":"arradha"},{"type":"user_nicename","value":"Gustavo Soares","user_id":39183,"display_name":"Gustavo Soares","author_link":"Gustavo Soares<\/a>","is_active":false,"last_first":"Soares, Gustavo","people_section":0,"alias":"gsoares"},{"type":"user_nicename","value":"Sumit Gulwani","user_id":33755,"display_name":"Sumit Gulwani","author_link":"Sumit Gulwani<\/a>","is_active":false,"last_first":"Gulwani, Sumit","people_section":0,"alias":"sumitg"}],"msr_type":"Post","featured_image_thumbnail":"\"Rubicon","byline":"","formattedDate":"July 15, 2024","formattedExcerpt":"RUBICON evaluates AI-driven conversations and improves their quality by learning detailed domain-specific rubrics from minimal data. It gathers insights on AI assistant performance while maintaining user privacy and data security.","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1048506","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/42735"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=1048506"}],"version-history":[{"count":27,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1048506\/revisions"}],"predecessor-version":[{"id":1050819,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1048506\/revisions\/1050819"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1048530"}],"wp:attachmen
t":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1048506"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=1048506"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=1048506"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1048506"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=1048506"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=1048506"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1048506"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1048506"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1048506"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=1048506"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=1048506"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}