{"id":563946,"date":"2019-01-25T16:37:23","date_gmt":"2019-01-26T00:37:23","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=563946"},"modified":"2019-01-25T16:36:34","modified_gmt":"2019-01-26T00:36:34","slug":"creating-better-ai-partners-a-case-for-backward-compatibility","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/creating-better-ai-partners-a-case-for-backward-compatibility\/","title":{"rendered":"Creating better AI partners: A case for backward compatibility"},"content":{"rendered":"
Artificial intelligence technologies hold great promise as partners in the real world. They're in the early stages of helping doctors administer care to their patients and lenders determine the risk associated with loan applications, among other examples. But what happens when a system that users have come to understand, and have learned to fold into their work, is updated? We can reasonably assume an improvement in accuracy or speed on the part of the agent, a seemingly beneficial change. However, current practices for updating the models that power AI partners don't account for how practitioners have learned over time to trust and make use of an agent's contributions.
Our team, which also includes graduate student Gagan Bansal, Microsoft Technical Fellow Eric Horvitz, University of Washington professor Daniel S. Weld, and University of Michigan assistant professor Walter S. Lasecki, focuses on this crucial step in the life cycle of machine learning models. In the work we're presenting next week at the Association for the Advancement of Artificial Intelligence's annual conference (AAAI 2019), we introduce a platform, openly available on GitHub, to help better understand the human-AI dynamics in these types of settings.

Updates are usually motivated either by additional training data or by algorithmic and optimization advances. Currently, an upgrade to an AI system is informed by improvements in model performance alone, often measured in terms of empirical accuracy on benchmark datasets. Such metrics, which capture the performance of the AI component in isolation, are not sufficient when the AI technology is used by people to accomplish tasks. Since models sometimes make mistakes, the success of human-AI teams in decision-making relies on the human partner building a mental model of the machine's behavior and learning when to trust it, so that he or she can decide when to override its decisions. We show in the research we'll be sharing at AAAI that updates that are not optimized for human-AI teams can cause significant disruptions in the collaboration by violating human trust.

Imagine a doctor using a diagnosis model she has found to be most helpful in cases involving her older patients. Let's say it's been 95 percent accurate. After an update, the model sees an overall increase in accuracy for all patients, to 98 percent, but, unbeknownst to the doctor, it introduces new mistakes that lead to poorer performance on older patients. Even though the model has improved, the doctor may now take the wrong action, lowering team performance. In fact, through human studies we present in the paper, we show that an update to a more accurate machine-learned model that is incompatible with the mental model of the human user (that is, an updated model that makes errors on specific cases the previous version was getting right) can hurt team performance instead of improving it. This empirical result provides evidence for undertaking a more comprehensive optimization, one that considers the performance of the human-AI team rather than the performance of only the AI component.
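To make this failure mode concrete, here is a minimal sketch in Python of one way backward compatibility could be quantified: the fraction of examples the previous model classified correctly that the updated model still gets right. The function name and the toy labels and predictions are hypothetical illustrations, not the paper's code; the point is that an update can raise overall accuracy while this score stays below 1.0, meaning it has introduced new errors on cases a user had learned to trust.

```python
import numpy as np

def compatibility_score(y_true, old_pred, new_pred):
    """Fraction of the old model's correct predictions that the new
    model also gets right (1.0 means fully backward compatible)."""
    old_correct = old_pred == y_true
    new_correct = new_pred == y_true
    if old_correct.sum() == 0:
        return 1.0  # vacuously compatible: the old model was never right
    return (old_correct & new_correct).sum() / old_correct.sum()

# Hypothetical ground-truth labels and predictions for ten patients
y_true   = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
old_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])  # 8/10 correct
new_pred = np.array([1, 0, 0, 1, 0, 1, 0, 1, 1, 0])  # 9/10 correct

print(np.mean(old_pred == y_true))                     # 0.8
print(np.mean(new_pred == y_true))                     # 0.9  (more accurate)
print(compatibility_score(y_true, old_pred, new_pred)) # 0.875 (one new error)
```

In the doctor's scenario, a compatibility score below 1.0 on the slice of older patients is exactly what erodes her mental model: the new errors land on the cases she had learned to delegate to the machine.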