{"id":622608,"date":"2019-12-02T08:59:34","date_gmt":"2019-12-02T16:59:34","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=622608"},"modified":"2019-12-02T08:22:11","modified_gmt":"2019-12-02T16:22:11","slug":"the-road-less-traveled-with-successor-uncertainties-rl-agents-become-better-informed-explorers","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/the-road-less-traveled-with-successor-uncertainties-rl-agents-become-better-informed-explorers\/","title":{"rendered":"The road less traveled: With Successor Uncertainties, RL agents become better informed explorers"},"content":{"rendered":"
Imagine moving to a new city. You want to get from your new home to your new job. Unfamiliar with the area, you ask your co-workers for the best route, and as far as you can tell … they\u2019re right! You get to work and back easily. But as you acclimate, you begin to wonder: Is there a more scenic route, perhaps, or a route that passes by a good coffee spot? The fundamental question then becomes: do you stick with what you already know\u2014a route largely without issues\u2014or do you seek to learn more, potentially finding a better way? Rely on prior experience too much, and you miss out on a pretty drive or that much-needed quality cup of joe. Spend too much time investigating, and you risk delays, ruining a good thing. As humans, we learn to balance exploiting what we know and exploring what we don\u2019t, a balance researchers in reinforcement learning are trying to achieve in AI agents.<\/p>\n In our paper \u201cSuccessor Uncertainties: Exploration and Uncertainty in Temporal Difference Learning,\u201d<\/span><\/a> which is being presented at the 33rd Conference on Neural Information Processing Systems (NeurIPS)<\/span><\/a>, we\u2019ve taken a step toward enabling efficient exploration by proposing Successor Uncertainties (SU), a simple yet effective algorithm for scalable model-free exploration with posterior sampling. SU is a randomized value function (RVF) algorithm, which means it directly models a posterior distribution over plausible state-action value functions, allowing for deep exploration\u2014that is, non-myopic exploration strategies. 
Compared with many other exploration schemes based on RVF and posterior sampling, SU is more effective in sparse-reward environments, overcomes some theoretical shortcomings of previous approaches, and is computationally cheaper.<\/p>\n An AI agent deployed in a new environment has to make a tradeoff between exploration (learning about behavior that leads to high cumulative rewards) and exploitation (leveraging knowledge about the environment that it has already acquired).<\/p>\n Simple exploration schemes like epsilon-greedy learn about new environments through minor, random modifications of the agent\u2019s current behavior. In our example of commuting to work in a new city, that would be akin to taking a couple of random turns down a new street not far off the original route to see whether there are any good coffee shops along the way. That approach isn\u2019t as efficient, though, as, say, taking a route along which you observe seven new streets and the types of businesses on them. The greater deviation allows you to obtain more information\u2014information about seven never-before-seen streets\u2014for future decision-making. This is referred to as deep exploration, and it requires an agent to take systematic actions over a longer time period. To do that, an AI agent has to consider what it doesn\u2019t know and experiment with behavior that could be beneficial under its current knowledge. This requirement can be formalized through an agent\u2019s uncertainty about the cumulative reward it might achieve by following a certain behavior.<\/p>\n SU represents an agent\u2019s uncertainty about the value of actions in a specific state of the environment and is constructed to satisfy posterior sampling policy matching<\/em>, a property ensuring that acting greedily with respect to samples from SU leads to behavior consistent with the agent\u2019s current knowledge of the environment\u2019s transition frequencies and reward distribution. 
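To make the contrast between simple exploration and posterior sampling concrete, here is a minimal, hypothetical NumPy sketch. All numbers are invented for illustration, the independent per-action Gaussians are a simplification (SU maintains a correlated posterior over the whole state-action value function), and this is not the implementation from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior over action values in a single state:
# for each action, a Gaussian with a mean and a standard deviation.
q_mean = np.array([1.0, 0.8, 0.5])  # estimated action values
q_std = np.array([0.1, 0.6, 0.9])   # uncertainty about those values

def epsilon_greedy_action(q_mean, epsilon, rng):
    """Simple exploration: a uniformly random action with probability
    epsilon, otherwise the current greedy action (undirected, myopic)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_mean)))
    return int(np.argmax(q_mean))

def posterior_sample_action(q_mean, q_std, rng):
    """Posterior sampling: draw one plausible value function from the
    posterior and act greedily with respect to that sample."""
    q_sample = rng.normal(q_mean, q_std)
    return int(np.argmax(q_sample))

# Uncertain but potentially valuable actions are chosen roughly in
# proportion to how plausible it is that they are optimal, giving
# directed exploration rather than uniform noise.
choices = [posterior_sample_action(q_mean, q_std, rng) for _ in range(1000)]
counts = np.bincount(choices, minlength=3)
```

Under this toy posterior, action 0 is picked most often, but the more uncertain actions 1 and 2 are still tried regularly, in contrast to epsilon-greedy, which explores them only with a fixed, uniform probability.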
This enables the agent to focus its exploration on promising behavior.<\/p>\n Many previous methods using RVF and posterior sampling focus on propagation of uncertainty, which ensures that the agent\u2019s uncertainties quantify not only the uncertainty about the immediate reward of an action, but also the uncertainty about the reward that could be achieved in the next step, the step after that, and so on. Based on these uncertainties, the agent can often efficiently experiment with behavior whose long-term outcome is uncertain. Surprisingly, as we show in our paper, propagation of uncertainty does not guarantee efficient exploration via posterior sampling. SU satisfies propagation of uncertainty, but it also satisfies posterior sampling policy matching, which makes it focus explicitly on the exploration induced by the model through posterior sampling. In contrast with methods that focus only on propagation of uncertainty, SU performs well on sparse-reward problems.<\/p>\n SU captures an agent\u2019s uncertainty about achievable rewards by combining successor features, Bayesian linear regression, and neural network function approximation in a novel way to form expressive posterior distributions over state-action value functions. Successor features enable us to estimate expected feature counts when following a particular behavior in a temporally consistent manner; Bayesian linear regression enables us to use those successor features to induce distributions over state-action value functions; and function approximation makes our approach practical in challenging applications with large action spaces.<\/p>\nAI the explorer<\/strong><\/h3>\n
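As a toy illustration of how successor features and Bayesian linear regression can combine to give a posterior over value functions, here is a hypothetical NumPy sketch. The successor features, data, and dimensions are invented, and the neural network function approximation used in practice is omitted:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: psi(s, a) are successor features, i.e., expected discounted
# future feature counts under the current policy (assumed given here).
d = 4                          # feature dimension
psi = rng.normal(size=(5, d))  # successor features for 5 actions in one state

# Bayesian linear regression posterior over reward weights w, fit to
# observed state-action features Phi and rewards r: r ~ Phi @ w + noise.
Phi = rng.normal(size=(50, d))
true_w = np.array([1.0, -0.5, 0.2, 0.0])
r = Phi @ true_w + 0.1 * rng.normal(size=50)

noise_var, prior_var = 0.1 ** 2, 1.0
precision = Phi.T @ Phi / noise_var + np.eye(d) / prior_var
cov = np.linalg.inv(precision)            # posterior covariance of w
mean = cov @ (Phi.T @ r) / noise_var      # posterior mean of w

# Q(s, a) = psi(s, a) @ w, so one draw of w from its posterior induces a
# correlated sample of the entire Q-function; acting greedily on that
# sample is posterior sampling.
w_sample = rng.multivariate_normal(mean, cov)
q_sample = psi @ w_sample
action = int(np.argmax(q_sample))
```

Because a single draw of w determines the value of every action at once, the sampled Q-function is internally consistent, which is what lets greedy behavior on the sample perform deep, non-myopic exploration.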
Considering the unknown<\/strong><\/h3>\n