{"id":622608,"date":"2019-12-02T08:59:34","date_gmt":"2019-12-02T16:59:34","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=622608"},"modified":"2019-12-02T08:22:11","modified_gmt":"2019-12-02T16:22:11","slug":"the-road-less-traveled-with-successor-uncertainties-rl-agents-become-better-informed-explorers","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/the-road-less-traveled-with-successor-uncertainties-rl-agents-become-better-informed-explorers\/","title":{"rendered":"The road less traveled: With Successor Uncertainties, RL agents become better informed explorers"},"content":{"rendered":"
Imagine moving to a new city. You want to get from your new home to your new job. Unfamiliar with the area, you ask your co-workers for the best route, and as far as you can tell … they\u2019re right! You get to work and back easily. But as you acclimate, you begin to wonder: Is there a more scenic route, perhaps, or a route that passes by a good coffee spot? The fundamental question then becomes: do you stick with what you already know\u2014a route largely without issues\u2014or do you seek to learn more, potentially finding a better way? Rely on prior experience too much, and you miss out on a pretty drive or that much-needed quality cup of joe. Spend too much time investigating, and you risk delays, ruining a good thing. As humans, we learn to balance exploiting what we know and exploring what we don\u2019t, a balance researchers in reinforcement learning are trying to achieve in AI agents.<\/p>\n In our paper \u201cSuccessor Uncertainties: Exploration and Uncertainty in Temporal Difference Learning,\u201d<\/span><\/a> which is being presented at the 33rd Conference on Neural Information Processing Systems (NeurIPS)<\/span><\/a>, we\u2019ve taken a step toward enabling efficient exploration by proposing Successor Uncertainties (SU), a simple yet effective algorithm for scalable model-free exploration with posterior sampling. SU is a randomized value function (RVF) algorithm, which means it directly models a posterior distribution over plausible state-action value functions, allowing for deep exploration\u2014that is, non-myopic exploration strategies. 
Compared with many other exploration schemes based on RVF and posterior sampling, SU is more effective in sparse-reward environments, overcomes some theoretical shortcomings of previous approaches, and is computationally cheaper.<\/p>\n An AI agent deployed in a new environment has to make a tradeoff between exploration (learning about behavior that leads to high cumulative rewards) and exploitation (leveraging knowledge about the environment that it has already acquired).<\/p>\n Simple exploration schemes like epsilon-greedy learn about new environments through minor, random modifications of the agent\u2019s current behavior. In our example of commuting to work in a new city, that would be akin to taking a couple of random turns down a new street not far off the original route to see whether there are any good coffee shops along the way. That approach isn\u2019t as efficient, though, as, say, taking a route along which you observe seven new streets and the types of businesses on them. The greater deviation allows you to obtain more information\u2014information about seven never-before-seen streets\u2014for future decision-making. This is referred to as deep exploration, and it requires an agent to take systematic actions over a longer time period. To do that, an AI agent has to consider what it doesn\u2019t know and experiment with behavior that could be beneficial under its current knowledge. This requirement can be formalized through an agent\u2019s uncertainty about the cumulative reward it might achieve by following a certain behavior.<\/p>\n SU represents an agent\u2019s uncertainty about the value of actions in a specific state of the environment and is constructed to satisfy posterior sampling policy matching<\/em>, a property ensuring that acting greedily with respect to samples from SU leads to behavior consistent with the agent\u2019s current knowledge of the environment\u2019s transition frequencies and reward distribution. 
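To make the contrast between simple exploration and posterior sampling concrete, here is a minimal, hypothetical NumPy sketch. All numbers are invented for illustration, the independent per-action Gaussians are a simplification (SU maintains a correlated posterior over the whole state-action value function), and this is not the implementation from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior over action values in a single state:
# for each action, a Gaussian with a mean and a standard deviation.
q_mean = np.array([1.0, 0.8, 0.5])  # estimated action values
q_std = np.array([0.1, 0.6, 0.9])   # uncertainty about those values

def epsilon_greedy_action(q_mean, epsilon, rng):
    """Simple exploration: a uniformly random action with probability
    epsilon, otherwise the current greedy action (undirected, myopic)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_mean)))
    return int(np.argmax(q_mean))

def posterior_sample_action(q_mean, q_std, rng):
    """Posterior sampling: draw one plausible value function from the
    posterior and act greedily with respect to that sample."""
    q_sample = rng.normal(q_mean, q_std)
    return int(np.argmax(q_sample))

# Uncertain but potentially valuable actions are chosen roughly in
# proportion to how plausible it is that they are optimal, giving
# directed exploration rather than uniform noise.
choices = [posterior_sample_action(q_mean, q_std, rng) for _ in range(1000)]
counts = np.bincount(choices, minlength=3)
```

Under this toy posterior, action 0 is picked most often, but the more uncertain actions 1 and 2 are still tried regularly, in contrast to epsilon-greedy, which explores them only with a fixed, uniform probability.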
This enables the agent to focus its exploration on promising behavior.<\/p>\n Many previous methods using RVF and posterior sampling focus on propagation of uncertainty, which ensures that the agent\u2019s uncertainties quantify not only the uncertainty about the immediate reward of an action, but also the uncertainty about the reward that could be achieved in the next step, the step after that, and so on. Based on these uncertainties, the agent can often efficiently experiment with behavior whose long-term outcome is uncertain. Surprisingly, as we show in our paper, propagation of uncertainty does not guarantee efficient exploration via posterior sampling. SU satisfies propagation of uncertainty, but it also satisfies posterior sampling policy matching, which makes it focus explicitly on the exploration induced by the model through posterior sampling. In contrast with methods that focus only on propagation of uncertainty, SU performs well on sparse-reward problems.<\/p>\n SU captures an agent\u2019s uncertainty about achievable rewards by combining successor features, Bayesian linear regression, and neural network function approximation in a novel way to form expressive posterior distributions over state-action value functions. Successor features enable us to estimate expected feature counts when following a particular behavior in a temporally consistent manner; Bayesian linear regression enables us to use those successor features to induce distributions over state-action value functions; and function approximation makes our approach practical in challenging applications with large action spaces.<\/p>\nAI the explorer<\/strong><\/h3>\n
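As a toy illustration of how successor features and Bayesian linear regression can combine to give a posterior over value functions, here is a hypothetical NumPy sketch. The successor features, data, and dimensions are invented, and the neural network function approximation used in practice is omitted:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: psi(s, a) are successor features, i.e., expected discounted
# future feature counts under the current policy (assumed given here).
d = 4                          # feature dimension
psi = rng.normal(size=(5, d))  # successor features for 5 actions in one state

# Bayesian linear regression posterior over reward weights w, fit to
# observed state-action features Phi and rewards r: r ~ Phi @ w + noise.
Phi = rng.normal(size=(50, d))
true_w = np.array([1.0, -0.5, 0.2, 0.0])
r = Phi @ true_w + 0.1 * rng.normal(size=50)

noise_var, prior_var = 0.1 ** 2, 1.0
precision = Phi.T @ Phi / noise_var + np.eye(d) / prior_var
cov = np.linalg.inv(precision)            # posterior covariance of w
mean = cov @ (Phi.T @ r) / noise_var      # posterior mean of w

# Q(s, a) = psi(s, a) @ w, so one draw of w from its posterior induces a
# correlated sample of the entire Q-function; acting greedily on that
# sample is posterior sampling.
w_sample = rng.multivariate_normal(mean, cov)
q_sample = psi @ w_sample
action = int(np.argmax(q_sample))
```

Because a single draw of w determines the value of every action at once, the sampled Q-function is internally consistent, which is what lets greedy behavior on the sample perform deep, non-myopic exploration.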
Considering the unknown<\/strong><\/h3>\n