{"id":663771,"date":"2020-06-03T03:00:00","date_gmt":"2020-06-03T10:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=663771"},"modified":"2020-06-18T07:28:27","modified_gmt":"2020-06-18T14:28:27","slug":"provably-efficient-reinforcement-learning-with-dr-akshay-krishnamurthy","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/podcast\/provably-efficient-reinforcement-learning-with-dr-akshay-krishnamurthy\/","title":{"rendered":"Provably efficient reinforcement learning with Dr. Akshay Krishnamurthy"},"content":{"rendered":"
MSR's New York City lab is home to some of the best reinforcement learning research on the planet, but if you ask any of the researchers, they'll tell you they're very interested in getting it out of the lab and into the real world. One of those researchers is Dr. Akshay Krishnamurthy, and today he explains how his work on feedback-driven data collection and provably efficient reinforcement learning algorithms is helping to move the RL needle in the real-world direction.

Transcript

Akshay Krishnamurthy: When you have casual conversations with people, you typically emphasize the, like, final performance, and when you talk about these high-profile achievements, we don't often spend much time scrutinizing the sample efficiency. And one thing that we've been kicking around a little bit is actually, can we come up with these empirical test suites or benchmarks that really put data efficiency at the forefront?

Host: You're listening to the Microsoft Research Podcast, a show that brings you closer to the cutting edge of technology research and the scientists behind it. I'm your host, Gretchen Huizinga.

Host: MSR's New York City lab is home to some of the best reinforcement learning research on the planet, but if you ask any of the researchers, they'll tell you they're very interested in getting it out of the lab and into the real world. One of those researchers is Dr. Akshay Krishnamurthy, and today he explains how his work on feedback-driven data collection and provably efficient reinforcement learning algorithms is helping to move the RL needle in the real-world direction. That and much more on this episode of the Microsoft Research Podcast.

Host: Akshay Krishnamurthy, welcome to the podcast.

Akshay Krishnamurthy: Thanks for having me.

Host: Let's start by situating. You're a Principal Researcher at MSR in New York City and part of what I'd like to call a small but tactical unit working on reinforcement learning in real-world settings, both for the short term, you called that translational, and the long term. So, to kick things off today, tell us how these two approaches are different and how they are being tackled in your lab, and then talk about how they are each making an impact on the science of reinforcement learning, both practically and theoretically.

Akshay Krishnamurthy: I like to categorize our projects in terms of the time horizon with which we might push things into a product. And the first, kind of, bucket here is more translational efforts. These are the kinds of things that maybe we hope to put in production in the next year or two years. And then the other bucket is maybe more blue-sky kind of efforts, things that maybe are five years out, or ten years out, or, you know, just more high-risk kind of projects. Let me talk a little bit about the translational stuff.

Host: Yeah.

Akshay Krishnamurthy: So, our group maintains and develops this RL system that is currently deployed in production in several settings.
Most of these are recommendation kind of settings.

Host: Mmm-hmm.

Akshay Krishnamurthy: And we're constantly adding features. We have this amazing group of engineers and data scientists working on this product, and the researchers are adding features to the system based on new research and also, kind of, based on requests and feedback from product groups. So this is, I think, a really cool way to do research, where you're working with a product group on some deployments and then they come to you and say, hey, we also want to do X, and then we think about it and we're like, oh, we actually don't know how to do X. So then we, like, go do some research and try to figure out how to do that kind of thing, and then we circle back with the product group maybe in six months or one year when we figure something out.

Host: Okay.

Akshay Krishnamurthy: And one example of this is actually this project we did on recommendation when the sort of decision space that the algorithm is operating over is very, very rich and typically, like, of a combinatorial nature. So, the prototypical example here is personalized news. You don't just choose one story to show to the user, you might choose, like, ten stories and you have to rank them. So now your decision space is much, much more complicated and it becomes much more difficult, statistically, to figure out what sort of recommendation rules are better. And so we came up with some technique that's inspired by a lot of literature in the learning theory community, and we brought this back to the news team, and I think, in the last one year, this has sort of been making its way to production in this system that we're running. So that's kind of what the translational efforts look like. And the other side of the story is this blue-sky or long-term stuff, and here, I think, we're not that attached to applications. When I think about this style of research, I mostly am thinking about, what are the really challenging problems in reinforcement learning and how can we come up with models and algorithms that address those issues? I don't really know the timescale on which these things will pan out, right? I think these problems are super hard. I'm definitely not the only one thinking about them. You know, we're making progress, but it's just not exactly clear to me when these things will make it to applications, but the hope is that, you know, as we build our understanding, we'll be able to do that.

Host: Well, let's situate you, Akshay. Researchers often live in an intersection, or they like to say they do. So, what's your address? Where does your research passion lie and what gets you up in the morning?

Akshay Krishnamurthy: Yeah, so these days I'm pretty interested in the statistical issues around reinforcement learning. But this, I think for me, is part of a bigger story. I've always been interested in interactive learning, which, typically, when I tell this to people, they think of something else, but I usually use this to refer to any sort of learning setting where the algorithm engages in feedback-driven data collection. And this is, I think, one of the central problems in reinforcement learning.
The agent operates in this world, makes a bunch of decisions, and the decisions it makes influence the data that the agent has. And so, part of the challenge is actually learning to make the right decisions to collect the right data. So that's sort of where I came from. I've always been interested in this sort of, like, broader topic of designing algorithms to collect the right data. And for the last four or five years, it's mostly been about reinforcement learning because that's a very hard version of that problem.

Host: Well, let's go a little deeper on the theoretical for a moment and talk about what you referred to as incentive structures in scientific circles. So, what does the current landscape look like, particularly with regards to how benchmarks are set up and measured, and how we might be thinking differently about incentive structures so that the reinforcement learning ball moves forward?

Akshay Krishnamurthy: I think data efficiency is one of the core issues in RL. I was just mentioning this. And so, I think most researchers are interested in moving towards real-world applications. And the ones that people typically kick around are recommendation, high-precision medicine, personalized medicine, robotics, these kinds of things. And in these settings, we may not have the luxury of collecting millions or billions of interactions before we produce some high-quality policy. So, I feel the real-world applications are quite different from the kinds of simulated environments that we are currently using for benchmarking. And, don't get me wrong, the benchmarks are amazing. They've really pushed the field forward. But I also am kind of mindful about the way benchmarks also influence the community, so there's some kind of feedback loop here.

Host: Right.

Akshay Krishnamurthy: And so, these benchmarks kind of overemphasize the quality of the final policy at the expense of, how much experience did you need to get there? And I think, when you have casual conversations with people, you typically emphasize the, like, final performance, and when you talk about these high-profile achievements, we don't often spend much time scrutinizing the sample efficiency. And one thing that we've been kicking around a little bit is actually, can we come up with these empirical test suites or benchmarks that really put data efficiency at the forefront?

Host: Right.

Akshay Krishnamurthy: And I think these are the kind of things that hopefully will catch on and actually, like, nudge the community back to thinking about statistical issues.

Host: Right. Let's get down to the nitty gritty and talk about math. The beating heart of RL research is algorithms, and you told me they need to be able to do three things. So, I am going to have you get down to the nitty gritty on some impressive algorithms you've been working on in a minute. But first, give us a framework for the research.
What things are reinforcement learning algorithms supposed to be able to do, and where are we on the path towards "cheap, fast, good, pick three" instead of "pick two?"

Akshay Krishnamurthy: This is actually something I took from one of John Langford's talks, and the core capabilities of RL algorithms are credit assignment, which is, when you make a sequence of decisions and observe a certain outcome, you know how to allocate, like, which decisions were responsible for that outcome. And this is not so easy because maybe you did one hundred things and then at the end, you know, you are at a pot of gold, and is it because you turned left over here or turned right over there or whatever? That's credit assignment. The other issue is exploration. So, you're an agent, you're in some complicated environment. You run around for some time. If you run around blindly, you will not, you know, find all the corners of every nook and cranny that you might need to, you know, solve your problem, so you have to explore the environment. And the other one is generalization. If your environment is relatively complicated, you might not actually see the same thing twice. And what you'd hope for is that the skills you learn and the patterns you pick up in one place can transfer to other situations. So, this is this thing that my colleague Dipendra always talks about. He's saying, I learned how to bike in Brooklyn and when I go to Amsterdam, I should also know how to bike. So, we should be able to generalize our experience to new experiences.

Host: Okay.

Akshay Krishnamurthy: So those are the three capabilities, and when you take any pair of these right now, we have relatively mature algorithms that often are used in production, and we have, like, a pretty solid theory for how to do two of three. So, I think the one that is pretty common these days is doing credit assignment and generalization without exploration.

Host: Hmm.

Akshay Krishnamurthy: And that's what's going on in a lot of these policy search kind of algorithms; maybe you've heard of policy gradient methods. And so, what these algorithms do is, you have some current policy. You execute it, perhaps with some noise, to sort of deviate a little bit. And you pick up some signal about what local changes to your policy are doing, and this is sort of solving the credit assignment. This deviation is addressing the credit assignment aspect. And your policy is maybe parameterized by some neural network, and so that's handling the generalization aspect. You don't have to keep track of every single state you might see. So, then you do these local deviations and you make small improvements on your policy. So, these algorithms are actually pretty effective and they work really well in, like, continuous control kind of problems, but they are doing a local search, so they don't really work very well in problems where you have to do very sophisticated exploration. But maybe one research question is, can we bring exploration into this style of algorithm? We certainly don't have robust algorithms for doing all three.
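The policy-gradient recipe described here – local deviations for credit assignment, shared parameters for generalization, no deliberate exploration – can be sketched in a few lines of Python. Everything concrete below (the toy chain environment, one-hot features, the step size) is an illustrative assumption rather than anything from the episode; this is a minimal REINFORCE-style sketch, not a production method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy chain: states 0..4, actions {0: left, 1: right}; reward only at the
# right end. Episodes last H steps. (All illustrative assumptions.)
N_STATES, N_ACTIONS, H = 5, 2, 10

def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

def features(s):
    # One-hot features; a richer feature map (or a neural net) is what
    # would provide generalization across states.
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x

theta = np.zeros((N_STATES, N_ACTIONS))  # linear softmax policy parameters

def policy(s):
    logits = features(s) @ theta
    p = np.exp(logits - logits.max())
    return p / p.sum()

# REINFORCE: sampling actions is the local "deviation"; weighting the
# log-probability gradient by the episode return does credit assignment.
for episode in range(2000):
    s, ret, traj = 0, 0.0, []
    for _ in range(H):
        p = policy(s)
        a = rng.choice(N_ACTIONS, p=p)
        s2, r = step(s, a)
        traj.append((s, a, p))
        s, ret = s2, ret + r
    for s_t, a_t, p_t in traj:
        x = features(s_t)
        grad = -np.outer(x, p_t)      # d log pi(a|s) / d theta for softmax
        grad[:, a_t] += x
        theta += 0.05 * ret * grad    # no baseline, no exploration bonus

print("P(right) in each state:", [round(policy(s)[1], 2) for s in range(N_STATES)])
```

Note how the sketch has no exploration mechanism: actions come only from the current policy, so in an environment where reward is hard to stumble upon, the gradient signal stays zero. That is exactly the missing third capability discussed above.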
Akshay Krishnamurthy: The blue-sky optimism is that we can do all three.

Host: Yeah.

Akshay Krishnamurthy: And I think even in this "cheap, fast and good," the definitions keep moving. So certainly, I think now, like, if you think about it, with the advent of the internet, we can access information very cheaply, very quickly, and it's very high-quality.

Host: Yeah.

Akshay Krishnamurthy: And it's very different from maybe what it was thirty years ago or something.

Host: Absolutely.

Akshay Krishnamurthy: So, the definitions keep moving and that's fine. But I think we should actually be optimistic and push for these things that maybe seem almost impossible.

(music plays)

Host: Well, let's talk about some specific algorithms since we're on that track right now. The first one is one you call Policy Cover by Inductive Decoding, or PCID, and the exploration around this work was published in 2019 in an ICML paper titled – ready? – Provably efficient reinforcement learning with rich observations. Tell us about PCID and how it moved the needle on provably data-efficient methods. What were the key claims and contributions of this work?

Akshay Krishnamurthy: This paper is part of a longer effort on this sort of blue-sky track of designing algorithms with these three capabilities of credit assignment, generalization and exploration. So, in an ICML 2017 paper, we came up with a more general theory for some sort of way to build algorithms with these three capabilities and some, kind of, environment assumptions under which these things are possible. But the algorithms in that paper are not particularly practical, so this ICML 2019 paper is more focused towards practicality, although I would still consider this a relatively theoretical paper. So, what did we do? We have this new algorithm, it's called PCID, and we can use it to find near-optimal policies for a restricted class of environments that we call Block MDPs. And it does this in a provably efficient way, both in terms of the amount of experience it requires and the total computation. And I think one thing I do want to highlight here is that these Block MDPs, they're relatively structured environments, but they do actually require the agent to demonstrate all three of those core capabilities of credit assignment, generalization and exploration. But the other thing I want to highlight is the phrase "provably efficient" that I mentioned. I don't want to suggest that other algorithms do not have these capabilities; what I'm saying is, we are able to prove that our algorithm does. Perhaps the key claim is, we're proving that this algorithm has some data efficiency guarantee, and that is something I kind of want to emphasize. So, let me tell you, like, the high level of the approach. It's superficially similar to, I think, what's now a relatively common strategy, which is to create some auxiliary supervision based on what the agent is doing, and use that to identify some kind of compact representation of the world, and then use the compact representation to drive an exploration module.
So, the way we sort of realize this is, our representation is, like, a discrete latent state, and that's why this thing is called decoding. We decode the latent state of the world and then we run one of these algorithms that can do exploration and credit assignment, but not generalization, on the decoded latent state. So that's, like, the high-level idea. And to explain more, I have to tell you about this Block MDP model. The model is this kind of stylized setting in which there exists some latent state space that's relatively small – in the literature they might call this a tabular state space – but these states are never observed by the agent. Instead, the agent will be in a state, but it doesn't know that, and it will see an observation which is, like, sort of emitted from a distribution associated with this state, and the observation might be extremely high-dimensional and very, very rich. But the observation might actually have enough information for the agent to figure out what its latent state is. But the agent sort of doesn't know how to do that compression. So, there's some mapping in the world that takes observations and maps them to the latent state. And if the agent only knew this mapping, it could just run one of these tabular methods, which do exploration and credit assignment but don't do generalization. So, this model kind of encourages a little bit of a, like, a factored approach, where you do the generalization part, you know, to map observations to states, and then you do the other two things using a technique we already know.

Host: Okay.

Akshay Krishnamurthy: The key challenge here is that you never observe the latent state. So, it's not that clear: how are you going to learn this decoder? And this is sort of where the auxiliary supervision comes in. So, we use supervised learning to train a predictor to predict some other stuff, and somehow, we show that, when you predict this other stuff, you recover the latent state. The algorithm proceeds in this, like, forward manner: so, you take, like, a hundred steps and then you reset. And then you take a hundred steps and you reset. And each one of these is an episode. And so, we can assume we have the decoder at stage H, some time step, and we're going to use the decoder at the previous time to learn the decoder at the current time. And the way you do it is, you take the observation at the current time and you predict the previous action and the previous latent state, or decoded state.

Host: Right.

Akshay Krishnamurthy: You can imagine it kind of recovers the dynamics of the world, because the reason you're seeing this observation now is because you were in that state yesterday and you took that action yesterday. So, you recover, like, the backward dynamics of the process. And because the dynamics are governed by the latent state only, you kind of learn this compression. And then, once you have the latent state at stage H, we learn how to visit every state that you could possibly reach at stage H, and then that gives you good data coverage. So that's, like, the exploration component.
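The forward, inductive decoder-learning loop described here can be sketched as ordinary supervised learning. The snippet below is a simplified illustration of the idea, not the published PCID algorithm: the environment interface (`env.reset()` / `env.step(a)` returning observation vectors), the use of uniformly random actions in place of the learned policy cover, and the use of the classifier's argmax label in place of the paper's clustering step are all assumptions made for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_decoder_at_step(env, decoders, h, n_actions, n_episodes=500, seed=0):
    """Fit a decoder for observations at time step h, given decoders for
    earlier steps (the inductive, forward structure described above)."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for _ in range(n_episodes):
        obs = env.reset()                    # start a fresh episode
        for t in range(h):
            prev_state = decoders[t](obs)    # decode the current latent state
            a = rng.integers(n_actions)      # crude stand-in for a policy cover
            obs = env.step(a)
        # Auxiliary supervision: from the observation at step h, predict the
        # (previous decoded state, previous action) pair. Backward dynamics
        # depend only on the latent state, so solving this prediction problem
        # implicitly compresses observations down to states.
        X.append(obs)
        y.append(prev_state * n_actions + a)
    clf = LogisticRegression(max_iter=1000).fit(np.array(X), np.array(y))
    # The paper clusters the predicted distributions into discrete states;
    # using the argmax label directly is a crude stand-in for that step.
    return lambda o: int(clf.predict(np.asarray(o).reshape(1, -1))[0])

# Inductive use: a single known start state anchors the base case.
# decoders = {0: lambda o: 0}
# for h in range(1, H + 1):
#     decoders[h] = learn_decoder_at_step(env, decoders, h, n_actions)
```

Once the decoder at step h is in hand, the actual algorithm learns policies that reach each decoded state, which is the data-coverage step described next.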
Akshay Krishnamurthy: You get good data coverage at stage H, and you use that to drive the process of learning the decoder at the next stage. So, the PCID supervision is: predict the previous learned state and action from the current observation. Okay, so what do we prove? We prove that, if you are in a Block MDP with not too many latent states, not too many actions per stage, and the time horizon is not too long, and then there are some technical assumptions, we prove that this algorithm will find some way to visit every state of the world with good probability, without too many samples. So, the number of samples will be polynomial in the number of latent states, the time horizon, the number of actions, all these sorts of relevant quantities, and then, we learn how to visit every state of the world, which is solving the exploration problem.

Host: Right.

Akshay Krishnamurthy: And the generalization component is baked in because this is a Block MDP, and you have to do a decoding. On the computational side, we operate in this sort of – it's called an oracle model of computation – so we assume that you have a class of functions, maybe it's like a neural network or something, and that you can solve supervised learning problems over this neural network. And that's where this auxiliary supervision comes in. We train a neural network, or whatever, to map the raw observation to the previous state and action, and we assume that you can actually solve those classification problems. And so that's the computation. Let me just mention the last piece, which is that this is the first one where we actually ran some experiments.

Host: Okay.

Akshay Krishnamurthy: Because this algorithm is not super practical, but not super impractical either; it's somewhere in the middle.

Host: It's the Goldilocks one.

Akshay Krishnamurthy: Yeah.

Host: Just right.

Akshay Krishnamurthy: It's one that we felt like we could implement. So the experiments, I think, are more of a proof of concept that it is actually doing something sensible than anything else, and what we show is, we sort of concoct this kind of very, very difficult environment for exploration, and we run these algorithms that don't do any exploration and they fail, obviously, and then we run an algorithm that... we give it cheating access to the latent state and we run an exploratory algorithm. So, this is like that algorithm that does exploration and credit assignment, but not generalization, and it does extremely well. And then we run our algorithm, which does, like, not much worse than that thing. Okay. So, it's like the first one is, like, obviously not going to work. The second one is obviously going to work, and the question is, how close are we to, like, the skyline that we could hope to achieve? And so, we are getting there, but we're solving the harder problem where you have to do the decoding.

Host: Okay. And that was 2019. We're all the way into 2020. Since then, you haven't sat around on your algorithms. You have a new paper in the pipeline with another algorithm that you think is actually pretty cool.
The paper is called Kinematic state abstraction and provably – there's that word again – efficient rich observation reinforcement learning. That's a mouthful, and also a mathful. In this paper you present another algorithm called Homer. Tell us about Homer and how it moves us closer to solving for X, if X equals robust, real-world reinforcement learning. And again, feel free to show your work.

Akshay Krishnamurthy: Yeah, so Homer is... I would think of it as an improvement on PCID, but the high-level claims are kind of similar. So, we're still working in a Block MDP. On a technical level, we're actually able to remove some of these technical assumptions that I didn't really explain to you in the previous story. But those are actually required by PCID's theory, and we can actually show that PCID empirically fails when some of those things are not satisfied. So that's actually kind of important. But I think there are two, perhaps, more important points about Homer. So, Homer is much, much more robust in practice. And part of the reason is that, in PCID, we're trying to use the current observation to predict the previous state and the previous action. Now, we don't have access to the previous state, so we predict the previous predicted state, okay? And this can cause some catastrophic failure when your previous predicted state is wrong, okay? So, you're trying to predict something that you're predicting yourself, which causes some sort of bubbling up of errors that doesn't work very well. And so, in Homer, we use a different auxiliary supervision problem that does not rely on the previous one. We still do this kind of inductive thing, but we train this decoder for time H, and then we, like, throw it away, and then we train a completely new decoder for time H plus one. And so, this makes the algorithm somehow much, much more robust and much less prone to these, kind of, propagating errors.

Host: Okay.

Akshay Krishnamurthy: Which is a very, very common problem in reinforcement learning. The other thing that we're doing in Homer, which is, I think, empirically very important is, in PCID, we are using the latent state space that we build – we build, like, a dynamics model over the latent state space – and we do what they call model-based planning. And this is also prone to failure when your model is wrong. Okay. So, you're trying to plan, in your model, to reach some state, but your model is wrong and so you just completely fail. And in Homer, we replaced this planning module with, like, a policy search module that operates again on the raw observations. So again here, we're not using the decoders; we're just saying we've collected a good dataset that covers the whole world and we can use the data to train a machine learning model to do the things we want to do. So those are the two reasons that Homer is, like, an empirical and in some ways a theoretical improvement.
But the other, more important point is that in the Homer project, we've identified a new structural notion that we think might be useful in the design of other algorithms. And this is also the first part of the title, which is this kinematic state abstraction. It's kind of some property of an environment where you might say two states of the world are very similar if the dynamics to and from those states are the same. You should be able to collapse these two things together because, if one policy gets to one, it also gets to the other. So, for the purpose of exploration, these two states are very coupled. And for the purpose of forward prediction, they're also very coupled because they have the same dynamics going forward.

Host: Yeah.

Akshay Krishnamurthy: And so, this kinematic state abstraction is a way to formalize this. And it's kind of easy to show that, in Block MDPs, all the observations coming from a latent state are tied together. They induce the same forward and backward dynamics. This idea also inspires the new supervised learning auxiliary supervision problem that we use here. So, this prediction problem is what they call contrastive learning. What you're going to try to do is, you're going to splice data sequences together in some real and fake way, and you're going to try to predict which ones are real and which ones are fake. So, I'm the agent. I see some observation. I take an action. I see some next observation. That's a real transition. So, I have an observation-action-observation triple. Now, I'm at some new place in the world, a new observation, I take a different action, I see another observation. Let me swap the last two observations between these two triples. Those are fake transitions, right? And now, can I somehow train a model to predict what's real and fake from this crazy dataset that I just, like, made up? I know the labels because I'm doing the splicing. So, I train a supervised learning model, just a classifier, to predict what's real and what's fake. If you think about it, you can kind of just pick out the transition operator, because you know what a real transition is now. And that means you know what, like, the distribution of the next state is going to be, given your current state and action.

Host: Yeah.

Akshay Krishnamurthy: And again, we can show that somehow, when you solve this problem, you kind of are decoding the latent state. And then, like, sort of a lot of the stuff that we do with PCID goes through again. On the empirical side, it's still a relatively toy experiment. But we're more serious about comparing with the algorithms that people are running in practice. There was one algorithm that was pretty good. So, this is one of these policy search methods where you attach an exploration module based on this idea called random network distillation. So, it's this chain of, like, one hundred actions, and this thing got to, like, level twenty-five or thirty or something, but Homer just crushes the whole thing because it's kind of designed to do so.
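The splicing construction described here is straightforward to write down. The sketch below is illustrative and is not the published Homer implementation: it assumes transitions have already been logged as (observation, action, next observation) triples of numpy vectors with integer actions, builds fake triples by swapping next observations between transitions, and fits an off-the-shelf classifier; Homer's actual representation-learning architecture and exploration loop are not shown.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def make_contrastive_dataset(transitions, rng):
    """transitions: list of (obs, action, next_obs) triples (numpy vectors
    plus an integer action). Returns a real-vs-fake classification dataset."""
    X, y = [], []
    n = len(transitions)
    for i, (o, a, o2) in enumerate(transitions):
        X.append(np.concatenate([o, [a], o2]))           # real triple
        y.append(1)
        # Fake triple: splice in the next observation from a *different*
        # transition. The label comes for free, because we did the
        # splicing ourselves -- no extra supervision is required.
        j = (i + 1 + rng.integers(n - 1)) % n            # any index != i
        X.append(np.concatenate([o, [a], transitions[j][2]]))
        y.append(0)
    return np.array(X), np.array(y)

# A classifier that separates real from fake transitions has implicitly
# picked out the transition operator, and in a Block MDP its view of the
# next observation can only depend on the latent state -- which is why its
# representation can play the role of a decoder.
rng = np.random.default_rng(0)
# Hypothetical usage, with `logged_transitions` collected elsewhere:
# X, y = make_contrastive_dataset(logged_transitions, rng)
# clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X, y)
```

Training a fresh classifier of this kind at each time step, rather than bootstrapping off the previous step's predictions, is what avoids the propagating-errors problem described above.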
Akshay Krishnamurthy: I think it's, in some ways, demonstrating what is a very difficult exploration problem.

Host: Yeah.

Akshay Krishnamurthy: The algorithms that people typically use do not solve those hard exploration problems. And Homer is sort of constructed to do so, and does it.

(music plays)

Host: Let's talk briefly about the section in every academic paper that addresses research limitations and what scientists are still working on. In general, what big challenges remain in this process or pipeline of algorithmic perfection, and how are you moving into unlit spaces to resolve them?

Akshay Krishnamurthy: Let's think about this on the theory side and the empirical side. So, on the theory side, I think it's becoming important to identify more expressive models that emphasize the need for these core capabilities of generalization, exploration and credit assignment. To date, we've been working on the Block MDP model, which certainly does these things, but it's quite limited in many ways. So, a lot of problems tend to have, like, a continuous but low-dimensional latent state, and this is not captured by the Block MDP. But we're also thinking about, like, somehow, substantially more expressive models. So, we have some new proposal for a model that's, like, exponentially more expressive than the Block MDP.

Host: Okay.

Akshay Krishnamurthy: And whether we can build algorithms for those things. So, like, Homer and PCID really are leveraging this discrete latent state space in some very, very strong way, and, because they're so attached to the model, they don't really work in the real world. Like, your real world does not have that Block MDP structure, and then what happens, right? But the other frustration here is that we kind of know that some modelling assumptions are required for provable sample and computational efficiency, and so the question is, what are the assumptions that are relevant to applications? And perhaps we should just keep expanding until we cover most things, but it might not actually work that way. Like, maybe we'll have to sort of fragment and say, okay, for this type of application, we're going to focus on this set of assumptions which are relevant and come up with a specialized set of algorithms, and so on. So that's kind of the theory side. I think we are sort of pushing on, like, more expressive models and algorithms that address more general settings. On the empirical side, Homer and PCID are very much algorithms designed by theoreticians, and there's a question of whether we can make them more empirically effective. I think there are problems that do have something close to Block MDP structure, and if we can make these algorithms somehow more natural, more robust, more practically viable, maybe we can start pushing towards those kinds of problems. And so, we have an ongoing effort around this where we're trying to identify some problems that might have the kind of structure that Homer is sort of exploiting, and also trying to make Homer somehow a little more data-efficient in these kinds of things.

Host: It's about now that I always ask what keeps you up at night, both figuratively and literally.
Reinforcement learning is a bit of a rock star when it's in a controlled or simulated environment, but once it hits the open world, the state space gets real, as we might say. So, what could possibly go wrong, Akshay, and how are you working, from the beginning, to ensure that it doesn't?

Akshay Krishnamurthy: Well, to be honest, what usually keeps me up at night is some, like, bug in a proof or something in my code that's not working, and this has been especially true in recent months because I've been, like, working very hard on this fairly delicate proof and keep getting stuck, like, every day. So mostly I can't sleep because I, like, cannot solve my, you know, technical problem. On the higher level, in terms of, I think, what could go wrong, I'm definitely aware that when you pursue these sort of blue-sky efforts, they're necessarily pretty high-risk, and it could be the case that, when you take this theory-first approach, or even when you are thinking about reinforcement learning in the way that we are thinking, it just might not pan out. So, it just could be the case that we push on this for, you know, five years, ten years, and then we just hit a dead end. And that would kind of be depressing, but I think it's part of what happens when you do this kind of high-risk stuff.

Host: Yeah.

Akshay Krishnamurthy: But I'm also a bit reassured that, you know, RL is a very, very popular topic. We have a large community of incredibly brilliant and amazing people that are working on these problems, and everyone is tackling them with a different approach and different perspective. So even if what we're trying to do doesn't really pan out or doesn't do the things that we want, I think we will find something that makes sense as a community. And from, like, a scientific perspective, that sort of, I think, keeps me sane. And I always have conversations with people about this kind of stuff. Like, when I give a talk about our work, they're like, well, do you really think this is going to pan out? And I say, I don't know, but, like, someone should try it, right? So, I think there's a real risk that maybe these approaches don't make sense. And obviously we're thinking about when they do make sense and trying to make them make sense, but maybe they will not work, right? And that kind of keeps me up at night a little bit, but usually it's the more technical stuff. The other concern, I think, which is sort of what you are talking about, is, when we deploy these algorithms in the real world, what could go wrong? I certainly think a lot of things could go wrong and there's a lot of things to be concerned about: like, you know, there's biases, there's safety issues, there's fairness issues. And there's become a very vibrant community of people working on those kinds of problems. I think we're so far from applications... like, I don't encourage anyone to deploy Homer anywhere! But I would say that, you know, MSR has brilliant people working on these kinds of issues.
And I think if and when we do get closer to deploying something, I am super fortunate to feel like I can go talk to them and, like, loop them in and see, like, what should we be doing, and – I don't claim to be an expert on those things at all, so I would just want to take advice from other people on these kinds of things.

Host: Well, it's story time here on the Microsoft Research Podcast. So, let's hear yours. What got you started in high tech, what was your academic path toward research, and how did you land in New York City (or at least, post-COVID, the home office in Brooklyn) working on reinforcement learning for Microsoft Research?

Akshay Krishnamurthy: Yeah, so I grew up in a pretty technical family. I was born and raised in Silicon Valley. Both of my parents were engineers. My brother is a computer scientist, so I think it was kind of in the cards. That said, I didn't think I wanted to be an academic, probably through most of grad school. So, when I went to college, I kind of had more of an entrepreneurial bent. I think this is super natural for people in Silicon Valley. They all kind of want to do that. You hear about all these tech stars and so on. So, I didn't really think I wanted to be an academic. I was, you know, hacking on random start-uppy project ideas in my dorm room and stuff. I did some industry internships at tech start-ups and these kinds of things. And I actually didn't really like it. So kind of what happened is, I did those things and I didn't enjoy it, and then I did a research project in undergrad, and then I also had this amazing opportunity to spend a summer in Tel Aviv doing a research project with a professor there. And that was, like, super fun and really kind of pushed me towards going to grad school, and then, my last summer of grad school, I was, like, super fortunate that Alekh and Miro, like, invited me to do an internship. And they were just, hey, do you want to come spend the summer in New York doing an internship? And I was like, yeah, that sounds amazing, and so I did that. I really, like, fell in love with the lab. And since then, like, I think John has really been an advocate for me, so he...

Host: Langford?

Akshay Krishnamurthy: John Langford, yeah. So, he, like, encouraged me to apply for a postdoc. He was on my thesis committee. He wrote a letter for me. So, I came back for a postdoc, and then I spent two years as an assistant professor at UMass, and then he sort of encouraged me to come back as a researcher. I had to do this advertisement at a conference about MSR, and I think the thing that is real is, like, I've been there for three different stages of my life and I keep coming back. So, it is, like, really a glorious place to do research and I love it a lot. So, I think having these people like Alekh Agarwal, Miro Dudik, John Langford – they have been super inspirational. They're really, like, role models for me and have really, like, given me an opportunity to, you know, be here and succeed and so on.

Host: What's one interesting thing we don't know about you? I leave it open-ended so it can really be anything. Maybe it's a personal characteristic or life event that impacted your career or life.
Maybe it's a side quest or a hobby that defines you outside your scientific work, or maybe it's just something fun or funny that people might not suspect about you?

Akshay Krishnamurthy: Yeah, so I was actually discussing this question last night with my partner, and the conclusion is that I'm pretty cliché for, like, a tech bro. So, I do all the things that, like, you know, the San Francisco start-up people do. I like to rock climb, I play ultimate Frisbee, I read a ton of sci-fi. I am re-reading The Three-Body Problem right now, which is just an excellent book; I cannot recommend it enough. I'm also kind of a foodie. So, in the pre-COVID times, Sid Sen and I would frequently sneak out of the office to grab, you know, fancy sweets, and being in New York, there's just, like, tons of amazing restaurants and dessert places to go to. So, when I go to conferences, I almost always bring my climbing equipment. And there's starting to be, like, a group of people that are in this crew of machine learning climbers, and we typically go check out the local bouldering gyms or maybe even go to, like, an outdoor spot for climbing. And there's also, obviously, a bunch of foodies in the ML community, so we have a small group of people who go check out, like, you know, Michelin star restaurants and stuff. So, like, the academic lifestyle is kind of great for those kinds of things.

Host: What's the biggest thing you've climbed?

Akshay Krishnamurthy: So, I mostly boulder, so the biggest thing I've climbed is not very tall. And mostly I actually do, like, indoor bouldering, so, you know, these climbing gyms are cropping up everywhere.

Host: Right.

Akshay Krishnamurthy: At NeurIPS last year – so this was in Vancouver – a friend and I went to this, like, you know, world-class bouldering area just north, in Squamish.

Host: Yes.

Akshay Krishnamurthy: And so, we went bouldering in Squamish. It was raining a bit, so the rocks were kind of wet...

Host: Slippery.

Akshay Krishnamurthy: ...which means we couldn't do anything.

Host: No.

Akshay Krishnamurthy: I think I'm also not that good, so... The outdoors is really just where you kind of get humiliated.

Host: Yeah.

Akshay Krishnamurthy: Nature slams you. And I think the other thing is, like, the way they do this grading – the grading outdoors is not commensurate with the grading indoors. So, you kind of get humbled.

Host: They need an algorithm for that.

Akshay Krishnamurthy: Yeah.

Host: Well, as we close – and this has been super fun – I'd like you to dream out loud about the future for a minute and give our listeners a vision for next-gen reinforcement learning.
I sometimes frame this now as, if you are wildly successful, what will you be known for at the end of your career, and how will the research you and your colleagues have done make an impact on the world, both on the scientific community and on people at large?

Akshay Krishnamurthy: So, I really hope that we can be involved in the development of data-efficient, empirically effective reinforcement learning algorithms. I think it's a little bit hard to predict where they will be effective. So, this is this lamppost sort of phenomenon that we were talking about the other day. I don't know what all the applications might be for such algorithms, if we have them, and this is part of this constrained thinking where, like, you know, we think about applications that our current technologies can solve and we don't think about applications that maybe new technologies might enable. Looking under the lamppost is this story... or it's, like, a metaphor... so there's this person and they lose their keys. It's nighttime, and they're looking for their keys under the lamppost because that's the only place where they can see, even though they are a hundred percent sure that is not where they lost their keys. It's just where there's light. So, this is this thing of, you know, we work on the things that we know we can do. And the applications that we think about are the things that are just – just beyond our grasp. But maybe if we have some radically new technology, the set of applications will just really open up, and we cannot really predict what those things might be. I think it would be really awesome to be a part of that effort and see what kinds of things open up, but it would be really amazing to see these algorithms have, like, a positive impact on the world.

Host: Akshay Krishnamurthy, this has been more fun than a podcast about algorithms should be. Thank you so much for coming on!

Akshay Krishnamurthy: Thank you very much!

(music plays)

To learn more about Dr. Akshay Krishnamurthy, and the very latest in reinforcement learning, visit Microsoft.com/research.