Reinforcement learning for the real world with Dr. John Langford and Rafah Hosn
Microsoft Research Podcast | May 8, 2019
https://www.microsoft.com/en-us/research/podcast/reinforcement-learning-for-the-real-world-with-dr-john-langford-and-rafah-hosn/
Dr. John Langford, a partner researcher in the Machine Learning group at Microsoft Research New York City, is a reinforcement learning expert who is working, in his own words, to solve machine learning. Rafah Hosn, also of MSR New York, is a principal program manager who's working to take that work to the world. If that sounds like big thinking in the Big Apple, well, New York City has always been a "go big, or go home" kind of town, and MSR NYC is a "go big, or go home" kind of lab.

Today, Dr. Langford explains why online reinforcement learning is critical to solving machine learning and how moving from the current foundation of a Markov decision process toward a contextual bandit future might be part of the solution. Rafah Hosn talks about why it's important, from a business perspective, to move RL agents out of simulated environments and into the open world, and gives us an under-the-hood look at the product side of MSR's "research, incubate, transfer" process, focusing on real-world reinforcement learning which, at Microsoft, is now called Azure Cognitive Services Personalizer.

Host: Welcome to another two-chair, two-mic episode of the Microsoft Research Podcast. Today we bring you the perspectives of two guests on the topic of reinforcement learning for online applications. Since most research wants to be a product when it grows up, we've brought in a brilliant researcher/program manager duo to illuminate the classic "research, incubate, transfer" process in the context of real-world reinforcement learning.

Host: You're listening to the Microsoft Research Podcast, a show that brings you closer to the cutting edge of technology research and the scientists behind it. I'm your host, Gretchen Huizinga.

Host: Dr.
John Langford, a partner researcher in the Machine Learning group at Microsoft Research New York City, is a reinforcement learning expert who is working, in his own words, to solve machine learning. Rafah Hosn, also of MSR New York, is a principal program manager who's working to take that work to the world. If that sounds like big thinking in the Big Apple, well, New York City has always been a "go big, or go home" kind of town, and MSR NYC is a "go big, or go home" kind of lab.

Today, Dr. Langford explains why online reinforcement learning is critical to solving machine learning and how moving from the current foundation of a Markov decision process toward a contextual bandit future might be part of the solution. Rafah Hosn talks about why it's important, from a business perspective, to move RL agents out of simulated environments and into the open world, and gives us an under-the-hood look at the product side of MSR's "research, incubate, transfer" process, focusing on real-world reinforcement learning which, at Microsoft, is now called Azure Cognitive Services Personalizer. That and much more on this episode of the Microsoft Research Podcast.

Host: I've got two guests in the booth today, both working on some big research problems in the Big Apple. John Langford is a partner researcher in the Machine Learning group at MSR NYC, and Rafah Hosn, also at the New York lab, is the principal program manager for the personalization service, also known as real-world reinforcement learning. John and Rafah, welcome to the podcast.

John Langford: Thank you.

Rafah Hosn: Thank you.

Host: Microsoft Research's New York lab is relatively small in the constellation of MSR labs, but there's some really important work going on there. So, to get us started, tell us what each of you does for a living and how you work together. What gets you up in the morning?
Rafah, why don't you start?

Rafah Hosn: Okay, I'll start. So I wake up every day and think about all the great things that the reinforcement learning researchers are doing, and first I map what they're working on to something that could be useful for customers, and then I think to myself, how can we now take this great research, which typically comes in the form of a paper, to a prototype, to an incubation, to something that Microsoft can make money out of?

Host: That's a big thread, starting with a little seed, and ending up with a big plant at the end.

Rafah Hosn: Yes, we have to think big.

Host: That's right. How about you, John?

John Langford: I want to solve machine learning! And that's ambitious, but one of the things that you really need to do if you want to solve machine learning is you need to solve reinforcement learning, which is kind of the common basis for learning algorithms to learn from interaction with the real world. And so, figuring out new ways to do this, or trying to expand the scope of where we can actually apply these techniques, is what really drives me.

Host: Can you go a little deeper into "solve machine learning"? What would solving machine learning look like?

John Langford: It would look like anything that you can pose as a machine learning problem you can solve, right? So, I became interested in machine learning back when I was an undergrad, actually.

Host: Yeah.

John Langford: I went to a machine learning class and I was like, oh, this is what I want to do for my life! And I've been pursuing it ever since.

Host: And here you are.

John Langford: Yeah.

Host: So, we're going to spend the bulk of our time today talking about the specific work you're doing in reinforcement learning.
But John, before we get into it, give us a little context as a level set. From your perspective, what's unique about reinforcement learning within the machine learning universe, and why is it an important part of MSR's research portfolio?

John Langford: So, most of the machine learning that's actually deployed is of the supervised learning variety. And supervised learning is fundamentally about taking expertise from people and making that into some sort of learned function that you can then use to do some task. Reinforcement learning is different because it's about taking information from the world and learning a policy for interacting with the world so that you perform better in one way or another. So, that different source of information can be incredibly powerful, because you can imagine a future where, every time you type on the keyboard, the keyboard learns to understand you better, right? Or every time you interact with some website, it understands better what your preferences are, so the world just starts working better and better in interacting with people.

Host: And so, reinforcement learning, as a method within the machine learning world, is different from other methods because you deploy it in less-known circumstances, or how would you define that?

John Langford: So, it's different in many ways, but the key difference is the information source. The consequence of that is that reinforcement learning can be surprising. It can actually surprise you. It can find solutions you might not have thought of to solve problems that you posed to it. That's one of the key things. Another thing is, it requires substantially more skill to apply than supervised learning. Supervised learning is pretty straightforward as far as the statistics go, while with reinforcement learning, there are some real traps out there, and you want to think carefully about what you're doing.
Let me go into a little more detail there.

Host: Please do.

John Langford: Let's suppose you need to make a sequence of ten steps, and you want to maximize the rewards you get in those ten steps, right? So, it might be the case that going left gives you a small reward immediately, and then you get no more rewards. While if you go right, you get no reward, and then you go left, and then right, and then right, and then left, and then right, and so on, ten times… do it just the right way, you get a big reward, right? So many reinforcement learning algorithms just learn to go left, because that gave the small reward immediately. And that gap is not like a little gap. It's like, you may require exponentially many more samples to learn unless you actually gather the information in an intelligent, conscious way.

Host: Yeah. I'm grinning, and no one can see it, because I'm thinking, that's how people operate generally, you know? If I…

Rafah Hosn: Actually, yeah. I mean, the way I explain reinforcement learning is the way you teach a puppy how to do a trick. And the puppy may surprise you and do something else, but the reward that John speaks of is the treat that you give the puppy when the puppy does what you are trying to teach it to do, and sometimes they just surprise you and do something different. And actually, reinforcement learning has a great affinity to Pavlovian psychology.

Host: Well, back to your example, John, you're saying if you turn left you get the reward immediately.

John Langford: Yeah, a small reward immediately.

Host: A small reward.
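[Editor's note: John's ten-step example can be sketched as a toy "combination lock" problem. Everything concrete below, the secret action sequence and the exact reward values, is invented for illustration; it is not a real environment from the episode.]

```python
import random

# Toy "combination lock" version of John's ten-step example (the secret
# sequence and reward values are invented for illustration). Going left on
# the first step pays a small reward immediately and nothing afterwards;
# only one exact ten-step sequence pays the big delayed reward.
HORIZON = 10
SECRET = ["right", "left", "right", "right", "left",
          "right", "right", "left", "left", "right"]

def episode_reward(actions):
    """Total reward for a fixed sequence of ten actions."""
    if actions[0] == "left":
        return 0.1                       # small immediate reward
    return 10.0 if actions == SECRET else 0.0

# A greedy learner probes each first action once and commits to the winner:
probes = {a: episode_reward([a] + ["left"] * (HORIZON - 1))
          for a in ("left", "right")}
greedy_choice = max(probes, key=probes.get)  # locks in "left" for 0.1 forever

# Undirected exploration must stumble on SECRET by chance: each random
# episode succeeds with probability 2**-10, roughly one hit per 1024 tries.
random.seed(0)
hits = sum(episode_reward([random.choice(["left", "right"])
                           for _ in range(HORIZON)]) == 10.0
           for _ in range(5000))
```

With purely random behavior the big reward shows up only a handful of times in 5,000 episodes, while the greedy learner never sees it at all; that is the exponential sample gap John is describing, and why deliberate information gathering matters.
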
So, the agent would have to go through many, many steps of this to figure out, don't go left, because you'll get more later.

John Langford: You'll get more later if you go right and you take the right actions after you go right.

Rafah Hosn: Now, imagine explaining this to a customer.

Host: And we will get there, and I'll have you explain it. Rafah, let's talk for a second about the personalization service, which is an instantiation of what you call real-world reinforcement learning, yeah?

Rafah Hosn: That's right.

Host: So, you characterize it as a general framework for reinforcement learning algorithms that are suitable for real-world applications. Unpack that a bit. Give us a short primer on real-world reinforcement learning and why it's an important direction for reinforcement learning in general.

Rafah Hosn: Yeah, I'll give you my version, and I'm sure John will chime in. But, you know, most of the reinforcement learning that people hear about is almost always done in a simulated environment, where you can be creative as to what you simulate, and you can generate, you know, gazillions of samples to make your agents work. Our type of reinforcement… John's type of reinforcement learning is something that we deploy online, and what drives us, John and I, is to create or use this methodology to solve real-world problems. And our goal is really to advance the science in order to help enterprises maximize their business objectives through the usage of real-world reinforcement learning. So, when I say real world, these are models that we deploy in production with real users, getting real feedback, and they learn on the job.

Host: Well, John, talk a little bit about what Rafah has alluded to. There's an online, real-world element to it, but prior to this, reinforcement learning has had some big investments in the gaming space.
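[Editor's note: Rafah's description of models that are deployed with real users and "learn on the job" can be sketched as a minimal epsilon-greedy bandit loop. The action set, the context, and the click simulation below are all invented for illustration; the production system she and John discuss, Azure Personalizer, is built on more sophisticated contextual-bandit algorithms than this sketch.]

```python
import random
from collections import defaultdict

# Minimal epsilon-greedy bandit loop of the kind Rafah describes: the
# policy is deployed, acts, observes real feedback, and updates online.
# Actions, context, and feedback here are invented for illustration.
random.seed(1)
EPSILON = 0.1                        # fraction of traffic spent exploring
ACTIONS = ["sports", "politics", "tech"]
totals = defaultdict(float)          # (context, action) -> summed reward
counts = defaultdict(int)            # (context, action) -> times shown

def choose(context):
    if random.random() < EPSILON:    # explore: gather information
        return random.choice(ACTIONS)
    # exploit: pick the action with the best average reward observed so far
    return max(ACTIONS, key=lambda a:
               totals[(context, a)] / max(counts[(context, a)], 1))

def update(context, action, reward):
    totals[(context, action)] += reward
    counts[(context, action)] += 1

def click(context, action):
    # Stand-in for real user feedback: this user only clicks tech stories.
    return 1.0 if action == "tech" else 0.0

for _ in range(500):                 # 500 live interactions
    action = choose("user_42")
    update("user_42", action, click("user_42", action))
```

After a few hundred interactions the loop has learned, from feedback alone and with no simulator, that this user wants tech stories; this is the "learn on the job" loop, in miniature.
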
Tell us the difference and what happens when you move from a very closed environment to a very open environment from a technical perspective.

John Langford: Yeah, so I guess the first thing to understand is why you'd want to do this, because if reinforcement learning in simulators works great, then why do you need to do something else? And I guess the answer is, there are many things you just can't simulate. So, an example that I often give in talks is, would I be interested in a news article about Ukraine? The answer is, yes. Because my wife is from Ukraine. But you would never know this. Your simulator would never know this. There would be no way for the policy to actually learn that if you're learning in the simulator.

Host: Right.

John Langford: So, there are many problems where there are no good simulators. And in those settings, you don't have a choice. So, given that you don't have a choice, you need to embrace the difficulties of the problem. So, what are the difficulties of real-world reinforcement learning problems? Well, you don't have the zillions of examples that are typically required for many of the existing deep reinforcement learning algorithms. You need to be careful about how you use your samples. You need to use them with maximum and utmost efficiency in trying to do the learning. Another element is that often, when people have simulators, those simulators are effectively stationary. They stay the same throughout the process of training. But in real-world problems, many of the ones that we encounter, we run into all kinds of non-stationarities, these exogenous events, so the algorithms need to be very robust. So the combination of using samples very efficiently and great robustness in these algorithms are the key elements that set them apart from what you might see in other places.

Host: Which is challenging AlphaGo or Ms.
Pac-Man or the other games that have been sort of flags waved about our progress in reinforcement learning?

John Langford: I think those are fun applications. I really enjoy reading about them and learning about them. I think it's a great demonstration of where the field has gotten, but I feel like there's this issue of AI winter, right? So, there was once a time when AI crashed. That may happen again, because AI is now a buzzword. But I think it's important that we actually do things that have some real value in the world, which actually affect people's lives, because that's what creates a lasting wave of innovation and puts civilization into a new place.

Host: Right.

John Langford: So that's what I'm really seeking.

Host: What season are we in now? I've heard there has been more than one AI winter, and some people are saying it's AI spring. I don't know. Where do you see us in terms of that progress?

John Langford: I think it's fair to say that there's a lot of froth in terms of people claiming things that are not going to come to pass. At the same time, there is real value being created. Suddenly we can do things, and things work better through some of these techniques, right? So, it's kind of this mishmash of overpromised things that are going to fail, and things that are not overpromised that will succeed, and so if there are enough of those that succeed, then maybe we don't have a winter. Maybe it just becomes a long summer.

Host: Like San Diego all the time…

Rafah Hosn: Yeah, but to comment on John's point here, I think reinforcement learning is a nascent technique compared to supervised learning. And what's important is to do the crawl, walk, run, right?
So, yeah, it's sexy now and people are talking about it, but we need to rein it in from a business perspective as to, you know, what are the classes of problems that we can satisfy the business leader with? And satisfy them effectively, right? And I think, from a reinforcement learning standpoint, John, correct me, we are very much at the crawl phase in solving generic business problems.

John Langford: I mean, we have solved some generic business problems. But we don't have widely deployed, or deployable, platforms for reusing those solutions over and over again. And it's so easy to imagine many more applications than people have even tried. So, we're nowhere near a mature phase in terms of even simple kinds of reinforcement learning. We are ramping up in our ability to solve real-world reinforcement learning problems…

Episode 75, May 8, 2019