{"id":590815,"date":"2019-06-06T09:56:55","date_gmt":"2019-06-06T16:56:55","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=590815"},"modified":"2019-06-06T11:21:04","modified_gmt":"2019-06-06T18:21:04","slug":"reliability-in-reinforcement-learning","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/reliability-in-reinforcement-learning\/","title":{"rendered":"Reliability in Reinforcement Learning"},"content":{"rendered":"
Reinforcement Learning (RL), much like scaling a 3,000-foot rock face, is about learning to make sequential decisions. The list of potential RL applications is expansive, spanning robotics (drone control), dialogue systems (personal assistants, automated call centers), the game industry (non-player characters, computer AI), treatment design (pharmaceutical tests, crop management), complex systems control (resource allocation, flow optimization), and more.

Some RL achievements are quite compelling. For example, Stanford University's RL team learned how to control a reduced-scale model helicopter, and even taught it new aerobatics. Orange Labs deployed the first commercial dialogue system optimized with RL. DeepMind invented DQN, the first deep RL algorithm capable of playing Atari games at human skill level using visual inputs, and trained a Go AI solely by playing it against itself; it remains undefeated in its Go matches against human players.

Despite such remarkable achievements, applying RL to most real-world scenarios remains a challenge, for several reasons. Deep RL algorithms are not sample efficient: they require billions of samples to obtain their results, and collecting such astronomical numbers of samples is not feasible in real-world applications. RL also falls short on moral constraints. The algorithms need to be safe: they have to be able to learn in real-life settings without risking lives or equipment. They need to be fair: they must not discriminate against people. Finally, they need to be reliable and deliver consistently solid results.

This blog post focuses on reliability in reinforcement learning. Deep RL algorithms are impressive, but only when they work. In reality, they are largely unreliable. Even worse, two runs with different random seeds can yield very different results because of the stochasticity of the reinforcement learning process. We propose two ways of mitigating this.

The first idea (published last year at ICLR) is quite simple: if one algorithm is not reliable, we train several of them and use the best one. Figure 1 illustrates the algorithm selection process. At the beginning of each episode, the algorithm selector picks an algorithm from the portfolio. The selected algorithm outputs a policy that is used for the entirety of the episode. Then, in the green area, the standard RL loop runs until the episode completes. The generated trajectory is recorded and fed to the algorithms for future training. The performance measure is sent to the algorithm selector, so that it selects the fittest algorithms in the future. A rough sketch of this loop is given below.

### Algorithm Selection
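To make the loop in Figure 1 concrete, here is a minimal sketch of the episode-level selection process in Python. It is illustrative only: the environment interface (`reset`/`step`), the portfolio members' `current_policy` and `train` methods, and the UCB-style bandit selector are assumptions made for the example, not the exact mechanism of the ICLR paper.

```python
import math


class UCBSelector:
    """Episode-level algorithm selector using a UCB-style bandit rule (illustrative choice)."""

    def __init__(self, n_algos, exploration=2.0):
        self.counts = [0] * n_algos      # episodes played by each algorithm
        self.sums = [0.0] * n_algos      # cumulative return earned by each algorithm
        self.exploration = exploration

    def select(self):
        # Play each algorithm once before applying the UCB formula.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        total = sum(self.counts)
        scores = [
            self.sums[i] / self.counts[i]
            + math.sqrt(self.exploration * math.log(total) / self.counts[i])
            for i in range(len(self.counts))
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, i, episode_return):
        self.counts[i] += 1
        self.sums[i] += episode_return


def run_episode(env, policy):
    """Standard RL loop (the green area in Figure 1): one episode under a fixed policy."""
    trajectory, episode_return = [], 0.0
    state, done = env.reset(), False
    while not done:
        action = policy(state)
        state_next, reward, done = env.step(action)   # simplified env interface (assumption)
        trajectory.append((state, action, reward, state_next, done))
        episode_return += reward
        state = state_next
    return trajectory, episode_return


def algorithm_selection(env, portfolio, n_episodes=1000):
    """Run the episode-level selection loop over a portfolio of RL algorithms."""
    selector = UCBSelector(len(portfolio))
    for _ in range(n_episodes):
        i = selector.select()                      # pick an algorithm for this episode
        policy = portfolio[i].current_policy()     # it provides the policy for the whole episode
        trajectory, episode_return = run_episode(env, policy)
        for algo in portfolio:                     # the trajectory is fed to all algorithms for training
            algo.train(trajectory)
        selector.update(i, episode_return)         # performance feeds back into the selector
    return portfolio, selector
```

The key point the sketch tries to capture is that every algorithm in the portfolio learns from every recorded trajectory, regardless of which one generated it, while the selector only tracks which algorithm's policies earn the highest returns and favors the fittest ones over time.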