{"id":590815,"date":"2019-06-06T09:56:55","date_gmt":"2019-06-06T16:56:55","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=590815"},"modified":"2019-06-06T11:21:04","modified_gmt":"2019-06-06T18:21:04","slug":"reliability-in-reinforcement-learning","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/reliability-in-reinforcement-learning\/","title":{"rendered":"Reliability in Reinforcement Learning"},"content":{"rendered":"
Reinforcement Learning (RL), much like scaling a 3,000-foot rock face, is about learning to make sequential decisions. The list of potential RL applications is expansive, spanning robotics (drone control), dialogue systems (personal assistants, automated call centers), the game industry (non-player characters, computer AI), treatment design (pharmaceutical tests, crop management), complex systems control (resource allocation, flow optimization), and more.

Some RL achievements are quite compelling. For example, Stanford University's RL team learned how to control a reduced-scale model helicopter, and even taught it new aerobatics. Orange Labs deployed the first commercial dialogue system optimized with RL. DeepMind invented DQN, the first deep RL algorithm capable of playing Atari games at human skill level using visual inputs, and trained a Go AI solely by playing it against itself; it remains undefeated in its Go matches against human players.

Despite such remarkable achievements, applying RL to most real-world scenarios remains a challenge, for several reasons. Deep RL algorithms are not sample efficient: they require billions of samples to obtain their results, and collecting such astronomical numbers of samples is not feasible in real-world applications. RL also falls short on moral constraints. The algorithms need to be safe: they have to be able to learn in real-life settings without risking lives or equipment. They need to be fair: they must not discriminate against people. Finally, they need to be reliable and deliver consistently solid results.

This blog post focuses on reliability in reinforcement learning. Deep RL algorithms are impressive, but only when they work. In reality, they are largely unreliable. Even worse, two runs with different random seeds can yield very different results because of the stochasticity of the reinforcement learning process. We propose two ways of mitigating this.

The first idea (published last year at ICLR) is quite simple: if one algorithm is not reliable, we train several of them and use the best one. Figure 1 illustrates the algorithm selection process. At the beginning of each episode, the algorithm selector picks an algorithm from the portfolio. The selected algorithm outputs a policy that is used for the entirety of the episode. Then, in the green area, the standard RL loop runs until the episode completes. The generated trajectory is recorded and fed to the algorithms for future training. The performance measure is sent to the algorithm selector, so that it selects the fittest algorithms in the future. A rough sketch of this loop is given below.

### Algorithm Selection
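To make the loop in Figure 1 concrete, here is a minimal sketch of the episode-level selection process in Python. It is illustrative only: the environment interface (`reset`/`step`), the portfolio members' `current_policy` and `train` methods, and the UCB-style bandit selector are assumptions made for the example, not the exact mechanism of the ICLR paper.

```python
import math


class UCBSelector:
    """Episode-level algorithm selector using a UCB-style bandit rule (illustrative choice)."""

    def __init__(self, n_algos, exploration=2.0):
        self.counts = [0] * n_algos      # episodes played by each algorithm
        self.sums = [0.0] * n_algos      # cumulative return earned by each algorithm
        self.exploration = exploration

    def select(self):
        # Play each algorithm once before applying the UCB formula.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        total = sum(self.counts)
        scores = [
            self.sums[i] / self.counts[i]
            + math.sqrt(self.exploration * math.log(total) / self.counts[i])
            for i in range(len(self.counts))
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, i, episode_return):
        self.counts[i] += 1
        self.sums[i] += episode_return


def run_episode(env, policy):
    """Standard RL loop (the green area in Figure 1): one episode under a fixed policy."""
    trajectory, episode_return = [], 0.0
    state, done = env.reset(), False
    while not done:
        action = policy(state)
        state_next, reward, done = env.step(action)   # simplified env interface (assumption)
        trajectory.append((state, action, reward, state_next, done))
        episode_return += reward
        state = state_next
    return trajectory, episode_return


def algorithm_selection(env, portfolio, n_episodes=1000):
    """Run the episode-level selection loop over a portfolio of RL algorithms."""
    selector = UCBSelector(len(portfolio))
    for _ in range(n_episodes):
        i = selector.select()                      # pick an algorithm for this episode
        policy = portfolio[i].current_policy()     # it provides the policy for the whole episode
        trajectory, episode_return = run_episode(env, policy)
        for algo in portfolio:                     # the trajectory is fed to all algorithms for training
            algo.train(trajectory)
        selector.update(i, episode_return)         # performance feeds back into the selector
    return portfolio, selector
```

The key point the sketch tries to capture is that every algorithm in the portfolio learns from every recorded trajectory, regardless of which one generated it, while the selector only tracks which algorithm's policies earn the highest returns and favors the fittest ones over time.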