{"id":182518,"date":"2008-05-07T00:00:00","date_gmt":"2009-10-31T09:43:48","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/msr-research-item\/reachability-under-uncertainty-bayesian-inverse-reinforcement-learning\/"},"modified":"2016-09-09T09:42:51","modified_gmt":"2016-09-09T16:42:51","slug":"reachability-under-uncertainty-bayesian-inverse-reinforcement-learning","status":"publish","type":"msr-video","link":"https:\/\/www.microsoft.com\/en-us\/research\/video\/reachability-under-uncertainty-bayesian-inverse-reinforcement-learning\/","title":{"rendered":"Reachability Under Uncertainty & Bayesian Inverse Reinforcement Learning"},"content":{"rendered":"
\n

This talk will present two advances made recently in my group. First, I will introduce a new network reachability problem where the goal is to find the most reliable path between two nodes in a network, represented as a directed acyclic graph. Individual edges within this network may fail according to certain probabilities, and these failure probabilities may depend on the values of one or more hidden variables. I will explain why this problem is harder than similar problems encountered in standard probabilistic inference. I will also present an efficient approximation algorithm for this problem, and discuss open issues.
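As a rough illustration (not the algorithm presented in the talk), the sketch below handles the much simpler case in which edges fail independently: path reliability is then just the product of edge survival probabilities, so the most reliable path in a DAG can be found by dynamic programming over a topological order. The hidden-variable coupling described above is exactly what breaks this simple approach. The function name and the toy edge dictionary are my own assumptions.

```python
from collections import defaultdict

def most_reliable_path(edges, source, target):
    """Most reliable source->target path in a DAG whose edges fail
    independently. `edges` maps (u, v) -> survival probability."""
    graph = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for (u, v), p in edges.items():
        graph[u].append((v, p))
        indeg[v] += 1
        nodes.update((u, v))

    # Topological order via Kahn's algorithm.
    order, stack = [], [n for n in nodes if indeg[n] == 0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v, _ in graph[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)

    # DP over the topological order: best[v] = highest probability of
    # reaching v from the source along a single surviving path.
    best = {n: 0.0 for n in nodes}
    parent = {}
    best[source] = 1.0
    for u in order:
        for v, p in graph[u]:
            if best[u] * p > best[v]:
                best[v] = best[u] * p
                parent[v] = u

    # Reconstruct the best path by walking parents back from the target.
    path, node = [target], target
    while node != source:
        node = parent[node]
        path.append(node)
    return best[target], path[::-1]

# Toy example: two routes from 's' to 't' with different reliabilities.
edges = {("s", "a"): 0.9, ("a", "t"): 0.8, ("s", "b"): 0.99, ("b", "t"): 0.5}
print(most_reliable_path(edges, "s", "t"))  # ~0.72 via s -> a -> t
```

When failures depend on shared hidden variables, the product decomposition above no longer holds, which is why the problem calls for the approximation algorithm discussed in the talk.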

The second advance is a generalization of Inverse Reinforcement Learning (IRL). IRL is the problem of learning the reward function underlying a Markov Decision Process, given the dynamics of the system and the behaviour of an expert. It is motivated by situations where knowledge of the rewards is a goal in itself (as in preference elicitation) and by the task of apprenticeship learning (learning policies from an expert). In this part of the talk I will show how to combine prior knowledge and evidence from the expert's actions to derive a probability distribution over the space of reward functions. I will present efficient algorithms that find solutions for the reward learning and apprenticeship learning tasks that generalize well over these distributions. Experimental results show strong improvements for these methods over previous heuristic-based approaches.
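A schematic sketch of the Bayesian view of IRL described above, offered as my own illustration rather than the algorithms from the talk: prior knowledge enters as a prior over candidate reward functions, and the expert's observed state-action pairs enter through a Boltzmann-style likelihood, yielding a posterior over rewards. The function names, the discrete set of candidate rewards, and the toy MDP are assumptions made for illustration.

```python
import numpy as np

def q_values(P, R, gamma=0.95, iters=200):
    """Q-function for a small finite MDP via value iteration.
    P[a] is the |S| x |S| transition matrix for action a; R is the state reward."""
    n_states = R.shape[0]
    Q = np.zeros((len(P), n_states))
    for _ in range(iters):
        V = Q.max(axis=0)
        Q = np.stack([R + gamma * P[a] @ V for a in range(len(P))])
    return Q

def reward_posterior(P, candidate_rewards, prior, demos, alpha=5.0):
    """Unnormalized-then-normalized posterior over a discrete set of candidate
    reward functions, with a Boltzmann model of the expert's choices:
    P(a | s, R) proportional to exp(alpha * Q*(s, a; R))."""
    posterior = np.array(prior, dtype=float)
    for i, R in enumerate(candidate_rewards):
        Q = q_values(P, R)
        for s, a in demos:
            # Likelihood of the observed action under this candidate reward.
            probs = np.exp(alpha * (Q[:, s] - Q[:, s].max()))
            posterior[i] *= probs[a] / probs.sum()
    return posterior / posterior.sum()

# Toy usage: 2 states, 2 actions; the expert repeatedly picks action 1 in state 0.
P = [np.array([[1.0, 0.0], [0.0, 1.0]]),   # action 0: stay put
     np.array([[0.0, 1.0], [0.0, 1.0]])]   # action 1: move to state 1
candidates = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(reward_posterior(P, candidates, prior=[0.5, 0.5], demos=[(0, 1)] * 5))
# Posterior mass concentrates on the candidate that rewards state 1.
```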