{"id":615297,"date":"2019-10-15T06:28:30","date_gmt":"2019-10-15T13:28:30","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&#038;p=615297"},"modified":"2022-06-16T08:11:31","modified_gmt":"2022-06-16T15:11:31","slug":"offlinerl","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/offlinerl\/","title":{"rendered":"Offline Reinforcement Learning"},"content":{"rendered":"<p>This page introduces the research area of Offline Reinforcement Learning (also sometimes called Batch Reinforcement Learning). It consists of training a target policy from a fixed dataset of trajectories collected with a behavioral policy. In contrast to classic Reinforcement Learning (RL), the learning agent cannot interact with the environment, which prevents the use of the virtuous trial-and-error feedback loop.<\/p>\n<p>More precisely, in standard RL, the agent interacts directly with the environment while learning. In contrast, in Offline RL, the dataset of trajectories is collected by a separate agent, called the behavioral agent, over which the learner has no control. First, the behavioral agent interacts with the environment by following a behavioral policy. These interactions form trajectories that are stored in a dataset. Then, using only this trajectory dataset, the Offline RL algorithm must produce a new policy without direct access to the environment. 
The figure below depicts this process.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-852789 size-large\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/10\/IMG_20220615_171002-1024x259.jpg\" alt=\"offline rl diagram\" width=\"1024\" height=\"259\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/10\/IMG_20220615_171002-1024x259.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/10\/IMG_20220615_171002-300x76.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/10\/IMG_20220615_171002-768x194.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/10\/IMG_20220615_171002-1536x388.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/10\/IMG_20220615_171002-2048x518.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/10\/IMG_20220615_171002-240x61.jpg 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>At MSR, we have recorded a <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"http:\/\/aka.ms\/offrl\">tutorial lecture<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> on Offline RL and have contributed to <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"http:\/\/aka.ms\/offlinerlalgo\">algorithmic development<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"http:\/\/aka.ms\/offlinerltheory\">theoretical foundations<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> for Offline RL.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This page introduces the research area of Offline Reinforcement Learning (also sometimes called Batch Reinforcement Learning). 
It consists of training a target policy from a fixed dataset of trajectories collected with a behavioral policy. In contrast to classic Reinforcement Learning (RL), the learning agent cannot interact with the environment, which prevents the use of the virtuous [&hellip;]<\/p>\n","protected":false},"featured_media":683142,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13556],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-615297","msr-project","type-msr-project","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[581008,600384,659082,797209,888921],"related-downloads":[],"related-videos":[],"related-groups":[663258,896463],"related-events":[],"related-opportunities":[],"related-posts":[590815],"related-articles":[],"tab-content":[],"slides":[],"related-researchers":[],"msr_research_lab":[437514],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/615297","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":10,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/615297\/revisions"}],"predecessor-version":[{"id":852792,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/615297\/revisions\/852792"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/683142"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/me
dia?parent=615297"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=615297"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=615297"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=615297"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=615297"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}