{"id":852774,"date":"2022-06-16T08:00:26","date_gmt":"2022-06-16T15:00:26","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=852774"},"modified":"2026-04-03T11:44:23","modified_gmt":"2026-04-03T18:44:23","slug":"spibb","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/spibb\/","title":{"rendered":"Safe Policy Improvement with Baseline Bootstrapping"},"content":{"rendered":"
\n\t
\n\t\t
\n\t\t\t\t\t<\/div>\n\t\t\n\t\t
\n\t\t\t\n\t\t\t
\n\t\t\t\t\n\t\t\t\t
\n\t\t\t\t\t\n\t\t\t\t\t
\n\t\t\t\t\t\t
\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n

Safe Policy Improvement with Baseline Bootstrapping<\/h1>\n\n\n\n

<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n

In this umbrella project, we investigate a class of conservative offline RL algorithms that use uncertainty estimators to decide whether they can trust their predictions enough to optimize the policy, or whether they should instead fall back on the policy that was used to collect the dataset.<\/p>\n\n\n\n

This project focuses on Offline RL<\/span><\/a> algorithmic development in the space of conservative algorithms, i.e.<\/em> algorithms that constrain the set of candidate policies so that it remains close to the behavioral policy (also called the baseline). Our algorithmic contributions to the field have focused on the SPIBB*<\/sup> algorithmic family, which offers guarantees on the policy improvement granted by the trained policy as compared to the behavioral policy (see blog post<\/a>):<\/p>\n\n\n\n
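The constraint described above can be illustrated with a minimal sketch of the Pi_b-SPIBB improvement step: state-action pairs seen fewer than a threshold number of times in the dataset are "bootstrapped", meaning the trained policy simply copies the baseline's probabilities there, and the remaining probability mass is optimized greedily over the well-estimated actions. The function name, tabular representation, and threshold parameter `n_wedge` below are illustrative assumptions, not the project's actual implementation.

```python
import numpy as np

def spibb_policy(pi_b, Q, counts, n_wedge):
    """One Pi_b-SPIBB policy-improvement step (illustrative sketch).

    pi_b    : (S, A) baseline (behavioral) policy probabilities
    Q       : (S, A) estimated action values
    counts  : (S, A) state-action visit counts in the dataset
    n_wedge : minimum count below which we fall back on the baseline
    """
    S, A = pi_b.shape
    pi = np.zeros_like(pi_b)
    for s in range(S):
        boot = counts[s] < n_wedge           # uncertain (bootstrapped) actions
        pi[s, boot] = pi_b[s, boot]          # copy the baseline where uncertain
        free_mass = 1.0 - pi[s, boot].sum()  # probability mass left to optimize
        if (~boot).any():
            # greedy over actions whose value estimates we trust
            best = np.flatnonzero(~boot)[np.argmax(Q[s, ~boot])]
            pi[s, best] += free_mass
        else:
            pi[s] = pi_b[s]                  # everything uncertain: pure baseline
    return pi
```

For example, with baseline `[0.2, 0.3, 0.5]`, values `[0, 1, 2]`, counts `[1, 10, 10]`, and threshold 5, the first action is bootstrapped (kept at 0.2) and the remaining 0.8 of mass goes to the highest-value trusted action. This per-pair fallback is what yields the safe-policy-improvement guarantee relative to the baseline.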