{"id":852774,"date":"2022-06-16T08:00:26","date_gmt":"2022-06-16T15:00:26","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=852774"},"modified":"2026-04-03T11:44:23","modified_gmt":"2026-04-03T18:44:23","slug":"spibb","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/spibb\/","title":{"rendered":"Safe Policy Improvement with Baseline Bootstrapping"},"content":{"rendered":"
\n\t
\n\t\t
\n\t\t\t\t\t<\/div>\n\t\t\n\t\t
\n\t\t\t\n\t\t\t
\n\t\t\t\t\n\t\t\t\t
\n\t\t\t\t\t\n\t\t\t\t\t
\n\t\t\t\t\t\t
\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n

Safe Policy Improvement with Baseline Bootstrapping<\/h1>\n\n\n\n

<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n

In this umbrella project, we investigate a class of conservative offline RL algorithms that use uncertainty estimators to decide whether they can trust their predictions enough to optimize the policy, or whether they should instead fall back on the policy that was used to collect the dataset.<\/p>\n\n\n\n

This project focuses on Offline RL<\/span><\/a> algorithmic development in the space of conservative algorithms, i.e.<\/em> algorithms that constrain the set of candidate policies so that it remains close to the behavioral policy (also called the baseline). Our algorithmic contributions to the field have focused on the SPIBB*<\/sup> algorithmic family, which offers guarantees on the policy improvement granted by the trained policy as compared to the behavioral policy (see blog post<\/a>):<\/p>\n\n\n\n
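The constraint described above can be illustrated with a minimal sketch of the Pi_b-SPIBB improvement step: state-action pairs seen fewer than a threshold number of times in the dataset are "bootstrapped", meaning the trained policy simply copies the baseline's probabilities there, and the remaining probability mass is optimized greedily over the well-estimated actions. The function name, tabular representation, and threshold parameter `n_wedge` below are illustrative assumptions, not the project's actual implementation.

```python
import numpy as np

def spibb_policy(pi_b, Q, counts, n_wedge):
    """One Pi_b-SPIBB policy-improvement step (illustrative sketch).

    pi_b    : (S, A) baseline (behavioral) policy probabilities
    Q       : (S, A) estimated action values
    counts  : (S, A) state-action visit counts in the dataset
    n_wedge : minimum count below which we fall back on the baseline
    """
    S, A = pi_b.shape
    pi = np.zeros_like(pi_b)
    for s in range(S):
        boot = counts[s] < n_wedge           # uncertain (bootstrapped) actions
        pi[s, boot] = pi_b[s, boot]          # copy the baseline where uncertain
        free_mass = 1.0 - pi[s, boot].sum()  # probability mass left to optimize
        if (~boot).any():
            # greedy over actions whose value estimates we trust
            best = np.flatnonzero(~boot)[np.argmax(Q[s, ~boot])]
            pi[s, best] += free_mass
        else:
            pi[s] = pi_b[s]                  # everything uncertain: pure baseline
    return pi
```

For example, with baseline `[0.2, 0.3, 0.5]`, values `[0, 1, 2]`, counts `[1, 10, 10]`, and threshold 5, the first action is bootstrapped (kept at 0.2) and the remaining 0.8 of mass goes to the highest-value trusted action. This per-pair fallback is what yields the safe-policy-improvement guarantee relative to the baseline.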