Offline Reinforcement Learning Algorithms

This page describes the algorithmic landscape of Offline RL and lists some of the algorithmic development efforts made by MSR in this space.

In a tutorial lecture on Offline RL, we analyze its algorithmic landscape and classify the algorithms into five categories:

  • Multiple-MDP algorithms account for the multiplicity of MDPs that are plausible given the dataset.
  • Pessimistic algorithms modify the value function with a term that penalizes actions with high uncertainty.
  • Conservative algorithms constrain the set of candidate policies so that it remains close to the behavioral policy.
  • Early-stopping algorithms limit the number of updates allowed when training the new policy.
  • Reward-conditioned supervised learning learns a model of the distribution of actions that yield a target return.
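
To make the pessimistic category more concrete, the following is a minimal Python sketch, not one of the MSR algorithms referred to on this page. It assumes a tabular setting and uses the visit count of each state-action pair as a stand-in for uncertainty, with a penalty of beta / sqrt(count) subtracted from the reward. The function name pessimistic_q_values, the parameter beta, and the count-based penalty are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def pessimistic_q_values(transitions, gamma=0.9, beta=1.0, n_iters=200):
    """Tabular Q iteration on a fixed dataset with a count-based penalty.

    transitions: list of (state, action, reward, next_state, done) tuples.
    beta: hypothetical penalty scale; the penalty is beta / sqrt(count).
    """
    # Count how often each state-action pair appears in the dataset.
    counts = defaultdict(int)
    seen_actions = defaultdict(set)
    for s, a, _, _, _ in transitions:
        counts[(s, a)] += 1
        seen_actions[s].add(a)

    q = defaultdict(float)
    for _ in range(n_iters):
        sums = defaultdict(float)
        for s, a, r, s2, done in transitions:
            next_value = 0.0
            # Bootstrap only from actions observed in the next state.
            if not done and seen_actions[s2]:
                next_value = max(q[(s2, a2)] for a2 in seen_actions[s2])
            # Rarely observed pairs receive a larger penalty (assumption).
            penalty = beta / math.sqrt(counts[(s, a)])
            sums[(s, a)] += r - penalty + gamma * next_value
        # Average the penalized targets over the samples of each pair.
        q = defaultdict(
            float, {sa: total / counts[sa] for sa, total in sums.items()}
        )
    return dict(q)


# Toy dataset: action "a" is well covered, action "b" is seen only once.
data = [("s", "a", 1.0, "s", True)] * 4 + [("s", "b", 1.2, "s", True)]
penalized = pessimistic_q_values(data, beta=1.0)
unpenalized = pessimistic_q_values(data, beta=0.0)
```

In this toy dataset, the unpenalized estimate ranks "b" above "a" on the strength of a single sample, while the penalized estimate ranks the well-covered action "a" higher. A practical method would replace the count-based term with an uncertainty estimate suited to the function approximator in use.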

At MSR, we have pursued several algorithmic development efforts: