In a tutorial lecture on Offline RL, we analyze its algorithmic landscape and propose a classification into five categories:
- Multiple-MDP algorithms account for the multiplicity of MDPs that remain plausible given the data.
- Pessimistic algorithms augment the value function with a component that penalizes taking actions with high uncertainty.
- Conservative algorithms constrain the set of candidate policies so that it remains close to the behavioral policy.
- Early-stopping algorithms limit the number of updates allowed when training the new policy.
- Reward-conditioned supervised learning learns a model of the action distribution that yields a target return; a minimal sketch follows this list.
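To make the last category concrete, here is a minimal sketch of reward-conditioned supervised learning: actions are regressed on (state, return-to-go) pairs from a logged dataset, and at decision time the model is conditioned on a desired target return. The toy dataset, shapes, and linear model below are illustrative assumptions, not the setup of any specific paper.

```python
import numpy as np

# Hypothetical toy dataset of logged episodes; each step is (state, action, reward)
# with a 3-D state and scalar action/reward for simplicity (illustrative assumption).
rng = np.random.default_rng(0)
episodes = [
    [(rng.normal(size=3), rng.normal(), rng.normal()) for _ in range(20)]
    for _ in range(50)
]

# Build a supervised dataset: input = (state, return-to-go), target = action.
X, y = [], []
for ep in episodes:
    rewards = np.array([r for _, _, r in ep])
    rtg = np.cumsum(rewards[::-1])[::-1]        # return-to-go at each step
    for (s, a, _), g in zip(ep, rtg):
        X.append(np.concatenate([s, [g]]))
        y.append(a)
X, y = np.array(X), np.array(y)

# Fit a simple linear model of the action as a function of (state, return-to-go).
w, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)

def act(state, target_return):
    """Condition the learned model on a desired return to pick an action."""
    features = np.concatenate([state, [target_return], [1.0]])
    return features @ w
```

At deployment, `act` is queried with a high (but achievable) target return, which is what turns a purely supervised model into a decision-making policy.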
At MSR, we have pursued several algorithmic development efforts:
- Pessimistic algorithms: Provably Good Batch Off-Policy Reinforcement Learning Without Great Exploration [NeurIPS’20]. Assumptions are necessary when training a policy from a fixed dataset without any environment interaction. Previous offline algorithms made an unreasonable “concentrability” assumption and could fail catastrophically when it was violated. In this work, we showed that a suitably pessimistic algorithm does not require concentrability and degrades gracefully as the quality of the training dataset deteriorates. A tabular sketch of the pessimism principle appears after this list.
- Conservative algorithms: Safe Policy Improvement with Baseline Bootstrapping is a family of conservative algorithms that lets the new policy deviate from the behavioral policy only where sufficient statistical evidence is available; a count-based sketch of this idea appears after this list. This thread of research led us to develop its theoretical foundations, algorithmic improvements, additional guarantees with an estimated behavioral policy (Markovian or non-Markovian), its extension to Multi-objective Offline RL, and its implementation in deep Offline RL (with a heuristic uncertainty measure and with a trained uncertainty measure).
- Other algorithms: Heuristic-Guided Reinforcement Learning [NeurIPS’21]. The typical output of an offline RL algorithm is a learned policy. Are there other useful outputs for downstream decision-making? We can additionally extract heuristics (e.g., as commonly used in A* search) that can, for instance, accelerate subsequent RL algorithms. A good heuristic can substantially reduce the effective horizon for decision-making and thereby simplify the RL problem. In this work, we showed how to extract good heuristics (closely linked to the pessimism and conservatism principles in offline RL) and how to use them effectively (closely linked to Blackwell optimality and potential-based reward shaping); a reward-shaping sketch appears below.
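To illustrate the pessimism principle behind the first item (not the exact algorithm of the paper), here is a minimal tabular sketch: the MDP is estimated from the logged data, and Q-values are penalized in proportion to how rarely each state-action pair was observed, so the greedy policy avoids poorly covered actions. The constants and the count-based penalty are illustrative assumptions.

```python
import numpy as np

n_states, n_actions, gamma, kappa = 5, 2, 0.95, 1.0
rng = np.random.default_rng(0)

# Hypothetical logged dataset of (s, a, r, s') transitions.
dataset = [(rng.integers(n_states), rng.integers(n_actions),
            rng.normal(), rng.integers(n_states)) for _ in range(500)]

# Estimate rewards and transitions from the data, keeping visit counts.
counts = np.zeros((n_states, n_actions))
reward_sum = np.zeros((n_states, n_actions))
trans_counts = np.zeros((n_states, n_actions, n_states))
for s, a, r, s2 in dataset:
    counts[s, a] += 1
    reward_sum[s, a] += r
    trans_counts[s, a, s2] += 1

r_hat = reward_sum / np.maximum(counts, 1)
p_hat = trans_counts / np.maximum(counts, 1)[:, :, None]

# Pessimistic penalty: large where data is scarce, maximal where unseen.
penalty = kappa / np.sqrt(np.maximum(counts, 1))
penalty[counts == 0] = 1.0 / (1.0 - gamma)

# Pessimistic value iteration on the estimated MDP.
q = np.zeros((n_states, n_actions))
for _ in range(200):
    v = q.max(axis=1)
    q = r_hat - penalty + gamma * p_hat @ v

pi = q.argmax(axis=1)   # greedy policy w.r.t. the pessimistic Q-values
```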
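For the conservative family, here is a count-based sketch in the spirit of Safe Policy Improvement with Baseline Bootstrapping: the new policy may move probability mass only on state-action pairs with enough data, and elsewhere it copies the behavioral (baseline) policy. The inputs and the threshold below are illustrative assumptions.

```python
import numpy as np

def conservative_improvement(q, pi_b, counts, n_min=10):
    """q: (S, A) value estimates, pi_b: (S, A) behavioral policy,
    counts: (S, A) visit counts, n_min: statistical-evidence threshold."""
    n_states, _ = q.shape
    pi = np.zeros_like(pi_b)
    for s in range(n_states):
        trusted = counts[s] >= n_min            # enough statistical evidence
        pi[s, ~trusted] = pi_b[s, ~trusted]     # bootstrap to the baseline
        free_mass = 1.0 - pi[s].sum()           # mass we are allowed to move
        if trusted.any() and free_mass > 0:
            best = np.flatnonzero(trusted)[np.argmax(q[s, trusted])]
            pi[s, best] += free_mass            # be greedy only where trusted
    return pi
```

The threshold `n_min` directly controls how conservative the improvement is: with a very large threshold the output collapses to the behavioral policy, while a threshold of zero recovers unconstrained greedy improvement.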
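Finally, as a small illustration of how an offline-extracted heuristic can be used downstream (not the paper’s exact algorithm), here is standard potential-based reward shaping with the heuristic as the potential, which is known to preserve optimal policies. The heuristic table and environment interface are hypothetical.

```python
def shaped_reward(r, s, s_next, h, gamma=0.99, done=False):
    """Potential-based shaping: r + gamma * h(s') - h(s).
    At terminal transitions the next potential is conventionally zero."""
    next_potential = 0.0 if done else h(s_next)
    return r + gamma * next_potential - h(s)

# Example with a hypothetical heuristic table extracted from offline data.
heuristic_table = {"s0": 0.0, "s1": 0.5, "s2": 1.0}
h = lambda s: heuristic_table.get(s, 0.0)
print(shaped_reward(r=0.1, s="s0", s_next="s1", h=h))
```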