Dr Jekyll and Mr Hyde

Towards a generalized policy iteration theorem

We intend to advance the theoretical understanding of actor-critic algorithms through the lens of policy iteration.

Policy Iteration consists of a loop over two steps: policy evaluation and policy improvement. It enjoys strong convergence guarantees when the policy evaluation is exact and the policy improvement is greedy. However, convergence in the generalized setting, where the policy evaluation is approximate and stochastic and the policy improvement is a local update, remains an open problem, which this umbrella project intends to address.
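For reference, here is a minimal tabular sketch of this loop in its classical form (exact evaluation, greedy improvement), assuming a known transition tensor P[s, a, s'] and expected rewards R[s, a]; the names and shapes are illustrative choices:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99):
    """Classical policy iteration on a known tabular MDP.

    P: transition probabilities, shape (S, A, S)
    R: expected rewards, shape (S, A)
    """
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)           # deterministic policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly.
        P_pi = P[np.arange(n_states), pi]        # (S, S)
        R_pi = R[np.arange(n_states), pi]        # (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: greedy with respect to the current value.
        Q = R + gamma * P @ V                    # (S, A)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return pi, V
        pi = new_pi
```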

Several challenges need to be addressed, either independently or jointly:

  • Approximation in the policy evaluation: in practice, value updates do not return the exact value of the current policy,
  • Stochasticity in the updates: in practice, updates are computed from samples, either collected directly by interacting with the environment or drawn from a replay buffer,
  • Soft policy updates: although the greedy update is practical, it is often preferable for stability and exploration to perform soft updates that slowly move the current policy towards the greedy one (see the sketch after this list),
  • Function approximation: in complex domains, one must use function approximation, such as neural networks, to represent both the policy and its value.
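To make the soft-update point concrete, here is a minimal sketch, assuming a tabular stochastic policy pi[s, a] and an (approximate) action-value table Q[s, a]; the names and the step size alpha are ours, chosen for illustration:

```python
import numpy as np

def soft_policy_update(pi, Q, alpha=0.1):
    """Move a tabular stochastic policy a small step towards the greedy policy.

    pi: current policy, shape (S, A), rows sum to 1
    Q:  (approximate) action values, shape (S, A)
    alpha: step size in [0, 1]; alpha=1 recovers the greedy update
    """
    greedy = np.zeros_like(pi)
    greedy[np.arange(pi.shape[0]), Q.argmax(axis=1)] = 1.0
    return (1.0 - alpha) * pi + alpha * greedy
```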

Our first contribution (AAMAS 2022) triggered this research direction. The paper empirically explores the mismatch between the actor-critic updates used by practitioners and those prescribed by the policy gradient theorem. It offers two interpretations of the performance improvement provided by the modified update, and naturally raises the question of which properties an update should satisfy to guarantee convergence to optimality.
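One frequently discussed instance of such a mismatch concerns the discount factor in the state weighting: the policy gradient theorem prescribes a gamma**t weight on the update computed at time step t, while most implementations drop it. The sketch below contrasts the two variants on a single sampled trajectory; it is an illustration of this class of mismatch, not necessarily the exact updates studied in the paper.

```python
import numpy as np

def actor_step(theta, grad_log_pi, advantages, gamma=0.99,
               discounted_states=True, lr=1e-2):
    """One on-policy actor update from a sampled trajectory.

    grad_log_pi: list of gradients of log pi(a_t | s_t) w.r.t. theta
    advantages:  list of advantage estimates A(s_t, a_t)
    discounted_states=True  -> update prescribed by the policy gradient theorem
                               (time step t weighted by gamma**t)
    discounted_states=False -> update commonly used in practice (weight dropped)
    """
    update = np.zeros_like(theta)
    for t, (g, adv) in enumerate(zip(grad_log_pi, advantages)):
        weight = gamma**t if discounted_states else 1.0
        update += weight * adv * g
    return theta + lr * update
```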

Our second contribution (NeurIPS 2021) casts the actor-critic architecture into the generalized policy iteration framework and proves necessary and sufficient conditions for density-agnostic policy gradient updates to converge to optimality when coupled with exact policy evaluation. These conditions relate to asymptotic coverage of the state space when applying the policy update, which advocates for a specific family of algorithms that we call Dr Jekyll & Mr Hyde because of its two-sided behavior.
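As an illustration of this two-sided behavior (our reading for the purpose of a sketch, not a verbatim description of the algorithm): at the start of each trajectory, control is handed either to the current target policy (Dr Jekyll, pure exploitation) or to a purely exploratory policy, here taken to be uniform over actions (Mr Hyde), so that the state space keeps being covered.

```python
import numpy as np

def behaviour_policy_for_episode(pi, p_hyde, rng=None):
    """Pick the personality controlling the next trajectory.

    pi:     current target policy, shape (S, A)
    p_hyde: probability of handing control to the exploratory personality
    Returns a function mapping a state index to an action index.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_actions = pi.shape[1]
    if rng.random() < p_hyde:
        # Mr Hyde: pure exploration, uniform over actions in this sketch.
        return lambda s: rng.integers(n_actions)
    # Dr Jekyll: pure exploitation of the current policy.
    return lambda s: rng.choice(n_actions, p=pi[s])
```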

Our third contribution (AISTATS 2022) shows that similar results can be obtained for updates that are not policy gradients, and that such updates can mitigate the slow learning from which policy gradients may suffer.
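To illustrate the kind of slowness at stake (an illustration under a softmax parameterization, not the paper's exact update): the policy gradient with respect to an action's logit carries a pi(a|s) factor, so an optimal action whose current probability is near zero is reinforced very slowly; an update that drops this factor, for instance pushing the logits directly towards the greedy action, does not.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_step(logits, Q, lr=0.5):
    """Softmax policy gradient on the logits of a single state.

    The update on each logit carries a pi(a|s) factor, so actions with
    near-zero probability receive a vanishingly small update.
    """
    pi = softmax(logits)
    advantage = Q - pi @ Q
    return logits + lr * pi * advantage

def direct_step(logits, Q, lr=0.5):
    """A non-gradient alternative (illustrative only): push the logits towards
    the greedy action regardless of its current probability."""
    target = np.zeros_like(logits)
    target[Q.argmax()] = 1.0
    return logits + lr * (target - softmax(logits))
```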

Our latest contribution (submitted to JMLR) establishes the global optimality and convergence rate of an off-policy actor-critic algorithm in the tabular setting, without using density ratios to correct the discrepancy between the state distributions of the behavior policy and the target policy. Our work goes beyond existing results on the optimality of policy gradient methods: those results assume the exact policy gradient is used to update the policy parameters, whereas we use an approximate and stochastic update step.
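Below is a minimal tabular sketch of what such an off-policy actor-critic step can look like, under our own illustrative choices (expected-SARSA critic, soft actor step at the sampled state) rather than the submitted algorithm: transitions come from an arbitrary behavior policy, the critic and actor updates are approximate and stochastic, and no density ratio is used to reweight the state distribution.

```python
import numpy as np

def off_policy_actor_critic_step(Q, pi, transition, gamma=0.99,
                                 lr_critic=0.1, lr_actor=0.01):
    """One stochastic update of a tabular off-policy actor-critic.

    Q:  critic table, shape (S, A)
    pi: target policy table, shape (S, A)
    transition: (s, a, r, s_next) collected under an arbitrary behaviour policy
    No density ratio corrects the state distribution.
    """
    s, a, r, s_next = transition
    # Critic: stochastic expected-SARSA update towards the target policy's value.
    td_target = r + gamma * pi[s_next] @ Q[s_next]
    Q[s, a] += lr_critic * (td_target - Q[s, a])
    # Actor: approximate, stochastic soft step towards greediness w.r.t. Q,
    # applied only at the sampled state.
    greedy = np.zeros_like(pi[s])
    greedy[Q[s].argmax()] = 1.0
    pi[s] += lr_actor * (greedy - pi[s])
    return Q, pi
```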