{"id":826651,"date":"2022-03-15T06:04:38","date_gmt":"2022-03-15T13:04:38","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=826651"},"modified":"2022-03-24T14:53:45","modified_gmt":"2022-03-24T21:53:45","slug":"towards-a-generalized-policy-iteration-theorem","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/towards-a-generalized-policy-iteration-theorem\/","title":{"rendered":"Towards a generalized policy iteration theorem"},"content":{"rendered":"
\n\t
\n\t\t
\n\t\t\t\"Dr\t\t<\/div>\n\t\t\n\t\t
\n\t\t\t\n\t\t\t
\n\t\t\t\t\n\t\t\t\t
\n\t\t\t\t\t\n\t\t\t\t\t
\n\t\t\t\t\t\t
\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n

Towards a generalized policy iteration theorem<\/h1>\n\n\n\n

We intend to advance the theoretical understanding of actor-critic algorithms through the lens of policy iteration.<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n

Policy Iteration consists of a loop over two processing steps: policy evaluation and policy improvement. Policy Iteration has strong convergence guarantees when the policy evaluation is exact and the policy improvement is greedy. However, the convergence of the generalized setting, where the policy evaluation is approximate and stochastic and the policy improvement is a local update, remains an open problem, which this umbrella project intends to address.<\/p>\n\n\n\n
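As a minimal illustration of the classical setting the loop above describes (exact policy evaluation, greedy improvement), the sketch below runs tabular policy iteration on a small hypothetical two-state, two-action MDP. The transition probabilities and rewards are invented for illustration and are not drawn from the project; the generalized setting the project studies would replace the exact linear solve with an approximate, stochastic critic and the argmax with a local actor update.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor

def evaluate(policy):
    """Exact policy evaluation: solve (I - gamma * P_pi) v = r_pi."""
    P_pi = P[np.arange(2), policy]   # (2, 2) transition matrix under pi
    r_pi = R[np.arange(2), policy]   # (2,) reward vector under pi
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

def improve(v):
    """Greedy policy improvement from the state values v."""
    q = R + gamma * (P @ v)          # (2, 2) action-value table
    return q.argmax(axis=1)

policy = np.zeros(2, dtype=int)
while True:
    v = evaluate(policy)
    new_policy = improve(v)
    # Stop when the policy is greedy with respect to its own values.
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
```

With exact evaluation and greedy improvement each iteration is monotonically improving, so the loop terminates at an optimal policy after finitely many steps; it is exactly this guarantee that is lost once either step is made approximate.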

Several challenges need to be addressed, either independently or jointly:<\/p>\n\n\n\n