Open Problem: First-Order Regret Bounds for Contextual Bandits

Conference on Learning Theory

Published by PMLR

We describe two open problems related to first-order regret bounds for contextual bandits. The first asks for an algorithm with a regret bound of \(\tilde{O}(\sqrt{K L^* \ln N})\), where there are \(K\) actions, \(N\) policies, and \(L^*\) is the cumulative loss of the best policy. The second asks for an optimization-oracle-efficient algorithm with regret \(\tilde{O}\big((L^*)^{2/3}\,\mathrm{poly}(K,\ln(N/\delta))\big)\). We describe some positive results, such as an inefficient algorithm for the second problem, and some partial negative results.
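
For reference, the targets above can be written out in standard adversarial contextual-bandit notation. The regret definition and loss range below are the usual conventions for this setting, stated here as an assumption rather than quoted from the abstract; the paper itself gives the precise protocol.

\[
L^* \;=\; \min_{\pi \in \Pi}\,\sum_{t=1}^{T}\ell_t\big(\pi(x_t)\big), \qquad |\Pi| = N, \qquad
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T}\ell_t(a_t) \;-\; L^*, \qquad \ell_t \in [0,1]^K,
\]
\[
\text{Problem 1: } \ \mathrm{Regret}_T = \tilde{O}\!\left(\sqrt{K L^* \ln N}\right), \qquad
\text{Problem 2: } \ \mathrm{Regret}_T = \tilde{O}\!\left((L^*)^{2/3}\,\mathrm{poly}(K,\ln(N/\delta))\right)\ \text{with an optimization oracle.}
\]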