On the Prior Sensitivity of Thompson Sampling
- Lihong Li
Proceedings of the 27th International Conference on Algorithmic Learning Theory (ALT)
Published by Springer
The empirically successful Thompson Sampling algorithm for stochastic bandits has drawn much interest in understanding its theoretical properties. One important benefit of the algorithm is that it allows domain knowledge to be conveniently encoded as a prior distribution to balance exploration and exploitation more effectively. While it is generally believed that the algorithm’s regret is low when the prior is good and high when the prior is bad, little is known about the exact dependence. This paper is a first step towards answering this important question: focusing on a special yet representative case, we fully characterize the worst-case dependence of the algorithm’s regret on the choice of prior. As a corollary, these results also provide useful insights into the general sensitivity of the algorithm to the choice of prior. In particular, with $p$ being the prior probability mass of the true reward-generating model and $T$ the number of rounds, we prove $O(\sqrt{T/p})$ and $O(\sqrt{(1-p)T})$ regret upper bounds for the poor- and good-prior cases, respectively, as well as matching lower bounds. Our proofs rely on a fundamental property of Thompson Sampling and make heavy use of martingale theory, both of which appear novel in the Thompson Sampling literature and may be useful for studying other aspects of the algorithm’s behavior.
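To make the role of $p$ concrete, here is a minimal Python sketch of Thompson Sampling over a finite set of candidate reward models. The abstract does not spell out the paper's exact setting, so everything here is an illustrative assumption: Bernoulli rewards, a finite model class given as a table of arm means, and the hypothetical function name `thompson_sampling_finite_models`. It is meant only to show how the prior mass on the true model enters the algorithm, not to reproduce the paper's construction or its bounds.

```python
import numpy as np


def thompson_sampling_finite_models(models, prior, true_idx, horizon, rng):
    """Run Thompson Sampling over a finite model class (illustrative sketch).

    models:   (M, K) array; models[m, k] is the Bernoulli mean of arm k
              under candidate model m (assumed strictly inside (0, 1)).
    prior:    length-M array of strictly positive prior probabilities.
    true_idx: index of the model that actually generates rewards.
    Returns the cumulative (pseudo-)regret after `horizon` rounds.
    """
    models = np.asarray(models, dtype=float)
    log_post = np.log(np.asarray(prior, dtype=float))
    true_means = models[true_idx]
    best_mean = true_means.max()
    regret = 0.0

    for _ in range(horizon):
        # Normalize the posterior in a numerically stable way.
        post = np.exp(log_post - log_post.max())
        post /= post.sum()

        # Sample a model from the posterior and play its best arm.
        sampled = rng.choice(len(post), p=post)
        arm = int(np.argmax(models[sampled]))

        # Observe a Bernoulli reward drawn from the true model.
        reward = rng.random() < true_means[arm]

        # Bayesian update: multiply by each model's likelihood of the reward.
        likelihood = models[:, arm] if reward else 1.0 - models[:, arm]
        log_post += np.log(likelihood)

        regret += best_mean - true_means[arm]

    return regret


if __name__ == "__main__":
    # Two candidate models over two arms; model 0 is the true one.
    models = [[0.6, 0.4], [0.4, 0.6]]
    for p in (0.01, 0.5, 0.99):
        rng = np.random.default_rng(0)
        r = thompson_sampling_finite_models(models, [p, 1.0 - p], 0, 2000, rng)
        print(f"p = {p:.2f}: cumulative regret = {r:.1f}")
```

Varying `p` in the usage loop gives an informal feel for the prior's influence on a single run. The abstract's bounds concern worst-case regret, so one simulated run on one hand-picked instance should not be read as confirming or refuting them.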