Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds

Published Online:https://doi.org/10.1287/moor.2022.1266

We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov decision process (MDP) is communicating with a finite, although unknown, diameter. Our main result is a high probability regret upper bound of O˜(DSAT) for any communicating MDP with S states, A actions, and diameter D. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average reward policy in time horizon T. This result closely matches the known lower bound of Ω(DSAT). Our techniques involve proving some novel results about the anti-concentration of Dirichlet distribution, which may be of independent interest.

Funding: This work was supported in part by an NSF CAREER award [CMMI 1846792] awarded to author S. Agrawal.

INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.