Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds

Shipra Agrawal
https://orcid.org/0000-0003-4486-3871
Department of Industrial Engineering and Operations Research, Columbia University, New York, New York 10027

Randy Jia (Corresponding Author)
https://orcid.org/0000-0002-7101-9572
Department of Industrial Engineering and Operations Research, Columbia University, New York, New York 10027

Published Online: 6 May 2022
https://doi.org/10.1287/moor.2022.1266

Abstract

We present an algorithm based on posterior sampling (also known as Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov decision process (MDP) is communicating with a finite, although unknown, diameter. Our main result is a high-probability regret upper bound of $\tilde{O}(DS\sqrt{AT})$ for any communicating MDP with $S$ states, $A$ actions, and diameter $D$. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average-reward policy over time horizon $T$. This result closely matches the known lower bound of $\Omega(\sqrt{DSAT})$. Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
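
As a hedged illustration of the quantities named in the abstract, the regret and the gap between the two bounds can be written out as follows. The symbols $\mathcal{R}(T)$, $\rho^*$, and $r_t$ are notation assumed here for illustration and are not fixed by the abstract: $\rho^*$ stands for the optimal long-run average reward, and $r_t$ for the reward the algorithm collects at step $t$.

$$\mathcal{R}(T) = T\rho^* - \sum_{t=1}^{T} r_t, \qquad \frac{DS\sqrt{AT}}{\sqrt{DSAT}} = \sqrt{DS}.$$

Under this reading, the upper bound exceeds the known lower bound by a factor of $\sqrt{DS}$, up to the logarithmic terms hidden by $\tilde{O}$.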

Funding: This work was supported in part by an NSF CAREER award [CMMI 1846792] awarded to author S. Agrawal.

Mathematics of Operations Research

Volume 48, Issue 1

February 2023

Pages 1-602, C2

Article Information

Received: September 26, 2020
Accepted: February 12, 2022
Published Online: May 06, 2022

Cite as

Shipra Agrawal, Randy Jia (2022) Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds. Mathematics of Operations Research 48(1):363-392.

https://doi.org/10.1287/moor.2022.1266

Acknowledgments

The authors thank Tor Lattimore for pointing out a mistake in an earlier version of this work and Ian Osband for fruitful discussions toward resolving that mistake.
