Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds

Shipra Agrawal
Shipra Agrawal
[email protected]
https://orcid.org/0000-0003-4486-3871
Department of Industrial Engineering and Operations Research, Columbia University, New York, New York 10027
Search for more papers by this author
,
Randy Jia
Corresponding Author
Randy Jia
[email protected]
https://orcid.org/0000-0002-7101-9572
Department of Industrial Engineering and Operations Research, Columbia University, New York, New York 10027
Search for more papers by this author

Department of Industrial Engineering and Operations Research, Columbia University, New York, New York 10027

Search for more papers by this author

Randy Jia

Corresponding Author

Randy Jia

[email protected]

https://orcid.org/0000-0002-7101-9572

Department of Industrial Engineering and Operations Research, Columbia University, New York, New York 10027

Search for more papers by this author

Published Online:6 May 2022https://doi.org/10.1287/moor.2022.1266

References

[1] Abbasi-Yadkori Y, Szepesvari C (2014) Bayesian optimal control of smoothly parameterized systems: The lazy posterior sampling algorithm. Preprint, submitted June 16, https://arxiv.org/abs/1406.3926.Google Scholar
[2] Abramowitz M, Stegun IA (1964) Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables, vol. 55 (Courier Corporation).Google Scholar
[3] Agrawal S, Goyal N (2012) Analysis of Thompson sampling for the multi-armed bandit problem. Mannor S, Srebro N, Williamson RC, eds. Proc. 25th Annual Conf. on Learn. Theory (PMLR), 39.1–39.26.Google Scholar
[4] Agrawal S, Goyal N (2013a) Further optimal regret bounds for Thompson sampling. Carvalho CM, Ravikumar P, eds. Proc. 16th Internat. Conf. Artificial Intelligence and Statistics (PMLR), 99–107.Google Scholar
[5] Agrawal S, Goyal N (2013b) Thompson sampling for contextual bandits with linear payoffs. Dasgupta S, McAllester D, eds. Proc. 30th Internat. Conf. on Machine Learn. (PMLR).Google Scholar
[6] Albert I, Denis JB (2012) Dirichlet and multinomial distributions: Properties and uses in jags. Unite Mathematiques et Informatique Appliquees (INRA), 2012–2015.Google Scholar
[7] Asmuth J, Li L, Littman ML, Nouri A, Wingate D (2009) A Bayesian sampling approach to exploration in reinforcement learning. Proc. 25th Conf. on Uncertainty in Artificial Intelligence (AUAI Press), 19–26.Google Scholar
[8] Azar MG, Osband I, Munos R (2017) Minimax regret bounds for reinforcement learning. Preprint, submitted July 1, https://arxiv.org/abs/1703.05449.Google Scholar
[9] Bartlett PL, Tewari A (2009) REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. Proc. 25th Conf. on Uncertainty in Artificial Intelligence (AUAI Press), 35–42.Google Scholar
[10] Brafman RI, Tennenholtz M (2002) R-max-a general polynomial time algorithm for near-optimal reinforcement learning. J. Machine Learn. Res. 3(Oct):213–231.Google Scholar
[11] Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.Crossref, Google Scholar
[12] Bubeck S, Liu CY (2013) Prior-free and prior-dependent regret bounds for Thompson sampling. Adv. Neural Inform. Processing Systems 26:638–646.Google Scholar
[13] Burnetas AN, Katehakis MN (1997) Optimal adaptive policies for Markov decision processes. Math. Oper. Res. 22(1):222–255.Link, Google Scholar
[14] Chapelle O, Li L (2011) An empirical evaluation of Thompson sampling. Adv. Neural Inf. Processing Systems 24:2249–2257.Google Scholar
[15] Dann C, Brunskill E (2015) Sample complexity of episodic fixed-horizon reinforcement learning. Adv. Neural Inform. Processing Systems 28:2818–2826.Google Scholar
[16] Fonteneau R, Korda N, Munos R (2013) An optimistic posterior sampling strategy for bayesian reinforcement learning. Proc. NIPS Workshop on Bayesian Optimization.Google Scholar
[17] Fruit R, Pirotta M, Lazaric A (2018) Near optimal exploration-exploitation in non-communicating markov decision processes. Adv. Neural Inform. Processing Systems 31:31.Google Scholar
[18] Fruit R, Pirotta M, Lazaric A (2020) Improved analysis of ucrl2 with empirical bernstein inequality. Preprint, submitted July 10, https://arxiv.org/abs/2007.05456.Google Scholar
[19] Grinstead CM, Snell JL (2012) Introduction to Probability (American Mathematical Society, Providence, RI).Google Scholar
[20] Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(Apr):1563–1600.Google Scholar
[21] Kakade SM, Wang M, Yang LF (2018) Variance reduction methods for sublinear reinforcement learning. Preprint, submitted June 27, https://arxiv.org/abs/1802.09184.Google Scholar
[22] Kakade SM, et al. (2003) On the sample complexity of reinforcement learning. PhD thesis, University of London, London.Google Scholar
[23] Kaufmann E, Korda N, Munos R (2012) Thompson sampling: An optimal finite time analysis. Proc. Internat. Conf. on Algorithmic Learn. Theory (Springer).Google Scholar
[24] Kearns MJ, Singh SP (1999) Finite-sample convergence rates for Q-learning and indirect algorithms. Adv. Neural Inform. Processing Systems 11:996–1002.Google Scholar
[25] Kleinberg R, Slivkins A, Upfal E (2008) Multi-armed bandits in metric spaces. Proc. 40th Annual ACM Sympos. on Theory of Comput. (ACM, New York), 681–690.Google Scholar
[26] Ortner R (2020) Regret bounds for reinforcement learning via Markov chain concentration. J. Artificial Intelligence Res. 67:115–128.Crossref, Google Scholar
[27] Osband I, Van Roy B (2017) Why is posterior sampling better than optimism for reinforcement learning. Preprint, submitted June 13, https://arxiv.org/abs/1607.00215.Google Scholar
[28] Osband I, Russo D, Van Roy B (2013) (More) efficient reinforcement learning via posterior sampling. Adv. Neural Inform. Processing Systems 26:3003–3011.Google Scholar
[29] Osband I, Van Roy B, Wen Z (2016) Generalization and exploration via randomized value functions. Preprint, submitted February 15, https://arxiv.org/abs/1402.0635.Google Scholar
[30] Ouyang Y, Gagrani M, Nayyar A, Jain R (2017) Learning unknown Markov decision processes: A Thompson sampling approach. Preprint, submitted September 14, https://arxiv.org/abs/1709.04570.Google Scholar
[31] Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, New York).Google Scholar
[32] Russo D, Van Roy B (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.Link, Google Scholar
[33] Russo D, Van Roy B (2016) An information-theoretic analysis of Thompson sampling. J. Machine Learn. Res. 17(4):1–30.Google Scholar
[34] Seldin Y, Laviolette F, Cesa-Bianchi N, Shawe-Taylor J, Auer P (2012) PAC-Bayesian inequalities for martingales. IEEE Trans. Inform. Theory 58(12):7086–7093.Crossref, Google Scholar
[35] Shevtsova IG (2010) An improvement of convergence rate estimates in the Lyapunov theorem. Doklady Math. 82(3):862–864.Crossref, Google Scholar
[36] Strehl AL, Littman ML (2005) A theoretical analysis of model-based interval estimation. Proc. 22nd Internat. Conf. on Machine Learn. (ACM, New York), 856–863.Google Scholar
[37] Strehl AL, Littman ML (2008) An analysis of model-based interval estimation for Markov decision processes. J. Comput. System Sci. 74(8):1309–1331.Crossref, Google Scholar
[38] Talebi MS, Maillard OA (2018) Variance-aware regret bounds for undiscounted reinforcement learning in mdps. Algorithmic Learning Theory (PMLR), 770–805.Google Scholar
[39] Tewari A, Bartlett PL (2008) Optimistic linear programming gives logarithmic regret for irreducible MDPs. Adv. Neural Inform. Processing Systems 20:1505–1512.Google Scholar
[40] Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.Crossref, Google Scholar
[41] Zhang Z, Ji X (2019) Regret minimization for reinforcement learning by evaluating the optimal bias function. Preprint, submitted December 28, https://arxiv.org/abs/1906.05110.Google Scholar

cover image Mathematics of Operations Research

Volume 48, Issue 1

February 2023

Pages 1-602, C2

Article Information

Metrics

Information

Received:September 26, 2020
Accepted:February 12, 2022
Published Online:May 06, 2022

Cite as

Shipra Agrawal, Randy Jia (2022) Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds. Mathematics of Operations Research 48(1):363-392.

https://doi.org/10.1287/moor.2022.1266

Keywords

Acknowledgments

The authors thank Tor Lattimore for pointing out a mistake in an earlier version of this work and Ian Osband for the fruitful discussions toward resolving the said mistake.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds

References

Volume 48, Issue 1

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News