Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds

Published Online:https://doi.org/10.1287/moor.2022.1266

References

  • [1] Abbasi-Yadkori Y, Szepesvari C (2014) Bayesian optimal control of smoothly parameterized systems: The lazy posterior sampling algorithm. Preprint, submitted June 16, https://arxiv.org/abs/1406.3926.Google Scholar
  • [2] Abramowitz M, Stegun IA (1964) Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables, vol. 55 (Courier Corporation).Google Scholar
  • [3] Agrawal S, Goyal N (2012) Analysis of Thompson sampling for the multi-armed bandit problem. Mannor S, Srebro N, Williamson RC, eds. Proc. 25th Annual Conf. on Learn. Theory (PMLR), 39.1–39.26.Google Scholar
  • [4] Agrawal S, Goyal N (2013a) Further optimal regret bounds for Thompson sampling. Carvalho CM, Ravikumar P, eds. Proc. 16th Internat. Conf. Artificial Intelligence and Statistics (PMLR), 99–107.Google Scholar
  • [5] Agrawal S, Goyal N (2013b) Thompson sampling for contextual bandits with linear payoffs. Dasgupta S, McAllester D, eds. Proc. 30th Internat. Conf. on Machine Learn. (PMLR).Google Scholar
  • [6] Albert I, Denis JB (2012) Dirichlet and multinomial distributions: Properties and uses in jags. Unite Mathematiques et Informatique Appliquees (INRA), 2012–2015.Google Scholar
  • [7] Asmuth J, Li L, Littman ML, Nouri A, Wingate D (2009) A Bayesian sampling approach to exploration in reinforcement learning. Proc. 25th Conf. on Uncertainty in Artificial Intelligence (AUAI Press), 19–26.Google Scholar
  • [8] Azar MG, Osband I, Munos R (2017) Minimax regret bounds for reinforcement learning. Preprint, submitted July 1, https://arxiv.org/abs/1703.05449.Google Scholar
  • [9] Bartlett PL, Tewari A (2009) REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. Proc. 25th Conf. on Uncertainty in Artificial Intelligence (AUAI Press), 35–42.Google Scholar
  • [10] Brafman RI, Tennenholtz M (2002) R-max-a general polynomial time algorithm for near-optimal reinforcement learning. J. Machine Learn. Res. 3(Oct):213–231.Google Scholar
  • [11] Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.CrossrefGoogle Scholar
  • [12] Bubeck S, Liu CY (2013) Prior-free and prior-dependent regret bounds for Thompson sampling. Adv. Neural Inform. Processing Systems 26:638–646.Google Scholar
  • [13] Burnetas AN, Katehakis MN (1997) Optimal adaptive policies for Markov decision processes. Math. Oper. Res. 22(1):222–255.LinkGoogle Scholar
  • [14] Chapelle O, Li L (2011) An empirical evaluation of Thompson sampling. Adv. Neural Inf. Processing Systems 24:2249–2257.Google Scholar
  • [15] Dann C, Brunskill E (2015) Sample complexity of episodic fixed-horizon reinforcement learning. Adv. Neural Inform. Processing Systems 28:2818–2826.Google Scholar
  • [16] Fonteneau R, Korda N, Munos R (2013) An optimistic posterior sampling strategy for bayesian reinforcement learning. Proc. NIPS Workshop on Bayesian Optimization.Google Scholar
  • [17] Fruit R, Pirotta M, Lazaric A (2018) Near optimal exploration-exploitation in non-communicating markov decision processes. Adv. Neural Inform. Processing Systems 31:31.Google Scholar
  • [18] Fruit R, Pirotta M, Lazaric A (2020) Improved analysis of ucrl2 with empirical bernstein inequality. Preprint, submitted July 10, https://arxiv.org/abs/2007.05456.Google Scholar
  • [19] Grinstead CM, Snell JL (2012) Introduction to Probability (American Mathematical Society, Providence, RI).Google Scholar
  • [20] Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(Apr):1563–1600.Google Scholar
  • [21] Kakade SM, Wang M, Yang LF (2018) Variance reduction methods for sublinear reinforcement learning. Preprint, submitted June 27, https://arxiv.org/abs/1802.09184.Google Scholar
  • [22] Kakade SM, et al. (2003) On the sample complexity of reinforcement learning. PhD thesis, University of London, London.Google Scholar
  • [23] Kaufmann E, Korda N, Munos R (2012) Thompson sampling: An optimal finite time analysis. Proc. Internat. Conf. on Algorithmic Learn. Theory (Springer).Google Scholar
  • [24] Kearns MJ, Singh SP (1999) Finite-sample convergence rates for Q-learning and indirect algorithms. Adv. Neural Inform. Processing Systems 11:996–1002.Google Scholar
  • [25] Kleinberg R, Slivkins A, Upfal E (2008) Multi-armed bandits in metric spaces. Proc. 40th Annual ACM Sympos. on Theory of Comput. (ACM, New York), 681–690.Google Scholar
  • [26] Ortner R (2020) Regret bounds for reinforcement learning via Markov chain concentration. J. Artificial Intelligence Res. 67:115–128.CrossrefGoogle Scholar
  • [27] Osband I, Van Roy B (2017) Why is posterior sampling better than optimism for reinforcement learning. Preprint, submitted June 13, https://arxiv.org/abs/1607.00215.Google Scholar
  • [28] Osband I, Russo D, Van Roy B (2013) (More) efficient reinforcement learning via posterior sampling. Adv. Neural Inform. Processing Systems 26:3003–3011.Google Scholar
  • [29] Osband I, Van Roy B, Wen Z (2016) Generalization and exploration via randomized value functions. Preprint, submitted February 15, https://arxiv.org/abs/1402.0635.Google Scholar
  • [30] Ouyang Y, Gagrani M, Nayyar A, Jain R (2017) Learning unknown Markov decision processes: A Thompson sampling approach. Preprint, submitted September 14, https://arxiv.org/abs/1709.04570.Google Scholar
  • [31] Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, New York).Google Scholar
  • [32] Russo D, Van Roy B (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.LinkGoogle Scholar
  • [33] Russo D, Van Roy B (2016) An information-theoretic analysis of Thompson sampling. J. Machine Learn. Res. 17(4):1–30.Google Scholar
  • [34] Seldin Y, Laviolette F, Cesa-Bianchi N, Shawe-Taylor J, Auer P (2012) PAC-Bayesian inequalities for martingales. IEEE Trans. Inform. Theory 58(12):7086–7093.CrossrefGoogle Scholar
  • [35] Shevtsova IG (2010) An improvement of convergence rate estimates in the Lyapunov theorem. Doklady Math. 82(3):862–864.CrossrefGoogle Scholar
  • [36] Strehl AL, Littman ML (2005) A theoretical analysis of model-based interval estimation. Proc. 22nd Internat. Conf. on Machine Learn. (ACM, New York), 856–863.Google Scholar
  • [37] Strehl AL, Littman ML (2008) An analysis of model-based interval estimation for Markov decision processes. J. Comput. System Sci. 74(8):1309–1331.CrossrefGoogle Scholar
  • [38] Talebi MS, Maillard OA (2018) Variance-aware regret bounds for undiscounted reinforcement learning in mdps. Algorithmic Learning Theory (PMLR), 770–805.Google Scholar
  • [39] Tewari A, Bartlett PL (2008) Optimistic linear programming gives logarithmic regret for irreducible MDPs. Adv. Neural Inform. Processing Systems 20:1505–1512.Google Scholar
  • [40] Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.CrossrefGoogle Scholar
  • [41] Zhang Z, Ji X (2019) Regret minimization for reinforcement learning by evaluating the optimal bias function. Preprint, submitted December 28, https://arxiv.org/abs/1906.05110.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.