Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds
References
- [1] (2014) Bayesian optimal control of smoothly parameterized systems: The lazy posterior sampling algorithm. Preprint, submitted June 16, https://arxiv.org/abs/1406.3926.Google Scholar
- [2] (1964) Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables, vol. 55 (Courier Corporation).Google Scholar
- [3] (2012) Analysis of Thompson sampling for the multi-armed bandit problem. Mannor S, Srebro N, Williamson RC, eds. Proc. 25th Annual Conf. on Learn. Theory (PMLR), 39.1–39.26.Google Scholar
- [4] Agrawal S, Goyal N (2013a) Further optimal regret bounds for Thompson sampling. Carvalho CM, Ravikumar P, eds. Proc. 16th Internat. Conf. Artificial Intelligence and Statistics (PMLR), 99–107.Google Scholar
- [5] (2013b) Thompson sampling for contextual bandits with linear payoffs. Dasgupta S, McAllester D, eds. Proc. 30th Internat. Conf. on Machine Learn. (PMLR).Google Scholar
- [6] (2012) Dirichlet and multinomial distributions: Properties and uses in jags. Unite Mathematiques et Informatique Appliquees (INRA), 2012–2015.Google Scholar
- [7] (2009) A Bayesian sampling approach to exploration in reinforcement learning. Proc. 25th Conf. on Uncertainty in Artificial Intelligence (AUAI Press), 19–26.Google Scholar
- [8] (2017) Minimax regret bounds for reinforcement learning. Preprint, submitted July 1, https://arxiv.org/abs/1703.05449.Google Scholar
- [9] (2009) REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. Proc. 25th Conf. on Uncertainty in Artificial Intelligence (AUAI Press), 35–42.Google Scholar
- [10] (2002) R-max-a general polynomial time algorithm for near-optimal reinforcement learning. J. Machine Learn. Res. 3(Oct):213–231.Google Scholar
- [11] (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.Crossref, Google Scholar
- [12] (2013) Prior-free and prior-dependent regret bounds for Thompson sampling. Adv. Neural Inform. Processing Systems 26:638–646.Google Scholar
- [13] (1997) Optimal adaptive policies for Markov decision processes. Math. Oper. Res. 22(1):222–255.Link, Google Scholar
- [14] (2011) An empirical evaluation of Thompson sampling. Adv. Neural Inf. Processing Systems 24:2249–2257.Google Scholar
- [15] (2015) Sample complexity of episodic fixed-horizon reinforcement learning. Adv. Neural Inform. Processing Systems 28:2818–2826.Google Scholar
- [16] (2013) An optimistic posterior sampling strategy for bayesian reinforcement learning. Proc. NIPS Workshop on Bayesian Optimization.Google Scholar
- [17] (2018) Near optimal exploration-exploitation in non-communicating markov decision processes. Adv. Neural Inform. Processing Systems 31:31.Google Scholar
- [18] (2020) Improved analysis of ucrl2 with empirical bernstein inequality. Preprint, submitted July 10, https://arxiv.org/abs/2007.05456.Google Scholar
- [19] (2012) Introduction to Probability (American Mathematical Society, Providence, RI).Google Scholar
- [20] (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(Apr):1563–1600.Google Scholar
- [21] (2018) Variance reduction methods for sublinear reinforcement learning. Preprint, submitted June 27, https://arxiv.org/abs/1802.09184.Google Scholar
- [22] (2003) On the sample complexity of reinforcement learning. PhD thesis, University of London, London.Google Scholar
- [23] (2012) Thompson sampling: An optimal finite time analysis. Proc. Internat. Conf. on Algorithmic Learn. Theory (Springer).Google Scholar
- [24] (1999) Finite-sample convergence rates for Q-learning and indirect algorithms. Adv. Neural Inform. Processing Systems 11:996–1002.Google Scholar
- [25] (2008) Multi-armed bandits in metric spaces. Proc. 40th Annual ACM Sympos. on Theory of Comput. (ACM, New York), 681–690.Google Scholar
- [26] (2020) Regret bounds for reinforcement learning via Markov chain concentration. J. Artificial Intelligence Res. 67:115–128.Crossref, Google Scholar
- [27] (2017) Why is posterior sampling better than optimism for reinforcement learning. Preprint, submitted June 13, https://arxiv.org/abs/1607.00215.Google Scholar
- [28] (2013) (More) efficient reinforcement learning via posterior sampling. Adv. Neural Inform. Processing Systems 26:3003–3011.Google Scholar
- [29] (2016) Generalization and exploration via randomized value functions. Preprint, submitted February 15, https://arxiv.org/abs/1402.0635.Google Scholar
- [30] (2017) Learning unknown Markov decision processes: A Thompson sampling approach. Preprint, submitted September 14, https://arxiv.org/abs/1709.04570.Google Scholar
- [31] (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, New York).Google Scholar
- [32] (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.Link, Google Scholar
- [33] (2016) An information-theoretic analysis of Thompson sampling. J. Machine Learn. Res. 17(4):1–30.Google Scholar
- [34] (2012) PAC-Bayesian inequalities for martingales. IEEE Trans. Inform. Theory 58(12):7086–7093.Crossref, Google Scholar
- [35] (2010) An improvement of convergence rate estimates in the Lyapunov theorem. Doklady Math. 82(3):862–864.Crossref, Google Scholar
- [36] (2005) A theoretical analysis of model-based interval estimation. Proc. 22nd Internat. Conf. on Machine Learn. (ACM, New York), 856–863.Google Scholar
- [37] (2008) An analysis of model-based interval estimation for Markov decision processes. J. Comput. System Sci. 74(8):1309–1331.Crossref, Google Scholar
- [38] (2018) Variance-aware regret bounds for undiscounted reinforcement learning in mdps. Algorithmic Learning Theory (PMLR), 770–805.Google Scholar
- [39] (2008) Optimistic linear programming gives logarithmic regret for irreducible MDPs. Adv. Neural Inform. Processing Systems 20:1505–1512.Google Scholar
- [40] (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.Crossref, Google Scholar
- [41] (2019) Regret minimization for reinforcement learning by evaluating the optimal bias function. Preprint, submitted December 28, https://arxiv.org/abs/1906.05110.Google Scholar

