Regret Analysis of a Markov Policy Gradient Algorithm for Multiarm Bandits
References
- [1] (2009) Competing in the dark: An efficient algorithm for bandit linear optimization.Google Scholar
- [2] (2020) Optimality and approximation with policy gradient methods in Markov decision processes. Abernethy J, Agarwal S, eds. Proc. 33rd Conf. Learn. Theory, vol. 125 (PMLR), 64–66.Google Scholar
- [3] (1995) Sample mean based index policies by o(log n) regret for the multi-armed bandit problem. Adv. Appl. Probab. 27(4):1054–1078.Crossref, Google Scholar
- [4] (2012) Analysis of Thompson sampling for the multi-armed bandit problem. Proc. 25th Annual Conf. Learn. Theory, vol. 23, 39.1–39.26.Google Scholar
- [5] (1963) Sequential medical trials. J. Amer. Statist. Assoc. 58(302):365–383.Crossref, Google Scholar
- [6] (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2–3):235–256.Crossref, Google Scholar
- [7] (1995) Gambling in a rigged casino: The adversarial multi-armed bandit problem. Proc. IEEE 36th Annual Foundations Comput. Sci. (IEEE), 322–331.Google Scholar
- [8] (2018) Constant step size stochastic gradient descent for probabilistic modeling. Proc. UAI 2018 Conf. Uncertainty Artificial Intelligence (Association for Uncertainty in Artificial Intelligence).Google Scholar
- [9] (2012) Error bounds for constant step-size q-learning. Systems Control Lett. 61(12):1203–1208.Crossref, Google Scholar
- [10] (1993) Simulated annealing. Statist. Sci. 8(1):10–15.Crossref, Google Scholar
- [11] (2019) Global optimality guarantees for policy gradient methods. Preprint, submitted June 5, https://arxiv.org/abs/1906.01786.Google Scholar
- [12] (1987) Regular variation. Encyclopedia of Mathematics and its Applications (Cambridge University Press).Crossref, Google Scholar
- [13] , Aggarwal C (2020) Survey on applications of multi-armed and contextual bandits. 2020 IEEE Congress Evolutionary Comput. (CEC) (IEEE), 1–8.Google Scholar
- [14] (2012) Regret analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems, Foundations and Trends in Machine Learning, vol. 5.Google Scholar
- [15] (2016) At the edge of criticality: Markov chains with asymptotically zero drift. Preprint, submitted December 5, https://arxiv.org/abs/1612.01592.Google Scholar
- [16] (2020) Renewal theory for transient Markov chains with asymptotically zero drift. Trans. Amer. Math. Soc. 373:7253–7286.Crossref, Google Scholar
- [17] (2020) Bridging the gap between constant step size stochastic gradient descent and Markov chains. Ann. Statist. 48(3):1348–1382.Google Scholar
- [18] (1994) Lyapunov functions for semimartingale reflecting brownian motions. Ann. Probab. 22(2):680–702.Crossref, Google Scholar
- [19] (1971) An Introduction to Probability Theory and its Applications, Wiley Mathematical Statistics Series, vol. 2 (Wiley, New York).Google Scholar
- [20] (2011) Multi-armed Bandit Allocation Indices (Wiley).Crossref, Google Scholar
- [21] (1986) Optimization by Simulated Annealing: A Necessary and Sufficient Condition for Convergence, Lecture Notes–Monograph Series, vol. 8 (Institute of Mathematical Statistics, Hayward, CA).Google Scholar
- [22] (2016) Introduction to Online Convex Optimization, Foundations and Trends in Optimization, vol. 2.Google Scholar
- [23] (2012) Thompson sampling: An asymptotically optimal finite-time analysis. Proc. Internat. Conf. Algorithmic Learning Theory (Springer, New York), 199–213.Google Scholar
- [24] (2013) Heavy Traffic Analysis of Controlled Queueing and Communication Networks, vol. 47 (Springer Science & Business Media).Google Scholar
- [25] (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.Crossref, Google Scholar
- [26] (1960) Criteria for the recurrence or transience of stochastic process I. J. Math. Anal. Appl. 1(3):314–330.Crossref, Google Scholar
- [27] (1963) Criteria for stochastic processes II: Passage-time moments. J. Math. Anal. Appl. 7(1):127–145.Crossref, Google Scholar
- [28] (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).Google Scholar
- [29] (2020) On the global convergence rates of softmax policy gradient methods. Proc. Internat. Conf. Machine Learn. (PMLR), 6820–6829.Google Scholar
- [30] (2012) Markov Chains and Stochastic Stability. (Springer Science & Business Media, New York).Google Scholar
- [31] (2012) Evaluation and analysis of the performance of the exp3 algorithm in stochastic environments. EWRL, 103–116.Google Scholar
- [32] (2019) Finite-time error bounds for linear stochastic approximation and td learning. Proc. Conf. Learning Theory (PMLR), 2803–2830.Google Scholar
- [33] (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
- [34] (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.Crossref, Google Scholar
- [35] (1991) Probability with Martingales, Cambridge Mathematical Textbooks (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
- [36] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learn. 8(3–4):229–256.Crossref, Google Scholar
- [37] (2021) Sample efficient reinforcement learning with REINFORCE. Proc. AAAI Conf. Artificial Intelligence 35(12):10887–10895.Google Scholar

