Regret Analysis of a Markov Policy Gradient Algorithm for Multiarm Bandits

Published Online:https://doi.org/10.1287/moor.2022.1311

References

  • [1] Abernethy JD, Hazan E, Rakhlin A (2009) Competing in the dark: An efficient algorithm for bandit linear optimization.Google Scholar
  • [2] Agarwal A, Kakade SM, Lee JD, Mahajan G (2020) Optimality and approximation with policy gradient methods in Markov decision processes. Abernethy J, Agarwal S, eds. Proc. 33rd Conf. Learn. Theory, vol. 125 (PMLR), 64–66.Google Scholar
  • [3] Agrawal R (1995) Sample mean based index policies by o(log n) regret for the multi-armed bandit problem. Adv. Appl. Probab. 27(4):1054–1078.CrossrefGoogle Scholar
  • [4] Agrawal S, Goyal N (2012) Analysis of Thompson sampling for the multi-armed bandit problem. Proc. 25th Annual Conf. Learn. Theory, vol. 23, 39.1–39.26.Google Scholar
  • [5] Anscombe F (1963) Sequential medical trials. J. Amer. Statist. Assoc. 58(302):365–383.CrossrefGoogle Scholar
  • [6] Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2–3):235–256.CrossrefGoogle Scholar
  • [7] Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (1995) Gambling in a rigged casino: The adversarial multi-armed bandit problem. Proc. IEEE 36th Annual Foundations Comput. Sci. (IEEE), 322–331.Google Scholar
  • [8] Babichev D, Bach F (2018) Constant step size stochastic gradient descent for probabilistic modeling. Proc. UAI 2018 Conf. Uncertainty Artificial Intelligence (Association for Uncertainty in Artificial Intelligence).Google Scholar
  • [9] Beck CL, Srikant R (2012) Error bounds for constant step-size q-learning. Systems Control Lett. 61(12):1203–1208.CrossrefGoogle Scholar
  • [10] Bertsimas D, Tsitsiklis J (1993) Simulated annealing. Statist. Sci. 8(1):10–15.CrossrefGoogle Scholar
  • [11] Bhandari J, Russo D (2019) Global optimality guarantees for policy gradient methods. Preprint, submitted June 5, https://arxiv.org/abs/1906.01786.Google Scholar
  • [12] Bingham NH, Goldie CM, Teugels JL (1987) Regular variation. Encyclopedia of Mathematics and its Applications (Cambridge University Press).CrossrefGoogle Scholar
  • [13] Bouneffouf D, Rish I, Aggarwal C (2020) Survey on applications of multi-armed and contextual bandits. 2020 IEEE Congress Evolutionary Comput. (CEC) (IEEE), 1–8.Google Scholar
  • [14] Bubeck S, Cesa-Bianchi N (2012) Regret analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems, Foundations and Trends in Machine Learning, vol. 5.Google Scholar
  • [15] Denisov D, Korshunov D, Wachtel V (2016) At the edge of criticality: Markov chains with asymptotically zero drift. Preprint, submitted December 5, https://arxiv.org/abs/1612.01592.Google Scholar
  • [16] Denisov D, Korshunov D, Wachtel V (2020) Renewal theory for transient Markov chains with asymptotically zero drift. Trans. Amer. Math. Soc. 373:7253–7286.CrossrefGoogle Scholar
  • [17] Dieuleveut A, Durmus A, Bach F (2020) Bridging the gap between constant step size stochastic gradient descent and Markov chains. Ann. Statist. 48(3):1348–1382.Google Scholar
  • [18] Dupuis P, Williams RJ (1994) Lyapunov functions for semimartingale reflecting brownian motions. Ann. Probab. 22(2):680–702.CrossrefGoogle Scholar
  • [19] Feller W (1971) An Introduction to Probability Theory and its Applications, Wiley Mathematical Statistics Series, vol. 2 (Wiley, New York).Google Scholar
  • [20] Gittins J, Glazebrook K, Weber R (2011) Multi-armed Bandit Allocation Indices (Wiley).CrossrefGoogle Scholar
  • [21] Hajek B (1986) Optimization by Simulated Annealing: A Necessary and Sufficient Condition for Convergence, Lecture Notes–Monograph Series, vol. 8 (Institute of Mathematical Statistics, Hayward, CA).Google Scholar
  • [22] Hazan E (2016) Introduction to Online Convex Optimization, Foundations and Trends in Optimization, vol. 2.Google Scholar
  • [23] Kaufmann E, Korda N, Munos R (2012) Thompson sampling: An asymptotically optimal finite-time analysis. Proc. Internat. Conf. Algorithmic Learning Theory (Springer, New York), 199–213.Google Scholar
  • [24] Kushner H (2013) Heavy Traffic Analysis of Controlled Queueing and Communication Networks, vol. 47 (Springer Science & Business Media).Google Scholar
  • [25] Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.CrossrefGoogle Scholar
  • [26] Lamperti J (1960) Criteria for the recurrence or transience of stochastic process I. J. Math. Anal. Appl. 1(3):314–330.CrossrefGoogle Scholar
  • [27] Lamperti J (1963) Criteria for stochastic processes II: Passage-time moments. J. Math. Anal. Appl. 7(1):127–145.CrossrefGoogle Scholar
  • [28] Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).Google Scholar
  • [29] Mei J, Xiao C, Szepesvari C, Schuurmans D (2020) On the global convergence rates of softmax policy gradient methods. Proc. Internat. Conf. Machine Learn. (PMLR), 6820–6829.Google Scholar
  • [30] Meyn SP, Tweedie RL (2012) Markov Chains and Stochastic Stability. (Springer Science & Business Media, New York).Google Scholar
  • [31] Seldin Y, Szepesvári C, Auer P, Abbasi-Yadkori Y (2012) Evaluation and analysis of the performance of the exp3 algorithm in stochastic environments. EWRL, 103–116.Google Scholar
  • [32] Srikant R, Ying L (2019) Finite-time error bounds for linear stochastic approximation and td learning. Proc. Conf. Learning Theory (PMLR), 2803–2830.Google Scholar
  • [33] Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
  • [34] Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.CrossrefGoogle Scholar
  • [35] Williams D (1991) Probability with Martingales, Cambridge Mathematical Textbooks (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • [36] Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learn. 8(3–4):229–256.CrossrefGoogle Scholar
  • [37] Zhang J, Kim J, O’Donoghue B, Boyd S (2021) Sample efficient reinforcement learning with REINFORCE. Proc. AAAI Conf. Artificial Intelligence 35(12):10887–10895.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.