Regret Analysis of a Markov Policy Gradient Algorithm for Multiarm Bandits

Neil Walton
Corresponding Author
Neil Walton
[email protected]
https://orcid.org/0000-0002-5241-9765
Durham University Business School, Durham DH1 3LB, United Kingdom
Search for more papers by this author
,
Denis Denisov
Denis Denisov
[email protected]
https://orcid.org/0000-0003-0025-7140
Durham University Business School, Durham DH1 3LB, United Kingdom
Search for more papers by this author

Neil Walton

Corresponding Author

Neil Walton

[email protected]

https://orcid.org/0000-0002-5241-9765

Durham University Business School, Durham DH1 3LB, United Kingdom

Search for more papers by this author

Denis Denisov

[email protected]

https://orcid.org/0000-0003-0025-7140

Durham University Business School, Durham DH1 3LB, United Kingdom

Search for more papers by this author

Published Online:3 Oct 2022https://doi.org/10.1287/moor.2022.1311

References

[1] Abernethy JD, Hazan E, Rakhlin A (2009) Competing in the dark: An efficient algorithm for bandit linear optimization.Google Scholar
[2] Agarwal A, Kakade SM, Lee JD, Mahajan G (2020) Optimality and approximation with policy gradient methods in Markov decision processes. Abernethy J, Agarwal S, eds. Proc. 33rd Conf. Learn. Theory, vol. 125 (PMLR), 64–66.Google Scholar
[3] Agrawal R (1995) Sample mean based index policies by o(log n) regret for the multi-armed bandit problem. Adv. Appl. Probab. 27(4):1054–1078.Crossref, Google Scholar
[4] Agrawal S, Goyal N (2012) Analysis of Thompson sampling for the multi-armed bandit problem. Proc. 25th Annual Conf. Learn. Theory, vol. 23, 39.1–39.26.Google Scholar
[5] Anscombe F (1963) Sequential medical trials. J. Amer. Statist. Assoc. 58(302):365–383.Crossref, Google Scholar
[6] Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2–3):235–256.Crossref, Google Scholar
[7] Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (1995) Gambling in a rigged casino: The adversarial multi-armed bandit problem. Proc. IEEE 36th Annual Foundations Comput. Sci. (IEEE), 322–331.Google Scholar
[8] Babichev D, Bach F (2018) Constant step size stochastic gradient descent for probabilistic modeling. Proc. UAI 2018 Conf. Uncertainty Artificial Intelligence (Association for Uncertainty in Artificial Intelligence).Google Scholar
[9] Beck CL, Srikant R (2012) Error bounds for constant step-size q-learning. Systems Control Lett. 61(12):1203–1208.Crossref, Google Scholar
[10] Bertsimas D, Tsitsiklis J (1993) Simulated annealing. Statist. Sci. 8(1):10–15.Crossref, Google Scholar
[11] Bhandari J, Russo D (2019) Global optimality guarantees for policy gradient methods. Preprint, submitted June 5, https://arxiv.org/abs/1906.01786.Google Scholar
[12] Bingham NH, Goldie CM, Teugels JL (1987) Regular variation. Encyclopedia of Mathematics and its Applications (Cambridge University Press).Crossref, Google Scholar
[13] Bouneffouf D, Rish I, Aggarwal C (2020) Survey on applications of multi-armed and contextual bandits. 2020 IEEE Congress Evolutionary Comput. (CEC) (IEEE), 1–8.Google Scholar
[14] Bubeck S, Cesa-Bianchi N (2012) Regret analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems, Foundations and Trends in Machine Learning, vol. 5.Google Scholar
[15] Denisov D, Korshunov D, Wachtel V (2016) At the edge of criticality: Markov chains with asymptotically zero drift. Preprint, submitted December 5, https://arxiv.org/abs/1612.01592.Google Scholar
[16] Denisov D, Korshunov D, Wachtel V (2020) Renewal theory for transient Markov chains with asymptotically zero drift. Trans. Amer. Math. Soc. 373:7253–7286.Crossref, Google Scholar
[17] Dieuleveut A, Durmus A, Bach F (2020) Bridging the gap between constant step size stochastic gradient descent and Markov chains. Ann. Statist. 48(3):1348–1382.Google Scholar
[18] Dupuis P, Williams RJ (1994) Lyapunov functions for semimartingale reflecting brownian motions. Ann. Probab. 22(2):680–702.Crossref, Google Scholar
[19] Feller W (1971) An Introduction to Probability Theory and its Applications, Wiley Mathematical Statistics Series, vol. 2 (Wiley, New York).Google Scholar
[20] Gittins J, Glazebrook K, Weber R (2011) Multi-armed Bandit Allocation Indices (Wiley).Crossref, Google Scholar
[21] Hajek B (1986) Optimization by Simulated Annealing: A Necessary and Sufficient Condition for Convergence, Lecture Notes–Monograph Series, vol. 8 (Institute of Mathematical Statistics, Hayward, CA).Google Scholar
[22] Hazan E (2016) Introduction to Online Convex Optimization, Foundations and Trends in Optimization, vol. 2.Google Scholar
[23] Kaufmann E, Korda N, Munos R (2012) Thompson sampling: An asymptotically optimal finite-time analysis. Proc. Internat. Conf. Algorithmic Learning Theory (Springer, New York), 199–213.Google Scholar
[24] Kushner H (2013) Heavy Traffic Analysis of Controlled Queueing and Communication Networks, vol. 47 (Springer Science & Business Media).Google Scholar
[25] Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.Crossref, Google Scholar
[26] Lamperti J (1960) Criteria for the recurrence or transience of stochastic process I. J. Math. Anal. Appl. 1(3):314–330.Crossref, Google Scholar
[27] Lamperti J (1963) Criteria for stochastic processes II: Passage-time moments. J. Math. Anal. Appl. 7(1):127–145.Crossref, Google Scholar
[28] Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).Google Scholar
[29] Mei J, Xiao C, Szepesvari C, Schuurmans D (2020) On the global convergence rates of softmax policy gradient methods. Proc. Internat. Conf. Machine Learn. (PMLR), 6820–6829.Google Scholar
[30] Meyn SP, Tweedie RL (2012) Markov Chains and Stochastic Stability. (Springer Science & Business Media, New York).Google Scholar
[31] Seldin Y, Szepesvári C, Auer P, Abbasi-Yadkori Y (2012) Evaluation and analysis of the performance of the exp3 algorithm in stochastic environments. EWRL, 103–116.Google Scholar
[32] Srikant R, Ying L (2019) Finite-time error bounds for linear stochastic approximation and td learning. Proc. Conf. Learning Theory (PMLR), 2803–2830.Google Scholar
[33] Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
[34] Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.Crossref, Google Scholar
[35] Williams D (1991) Probability with Martingales, Cambridge Mathematical Textbooks (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
[36] Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learn. 8(3–4):229–256.Crossref, Google Scholar
[37] Zhang J, Kim J, O’Donoghue B, Boyd S (2021) Sample efficient reinforcement learning with REINFORCE. Proc. AAAI Conf. Artificial Intelligence 35(12):10887–10895.Google Scholar

cover image Mathematics of Operations Research

Volume 48, Issue 3

August 2023

Pages 1213-1809, C2

Article Information

Metrics

Information

Received:August 05, 2020
Accepted:August 17, 2022
Published Online:October 03, 2022

Cite as

Neil Walton, Denis Denisov (2022) Regret Analysis of a Markov Policy Gradient Algorithm for Multiarm Bandits. Mathematics of Operations Research 48(3):1553-1588.

https://doi.org/10.1287/moor.2022.1311

Keywords

Acknowledgments

Bandits is a new area for both authors, so they are grateful to Tor Lattimore for references, comments, and suggestions on the positioning of this work. They are grateful to the anonymous referee who suggested the average version of SAMBA considered in Theorem 3.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Regret Analysis of a Markov Policy Gradient Algorithm for Multiarm Bandits

References

Volume 48, Issue 3

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News