Regret Analysis of a Markov Policy Gradient Algorithm for Multiarm Bandits

Neil Walton (Corresponding Author)
[email protected]
https://orcid.org/0000-0002-5241-9765
Durham University Business School, Durham DH1 3LB, United Kingdom

Denis Denisov
[email protected]
https://orcid.org/0000-0003-0025-7140
Durham University Business School, Durham DH1 3LB, United Kingdom

Published Online: 3 Oct 2022
https://doi.org/10.1287/moor.2022.1311

Abstract

We consider a policy gradient algorithm applied to a finite-arm bandit problem with Bernoulli rewards. We allow learning rates to depend on the current state of the algorithm rather than using a deterministic time-decreasing learning rate. The state of the algorithm forms a Markov chain on the probability simplex. We apply Foster–Lyapunov techniques to analyze the stability of this Markov chain. We prove that, if learning rates are well-chosen, then the policy gradient algorithm is a transient Markov chain, and the state of the chain converges on the optimal arm with logarithmic or polylogarithmic regret.
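
The ingredients named in the abstract can be made concrete with a small simulation. The sketch below is not the authors' algorithm; it is a hypothetical, simplified reward-driven update (in the style of linear reward-inaction) chosen only because it keeps the state on the probability simplex by construction. The function names, the step-size rule `alpha * max(p) ** 2`, and the default `alpha` are all assumptions made for illustration. What it does share with the setting described above is that the learning rate depends on the current state `p` rather than on time, so the sequence of states is a Markov chain, and that rewards are Bernoulli.

```python
import random


def step_size(p, alpha=0.1):
    # Hypothetical state-dependent learning rate: a function of the
    # current state p only, with no dependence on the time index.
    return alpha * max(p) ** 2


def update(p, means, rng, alpha=0.1):
    """One transition: play an arm drawn from p, observe a Bernoulli
    reward, and move p toward the played arm if the reward is 1."""
    arm = rng.choices(range(len(p)), weights=p)[0]
    reward = 1 if rng.random() < means[arm] else 0
    if reward:
        eta = step_size(p, alpha)
        # Convex combination of p and the unit vector of the played arm,
        # so the new state stays on the simplex whenever 0 <= eta <= 1.
        p = [(1 - eta) * q + (eta if i == arm else 0.0)
             for i, q in enumerate(p)]
    return p, arm, reward


def run(means, horizon, seed=0, alpha=0.1):
    """Simulate the chain from the uniform state and accumulate the
    pseudo-regret: the sum over rounds of (best mean - played mean)."""
    rng = random.Random(seed)
    p = [1.0 / len(means)] * len(means)
    best = max(means)
    regret = 0.0
    for _ in range(horizon):
        p, arm, _ = update(p, means, rng, alpha)
        regret += best - means[arm]
    return p, regret


if __name__ == "__main__":
    final_state, total_regret = run([0.2, 0.5, 0.8], horizon=5000)
    print(final_state, total_regret)
```

This simplified rule carries none of the guarantees stated in the abstract; in particular, nothing here shows that it reaches the optimal arm or achieves logarithmic regret. It is meant only to show what "the state of the algorithm forms a Markov chain on the probability simplex" looks like in code.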

Mathematics of Operations Research

Volume 48, Issue 3

August 2023

Pages 1213-1809, C2

Article Information

Received: August 05, 2020
Accepted: August 17, 2022
Published Online: October 03, 2022

Cite as

Neil Walton, Denis Denisov (2022) Regret Analysis of a Markov Policy Gradient Algorithm for Multiarm Bandits. Mathematics of Operations Research 48(3):1553-1588.

https://doi.org/10.1287/moor.2022.1311

Acknowledgments

Bandits is a new area for both authors, so they are grateful to Tor Lattimore for references, comments, and suggestions on the positioning of this work. They are grateful to the anonymous referee who suggested the average version of SAMBA considered in Theorem 3.
