Explore First, Exploit Next: The True Shape of Regret in Bandit Problems

Aurélien Garivier
Aurélien Garivier
Institute de Mathématiques de Toulouse (IMT): Université Paul Sabatier—The French National Research Center (CNRS), 31062 Toulouse, France
Search for more papers by this author
,
Pierre Ménard
Pierre Ménard
Institute de Mathématiques de Toulouse (IMT): Université Paul Sabatier—The French National Research Center (CNRS), 31062 Toulouse, France
Search for more papers by this author
,
Gilles Stoltz
Corresponding Author
Gilles Stoltz
http://orcid.org/0000-0003-1240-1007
HEC Paris Management Research Group (GREGHEC): HEC Paris—CNRS, 78351 Jouy-en-Josas, France
Search for more papers by this author

Institute de Mathématiques de Toulouse (IMT): Université Paul Sabatier—The French National Research Center (CNRS), 31062 Toulouse, France

Search for more papers by this author

Pierre Ménard

Institute de Mathématiques de Toulouse (IMT): Université Paul Sabatier—The French National Research Center (CNRS), 31062 Toulouse, France

Search for more papers by this author

Gilles Stoltz

Corresponding Author

Gilles Stoltz

http://orcid.org/0000-0003-1240-1007

HEC Paris Management Research Group (GREGHEC): HEC Paris—CNRS, 78351 Jouy-en-Josas, France

Search for more papers by this author

Published Online:20 Sep 2018https://doi.org/10.1287/moor.2017.0928

References

Ali SM, Silvey SD (1966) A general class of coefficients of divergence of one distribution from another. J. Roy. Statist. Soc. Ser. B. Methodological 28(1):131–142.Google Scholar
Auer P, Ortner R (2010) UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica 61(1):55–65.Crossref, Google Scholar
Auer P, Cesa-Bianchi N, Fischer P (2002a) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2–3):235–256.Crossref, Google Scholar
Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2002b) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.Crossref, Google Scholar
Bubeck S (2010) Bandits games and clustering foundations. PhD thesis, Université Lille 1, Villeneuve-d’Ascq, France.Google Scholar
Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.Crossref, Google Scholar
Bubeck S, Perchet V, Rigollet P (2013a) Bounded regret in stochastic multi-armed bandits. Shaley-Shwartz S, Steinwart I, eds. Proc. 26th Annual Conf. Learn. Theory (COLT), JMLR W&CP, Vol. 30, 122–134.Google Scholar
Bubeck S, Perchet V, Rigollet P (2013b) Erratum to Bubeck et al. [7]. http://sbubeck.com/pub.html, accessed July 13, 2018: “The proof of Theorem 8 is not correct. We do not know if the theorem holds true.”Google Scholar
Burnetas AN, Katehakis MN (1996) Optimal adaptive policies for sequential allocation problems. Adv. Appl. Math. 17(2):122–142.Crossref, Google Scholar
Calabro C (2009) The exponential complexity of satisfiability problems. PhD thesis, University of California, San Diego.Google Scholar
Cappé O, Garivier A, Maillard O-A, Munos R, Stoltz G (2013) Kullback–Leibler upper confidence bounds for optimal sequential allocation. Ann. Statist. 41(3):1516–1541.Crossref, Google Scholar
Cesa-Bianchi N, Lugosi G (2006) Prediction, Learning, and Games (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
Combes R, Proutière A (2014) Unimodal bandits without smoothness. arXiv preprint: ArXiv:1406.7447.Google Scholar
Cowan W, Katehakis MN (2015) Asymptotically optimal sequential experimentation under generalized ranking. arXiv preprint: ArXiv:1510.02041.Google Scholar
Faure M, Gaillard P, Gaujal B, Perchet V (2015) Online learning and game theory. A quick overview with recent results and applications. ESAIM: Proc. Surveys 51:246–271.Crossref, Google Scholar
Garivier A, Kaufmann E, Lattimore T (2016) On explore-then-commit strategies. Lee DD, Sugiyama M, von Luxburg U, Guyon I, Garnett R, eds. Advances in Neural Information Processing Systems 29 (NIPS 2016) (Curran Associates, Red Hook, NY), 784–792.Google Scholar
Honda J, Takemura A (2015) Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards. J. Machine Learn. Res. 16(December):3721–3756.Google Scholar
Jiang C (2015) Online advertisements and multi-armed bandits. PhD thesis, University of Illinois at Urbana–Champaign, Champaign.Google Scholar
Kaufmann E, Cappé O, Garivier A (2016) On the complexity of best arm identfication in multi-armed bandit models. J. Machine Learn. Res. 17(1):1–42.Google Scholar
Kulkarni S, Lugosi G (2000) Finite-time lower bounds for the two-armed bandit problem. IEEE Trans. Automatic Control 45(4):711–714.Crossref, Google Scholar
Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.Crossref, Google Scholar
Lehmann EL, Casella G (1998) Theory of Point Estimation (Springer, New York).Google Scholar
Mannor S, Tsitsiklis JN (2004) The sample complexity of exploration in the multi-armed bandit problem. J. Machine Learn. Res.5(June):623–648.Google Scholar
Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.Crossref, Google Scholar
Wu Y, György A, Szepesvari C (2015) Online learning with Gaussian payoffs and side observations. Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R, eds. Advances in Neural Information Processing Systems 28 (NIPS 2015) (Curran Associates, Red Hook, NY), 1360–1368.Google Scholar

cover image Mathematics of Operations Research

Volume 44, Issue 2

May 2019

Pages 377-766, C2

Article Information

Metrics

Information

Received:June 10, 2016
Accepted:December 08, 2017
Published Online:September 20, 2018

Cite as

Aurélien Garivier, Pierre Ménard, Gilles Stoltz (2018) Explore First, Exploit Next: The True Shape of Regret in Bandit Problems. Mathematics of Operations Research 44(2):377-399.

https://doi.org/10.1287/moor.2017.0928

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Explore First, Exploit Next: The True Shape of Regret in Bandit Problems

References

Volume 44, Issue 2

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News