Explore First, Exploit Next: The True Shape of Regret in Bandit Problems

Published Online:https://doi.org/10.1287/moor.2017.0928

References

  • Ali SM, Silvey SD (1966) A general class of coefficients of divergence of one distribution from another. J. Roy. Statist. Soc. Ser. B. Methodological 28(1):131–142.Google Scholar
  • Auer P, Ortner R (2010) UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica 61(1):55–65.CrossrefGoogle Scholar
  • Auer P, Cesa-Bianchi N, Fischer P (2002a) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2–3):235–256.CrossrefGoogle Scholar
  • Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2002b) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.CrossrefGoogle Scholar
  • Bubeck S (2010) Bandits games and clustering foundations. PhD thesis, Université Lille 1, Villeneuve-d’Ascq, France.Google Scholar
  • Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.CrossrefGoogle Scholar
  • Bubeck S, Perchet V, Rigollet P (2013a) Bounded regret in stochastic multi-armed bandits. Shaley-Shwartz S, Steinwart I, eds. Proc. 26th Annual Conf. Learn. Theory (COLT), JMLR W&CP, Vol. 30, 122–134.Google Scholar
  • Bubeck S, Perchet V, Rigollet P (2013b) Erratum to Bubeck et al. [7]. http://sbubeck.com/pub.html, accessed July 13, 2018: “The proof of Theorem 8 is not correct. We do not know if the theorem holds true.”Google Scholar
  • Burnetas AN, Katehakis MN (1996) Optimal adaptive policies for sequential allocation problems. Adv. Appl. Math. 17(2):122–142.CrossrefGoogle Scholar
  • Calabro C (2009) The exponential complexity of satisfiability problems. PhD thesis, University of California, San Diego.Google Scholar
  • Cappé O, Garivier A, Maillard O-A, Munos R, Stoltz G (2013) Kullback–Leibler upper confidence bounds for optimal sequential allocation. Ann. Statist. 41(3):1516–1541.CrossrefGoogle Scholar
  • Cesa-Bianchi N, Lugosi G (2006) Prediction, Learning, and Games (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • Combes R, Proutière A (2014) Unimodal bandits without smoothness. arXiv preprint: ArXiv:1406.7447.Google Scholar
  • Cowan W, Katehakis MN (2015) Asymptotically optimal sequential experimentation under generalized ranking. arXiv preprint: ArXiv:1510.02041.Google Scholar
  • Faure M, Gaillard P, Gaujal B, Perchet V (2015) Online learning and game theory. A quick overview with recent results and applications. ESAIM: Proc. Surveys 51:246–271.CrossrefGoogle Scholar
  • Garivier A, Kaufmann E, Lattimore T (2016) On explore-then-commit strategies. Lee DD, Sugiyama M, von Luxburg U, Guyon I, Garnett R, eds. Advances in Neural Information Processing Systems 29 (NIPS 2016) (Curran Associates, Red Hook, NY), 784–792.Google Scholar
  • Honda J, Takemura A (2015) Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards. J. Machine Learn. Res. 16(December):3721–3756.Google Scholar
  • Jiang C (2015) Online advertisements and multi-armed bandits. PhD thesis, University of Illinois at Urbana–Champaign, Champaign.Google Scholar
  • Kaufmann E, Cappé O, Garivier A (2016) On the complexity of best arm identfication in multi-armed bandit models. J. Machine Learn. Res. 17(1):1–42.Google Scholar
  • Kulkarni S, Lugosi G (2000) Finite-time lower bounds for the two-armed bandit problem. IEEE Trans. Automatic Control 45(4):711–714.CrossrefGoogle Scholar
  • Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.CrossrefGoogle Scholar
  • Lehmann EL, Casella G (1998) Theory of Point Estimation (Springer, New York).Google Scholar
  • Mannor S, Tsitsiklis JN (2004) The sample complexity of exploration in the multi-armed bandit problem. J. Machine Learn. Res.5(June):623–648.Google Scholar
  • Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.CrossrefGoogle Scholar
  • Wu Y, György A, Szepesvari C (2015) Online learning with Gaussian payoffs and side observations. Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R, eds. Advances in Neural Information Processing Systems 28 (NIPS 2015) (Curran Associates, Red Hook, NY), 1360–1368.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.