Explore First, Exploit Next: The True Shape of Regret in Bandit Problems
Published Online:20 Sep 2018https://doi.org/10.1287/moor.2017.0928
References
- (1966) A general class of coefficients of divergence of one distribution from another. J. Roy. Statist. Soc. Ser. B. Methodological 28(1):131–142.Google Scholar
- (2010) UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica 61(1):55–65.Crossref, Google Scholar
- (2002a) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2–3):235–256.Crossref, Google Scholar
- (2002b) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.Crossref, Google Scholar
- (2010) Bandits games and clustering foundations. PhD thesis, Université Lille 1, Villeneuve-d’Ascq, France.Google Scholar
- (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.Crossref, Google Scholar
- (2013a) Bounded regret in stochastic multi-armed bandits. Shaley-Shwartz S, Steinwart I, eds. Proc. 26th Annual Conf. Learn. Theory (COLT), JMLR W&CP, Vol. 30, 122–134.Google Scholar
- (2013b) Erratum to Bubeck et al. [7]. http://sbubeck.com/pub.html, accessed July 13, 2018: “The proof of Theorem 8 is not correct. We do not know if the theorem holds true.”Google Scholar
- (1996) Optimal adaptive policies for sequential allocation problems. Adv. Appl. Math. 17(2):122–142.Crossref, Google Scholar
- (2009) The exponential complexity of satisfiability problems. PhD thesis, University of California, San Diego.Google Scholar
- (2013) Kullback–Leibler upper confidence bounds for optimal sequential allocation. Ann. Statist. 41(3):1516–1541.Crossref, Google Scholar
- (2006) Prediction, Learning, and Games (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
- (2014) Unimodal bandits without smoothness. arXiv preprint: ArXiv:1406.7447.Google Scholar
- (2015) Asymptotically optimal sequential experimentation under generalized ranking. arXiv preprint: ArXiv:1510.02041.Google Scholar
- (2015) Online learning and game theory. A quick overview with recent results and applications. ESAIM: Proc. Surveys 51:246–271.Crossref, Google Scholar
- (2016) On explore-then-commit strategies. Lee DD, Sugiyama M, von Luxburg U, Guyon I, Garnett R, eds. Advances in Neural Information Processing Systems 29 (NIPS 2016) (Curran Associates, Red Hook, NY), 784–792.Google Scholar
- (2015) Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards. J. Machine Learn. Res. 16(December):3721–3756.Google Scholar
- (2015) Online advertisements and multi-armed bandits. PhD thesis, University of Illinois at Urbana–Champaign, Champaign.Google Scholar
- (2016) On the complexity of best arm identfication in multi-armed bandit models. J. Machine Learn. Res. 17(1):1–42.Google Scholar
- (2000) Finite-time lower bounds for the two-armed bandit problem. IEEE Trans. Automatic Control 45(4):711–714.Crossref, Google Scholar
- (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.Crossref, Google Scholar
- (1998) Theory of Point Estimation (Springer, New York).Google Scholar
- (2004) The sample complexity of exploration in the multi-armed bandit problem. J. Machine Learn. Res.5(June):623–648.Google Scholar
- (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.Crossref, Google Scholar
- (2015) Online learning with Gaussian payoffs and side observations. Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R, eds. Advances in Neural Information Processing Systems 28 (NIPS 2015) (Curran Associates, Red Hook, NY), 1360–1368.Google Scholar

