Simple Bayesian Algorithms for Best-Arm Identification

Daniel Russo
Daniel Russo
[email protected]
http://orcid.org/0000-0001-5926-8624
Columbia University, New York, New York 10027
Search for more papers by this author

Columbia University, New York, New York 10027

Search for more papers by this author

Published Online:16 Apr 2020https://doi.org/10.1287/opre.2019.1911

References

Agrawal S, Goyal N (2012) Analysis of Thompson sampling for the multi-armed bandit problem. Mannor S, Srebro N, Williamson RC, eds. Proc. 21st Annual Conf. Learning Theory, Proceedings of Machine Learning Research, vol. 23 (PMLR), 39.1–39.26.Google Scholar
Albert AE (1961) The sequential design of experiments for infinitely many states of nature. Ann. Math. Statist. 32(3):774–799.Crossref, Google Scholar
Audibert JY, Bubeck S, Munos R (2010) Best arm identification in multi-armed bandits. Kalai AT, Mohri M, eds. COLT 23rd Conf. Learning Theory (Omnipress, Madison, WI), 41–53.Google Scholar
Barron A, Schervish MJ, Wasserman L (1999) The consistency of posterior distributions in nonparametric problems. Ann. Statist. 27(2):536–561.Crossref, Google Scholar
Bechhofer RE, Sobel M(1954) A single-sample multiple decision procedure for ranking means of normal populations with known variances. Ann. Math. Statist. 25(2):16–39.Crossref, Google Scholar
Berry DA (2004) Bayesian statistics and the efficiency and ethics of clinical trials. Statist. Sci. 19(1):175–187.Crossref, Google Scholar
Bubeck S, Munos R, Stoltz G (2009) Pure exploration in multi-armed bandits problems. Algorithmic Learning Theory, Lecture Notes in Computer Science, vol. 5809 (Springer, Berlin), 23–37.Google Scholar
Chapelle O, Li L (2011) An empirical evaluation of Thompson sampling. Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, Weinberger KQ, eds. Proc. Adv. Neural Inform. Processing Systems NIPS 2011 (Curran Associates, Red Hook, NY), 2249–2257.Google Scholar
Chen CH, Chick SE, Lee LH, Pujowidianto NA (2015) Ranking and selection: Efficient simulation budget allocation. Fu MC, ed. Handbook of Simulation Optimization, International Series in Operations Research & Management Science, vol. 216 (Springer, New York), 45–80.Google Scholar
Chen CH, He D, Fu M, Lee LH (2008) Efficient simulation budget allocation for selecting an optimal subset. INFORMS J. Comput. 20(4):579–595.Link, Google Scholar
Chen CH, Lin J, Yücesan E, Chick SE (2000) Simulation budget allocation for further enhancing the efficiency of ordinal optimization. Discrete Event Dynam. Systems 10(3):251–270.Crossref, Google Scholar
Chernoff H (1959) Sequential design of experiments. Ann. Math. Statist. 30(3):755–770.Crossref, Google Scholar
Chernoff H (1975) Approaches in sequential design of experiments. Srivastava JN, ed. A Survey of Statistical Design and Linear Models (North-Holland, Amsterdam), 67–90.Google Scholar
Chick SE, Frazier P (2012) Sequential sampling with economics of selection procedures. Management Sci. 58(3):550–569.Link, Google Scholar
Chick SE, Gans N (2009) Economic analysis of simulation selection problems. Management Sci. 55(3):421–437.Link, Google Scholar
Chick SE, Inoue K (2001) New two-stage and sequential procedures for selecting the best simulated system. Oper. Res. 49(5):732–743.Link, Google Scholar
Chick SE, Branke J, Schmidt C (2010) Sequential sampling to myopically maximize the expected value of information. INFORMS J. Comput. 22(1):71–80.Link, Google Scholar
Diaconis P, Freedman D (1986) On the consistency of Bayes estimates. Ann. Statist. 14(1):1–26.Crossref, Google Scholar
Even-Dar E, Mannor S, Mansour Y (2002) PAC bounds for multi-armed bandit and Markov decision processes. Kivinen J, Sloan RH, eds. COLT ‘02 Proc. 15th Annual Conf. Comput. Learn. Theory, Lecture Notes in Computer Science, vol. 2375 (Springer, Berlin), 255–270.Google Scholar
Fan W, Hong LJ, Nelson BL (2016) Indifference-zone-free selection of the best. Oper. Res. 64(6):1499–1514.Link, Google Scholar
Ferreira KJ, Simchi-Levi D, Wang H (2018) Online network revenue management using Thompson sampling. Oper. Res. 66(6):1586–1602.Link, Google Scholar
Frazier PI (2014) A fully sequential elimination procedure for indifference-zone ranking and selection with tight bounds on probability of correct selection. Oper. Res. 62(4):926–942.Link, Google Scholar
Frazier PI, Powell WB, Dayanik S (2008) A knowledge-gradient policy for sequential information collection. SIAM J. Control Optim. 47(5):2410–2439.Crossref, Google Scholar
Freedman DA (1963) On the asymptotic behavior of Bayes’ estimates in the discrete case. Ann. Math. Statist. 34(4):1386–1403.Crossref, Google Scholar
Gabillon V, Ghavamzadeh M, Lazaric A (2012) Best arm identification: A unified approach to fixed budget and fixed confidence. Pereira F, Burges CJC, Bottou L, Weinberger KQ, eds. NIPS’12 Proc. 25th Internat. Conf. Neural Inform. Processing Systems, vol. 2 (Curran Associates, Red Hook, NY), 3212–3220.Google Scholar
Garivier A, Kaufmann E (2016) Optimal best arm identification with fixed confidence. Feldman V, Rakhlin A, Shamir O, eds. Proc. Conf. Learn. Theory (COLT) 2016, Proceedings of Machine Learning Research, vol. 49 (PMLR), 1028–1050.Google Scholar
Ghosal S, Ghosh JK, Van Der Vaart AW (2000) Convergence rates of posterior distributions. Ann. Statist. 28(2):500–531.Crossref, Google Scholar
Gittins JC (1979) Bandit processes and dynamic allocation indices. J. Roy. Statist. Soc. B. 41(2):148–164.Crossref, Google Scholar
Gittins JC, Jones DM (1974) A dynamic allocation index for the sequential design of experiments. Gani J, ed., Progress in Statistics (North-Holland, Amsterdam), 241–266.Google Scholar
Glynn P, Juneja S (2004) A large deviations perspective on ordinal optimization. Ingalls RG, Rossetti MD, Smith JS, Peters BA, eds. Proc. 2004 Winter Simulation Conf. 2004, vol. 1 (IEEE, Piscataway, NJ).Google Scholar
Glynn P, Juneja S (2015) Selecting the best system and multi-armed bandits. Preprint, submitted July 16, https://arxiv.org/abs/1507.04564.Google Scholar
Gopalan A, Mannor S, Mansour Y (2014) Thompson sampling for complex online problems. Xing EP, Jebara T, eds. ICML 14 Proc. 31st Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 32 (PMLR), 100–108.Google Scholar
Graepel T, Candela JQ, Borchert T, Herbrich R (2010) Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine. Fürnkranz J, Joachims T, eds. Proc. 27th Internat. Conf. Machine Learn. (ICML-10) (Omnipress, Madison, WI), 13–20.Google Scholar
Gupta SS, Miescke KJ (1996) Bayesian look ahead one-stage sampling allocations for selection of the best population. J. Statist. Planning Inference 54(2):229–244.Crossref, Google Scholar
Gutin E, Farias VF (2016) Optimistic Gittins indices. Lee DD, von Luxburg U, Garnett R, Sugiyama M, Guyon I, eds. NIPS’16 Proc. 30th Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Red Hook, NY), 3161–3169.Google Scholar
Hong LJ, Nelson BL, Xu J (2015) Discrete optimization via simulation. Fu MC, ed. Handbook of Simulation Optimization, International Series in Operations Research and Management Science, vol. 216 (Springer, New York), 9–44.Google Scholar
Hunter S, Pasupathy R (2013) Optimal sampling laws for stochastically constrained simulation optimization on finite sets. INFORMS J. Comput. 25(3):527–542.Link, Google Scholar
Jamieson K, Nowak R (2014) Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. 48th Annual Conf. Inform. Sci. Systems (CISS) (IEEE, Piscataway, NJ), 1–6.Google Scholar
Jennison C, Johnstone IM, Turnbull BW (1982) Asymptotically optimal procedures for sequential adaptive selection of the best of several normal means. Gupta SS, Berger JO, eds. Statistical Decision Theory and Related Topics III (Academic Press, New York), 55–86.Google Scholar
Karnin Z, Koren T, Somekh O (2013) Almost optimal exploration in multi-armed bandits. Dasgupta S, McAllester D, eds. Proc. 30th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 28 (PMLR), 1238–1246.Google Scholar
Kaufmann E (2018) On Bayesian index policies for sequential resource allocation. Ann. Statist. 46(2):842–865.Crossref, Google Scholar
Kaufmann E, Kalyanakrishnan S (2013) Information complexity in bandit subset selection. J. Machine Learn. Res. 30:228–251.Google Scholar
Kaufmann E, Koolen W (2018) Mixture Martingales revisited with applications to sequential tests and confidence intervals. Preprint, submitted November 28, https://arxiv.org/abs/1811.11419. Google Scholar
Kaufmann E, Cappé O, Garivier A (2012) On Bayesian upper confidence bounds for bandit problems. Lawrence ND, Girolami MA, eds. Proc. 15th Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research, vol. 22 (PMLR), 592–600.Google Scholar
Kaufmann E, Cappé O, Garivier A (2014) On the complexity of best-arm identification in multi-armed bandit models. J. Machine Learning Res. 17.1(2016):1–42.Google Scholar
Kaufmann E, Korda N, Munos R (2012) Thompson sampling: An asymptotically optimal finite time analysis. Bshouty NH, Stoltz G, Vayatis N, Zeugmann T, eds. Internat. Conf. Algorithmic Learn. Theory, Lecture Notes in Computer Science, vol. 7568 (Springer, Berlin), 199–213.Google Scholar
Keener R (1984) Second order efficiency in the sequential design of experiments. Ann. Statist. 12(2):510–532.Crossref, Google Scholar
Kiefer J, Sacks J (1963) Asymptotically optimum sequential inference and design. Ann. Math. Statist. 34(3):705–750.Crossref, Google Scholar
Kim S-H, Nelson BL (2006) Selecting the best system. Henderson SG, Nelson BL, eds. Simulation, Handbooks in Operations Research and Management Science, vol. 13 (Elsevier, Amsterdam), 501–534.Google Scholar
Kim S-H, Nelson BL (2007) Recent advances in ranking and selection. Henderson S, Biller B, Hsieh M-h, Shortle J, eds. Proc. 39th Conf. Winter Simulation: 40 Years! Best Is Yet to Come (IEEE, Piscataway, NJ), 162–172.Google Scholar
Korda N, Kaufmann E, Munos R (2013) Thompson sampling for one-dimensional exponential family bandits. Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. NIPS’13 Proc. 26th Internat. Conf. Neural Inform. Processing Systems, vol. 1 (Curran Associates, Red Hook, NY), 1448–1456.Google Scholar
Lai T, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.Crossref, Google Scholar
Mannor S, Tsitsiklis JN (2004) The sample complexity of exploration in the multi-armed bandit problem. J. Machine Learn. Res. 5:623–648.Google Scholar
Naghshvar M, Javidi T (2013) Active sequential hypothesis testing. Ann. Statist. 41(6):2703–2738.Crossref, Google Scholar
Ni EC, Ciocan DF, Henderson SG, Hunter SR (2017) Efficient ranking and selection in parallel computing environments. Oper. Res. 65(3):821–836.Link, Google Scholar
Nitinawarat S, Atia GK, Veeravalli VV (2013) Controlled sensing for multihypothesis testing. IEEE Trans. Automat. Control. 58(10):2451–2464.Crossref, Google Scholar
Pasupathy R, Hunter SR, Pujowidianto NA, Lee LH, Chen C (2015) Stochastically constrained ranking and selection via score. ACM Trans. Model. Comput. Simul. (TOMACS) 25(1):Article 1.Crossref, Google Scholar
Paulson E (1964) A sequential procedure for selecting the population with the largest mean from k normal populations. Ann. Math. Statist. 35(1):174–180.Crossref, Google Scholar
Qin C, Klabjan D, Russo D (2017) Improving the expected improvement algorithm. von Luxburg U, Guyon I, Bengio S, Wallach H, Fergus R, eds. NIPS’17 Proc. 31st Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Red Hook, NY), 5387–5397.Google Scholar
Rinott Y (1978) On two-stage selection procedures and related probability-inequalities. Comm. Statist. Theory Methods 7(8):799–811.Crossref, Google Scholar
Russo D, Van Roy B (2017) Learning to optimize via information-directed sampling. Oper. Res. 66(1):230–252.Link, Google Scholar
Ryzhov IO (2016) On the convergence rates of expected improvement methods. Oper. Res. 64(6):1515–1528.Link, Google Scholar
Ryzhov IO, Powell WB, Frazier PI (2012) The knowledge gradient algorithm for a general class of online learning problems. Oper. Res. 60(1):180–195.Link, Google Scholar
Scott SL (2016) Overview of content experiments: Multi-armed bandit experiments. Accessed November 9, 2016, https://support.google.com/analytics/answer/2844870?hl=en.Google Scholar
Srinivas N, Krause A, Kakade S, Seeger M (2012) Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Trans. Inform. Theory 58(5):3250–3265.Google Scholar
Tang L, Rosales R, Singh A, Agarwal D (2013) Automatic ad format selection via contextual bandits. Proc. 22nd ACM Internat. Conf. Inform. Knowledge Management (ACM, New York), 1587–1594.Google Scholar
Thompson W (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.Crossref, Google Scholar
Villar SS, Bowden J, Wason J (2015) Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Statist. Sci. 30(2):199–215.Crossref, Google Scholar

Volume 68, Issue 6

November-December 2020

Pages ii-iv, 1625-1931, C2

Article Information

Supplemental Material

Metrics

Information

Received:December 12, 2016
Accepted:April 10, 2019
Published Online:April 16, 2020

Cite as

Daniel Russo (2020) Simple Bayesian Algorithms for Best-Arm Identification. Operations Research 68(6):1625-1647.

https://doi.org/10.1287/opre.2019.1911

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Simple Bayesian Algorithms for Best-Arm Identification

References

Volume 68, Issue 6

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News