Simple Bayesian Algorithms for Best-Arm Identification
Published Online:16 Apr 2020https://doi.org/10.1287/opre.2019.1911
References
- (2012) Analysis of Thompson sampling for the multi-armed bandit problem. Mannor S, Srebro N, Williamson RC, eds. Proc. 21st Annual Conf. Learning Theory, Proceedings of Machine Learning Research, vol. 23 (PMLR), 39.1–39.26.Google Scholar
- (1961) The sequential design of experiments for infinitely many states of nature. Ann. Math. Statist. 32(3):774–799.Crossref, Google Scholar
- , Munos R (2010) Best arm identification in multi-armed bandits. Kalai AT, Mohri M, eds. COLT 23rd Conf. Learning Theory (Omnipress, Madison, WI), 41–53.Google Scholar
- (1999) The consistency of posterior distributions in nonparametric problems. Ann. Statist. 27(2):536–561.Crossref, Google Scholar
- , Sobel M(1954) A single-sample multiple decision procedure for ranking means of normal populations with known variances. Ann. Math. Statist. 25(2):16–39.Crossref, Google Scholar
- (2004) Bayesian statistics and the efficiency and ethics of clinical trials. Statist. Sci. 19(1):175–187.Crossref, Google Scholar
- (2009) Pure exploration in multi-armed bandits problems. Algorithmic Learning Theory, Lecture Notes in Computer Science, vol. 5809 (Springer, Berlin), 23–37.Google Scholar
- (2011) An empirical evaluation of Thompson sampling. Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, Weinberger KQ, eds. Proc. Adv. Neural Inform. Processing Systems NIPS 2011 (Curran Associates, Red Hook, NY), 2249–2257.Google Scholar
- (2015) Ranking and selection: Efficient simulation budget allocation. Fu MC, ed. Handbook of Simulation Optimization, International Series in Operations Research & Management Science, vol. 216 (Springer, New York), 45–80.Google Scholar
- (2008) Efficient simulation budget allocation for selecting an optimal subset. INFORMS J. Comput. 20(4):579–595.Link, Google Scholar
- (2000) Simulation budget allocation for further enhancing the efficiency of ordinal optimization. Discrete Event Dynam. Systems 10(3):251–270.Crossref, Google Scholar
- (1959) Sequential design of experiments. Ann. Math. Statist. 30(3):755–770.Crossref, Google Scholar
- (1975) Approaches in sequential design of experiments. Srivastava JN, ed. A Survey of Statistical Design and Linear Models (North-Holland, Amsterdam), 67–90.Google Scholar
- (2012) Sequential sampling with economics of selection procedures. Management Sci. 58(3):550–569.Link, Google Scholar
- (2009) Economic analysis of simulation selection problems. Management Sci. 55(3):421–437.Link, Google Scholar
- (2001) New two-stage and sequential procedures for selecting the best simulated system. Oper. Res. 49(5):732–743.Link, Google Scholar
- (2010) Sequential sampling to myopically maximize the expected value of information. INFORMS J. Comput. 22(1):71–80.Link, Google Scholar
- (1986) On the consistency of Bayes estimates. Ann. Statist. 14(1):1–26.Crossref, Google Scholar
- (2002) PAC bounds for multi-armed bandit and Markov decision processes. Kivinen J, Sloan RH, eds. COLT ‘02 Proc. 15th Annual Conf. Comput. Learn. Theory, Lecture Notes in Computer Science, vol. 2375 (Springer, Berlin), 255–270.Google Scholar
- (2016) Indifference-zone-free selection of the best. Oper. Res. 64(6):1499–1514.Link, Google Scholar
- (2018) Online network revenue management using Thompson sampling. Oper. Res. 66(6):1586–1602.Link, Google Scholar
- (2014) A fully sequential elimination procedure for indifference-zone ranking and selection with tight bounds on probability of correct selection. Oper. Res. 62(4):926–942.Link, Google Scholar
- (2008) A knowledge-gradient policy for sequential information collection. SIAM J. Control Optim. 47(5):2410–2439.Crossref, Google Scholar
- (1963) On the asymptotic behavior of Bayes’ estimates in the discrete case. Ann. Math. Statist. 34(4):1386–1403.Crossref, Google Scholar
- (2012) Best arm identification: A unified approach to fixed budget and fixed confidence. Pereira F, Burges CJC, Bottou L, Weinberger KQ, eds. NIPS’12 Proc. 25th Internat. Conf. Neural Inform. Processing Systems, vol. 2 (Curran Associates, Red Hook, NY), 3212–3220.Google Scholar
- (2016) Optimal best arm identification with fixed confidence. Feldman V, Rakhlin A, Shamir O, eds. Proc. Conf. Learn. Theory (COLT) 2016, Proceedings of Machine Learning Research, vol. 49 (PMLR), 1028–1050.Google Scholar
- (2000) Convergence rates of posterior distributions. Ann. Statist. 28(2):500–531.Crossref, Google Scholar
- (1979) Bandit processes and dynamic allocation indices. J. Roy. Statist. Soc. B. 41(2):148–164.Crossref, Google Scholar
- (1974) A dynamic allocation index for the sequential design of experiments. Gani J, ed., Progress in Statistics (North-Holland, Amsterdam), 241–266.Google Scholar
- (2004) A large deviations perspective on ordinal optimization. Ingalls RG, Rossetti MD, Smith JS, Peters BA, eds. Proc. 2004 Winter Simulation Conf. 2004, vol. 1 (IEEE, Piscataway, NJ).Google Scholar
- (2015) Selecting the best system and multi-armed bandits. Preprint, submitted July 16, https://arxiv.org/abs/1507.04564.Google Scholar
- (2014) Thompson sampling for complex online problems. Xing EP, Jebara T, eds. ICML 14 Proc. 31st Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 32 (PMLR), 100–108.Google Scholar
- (2010) Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine. Fürnkranz J, Joachims T, eds. Proc. 27th Internat. Conf. Machine Learn. (ICML-10) (Omnipress, Madison, WI), 13–20.Google Scholar
- (1996) Bayesian look ahead one-stage sampling allocations for selection of the best population. J. Statist. Planning Inference 54(2):229–244.Crossref, Google Scholar
- (2016) Optimistic Gittins indices. Lee DD, von Luxburg U, Garnett R, Sugiyama M, Guyon I, eds. NIPS’16 Proc. 30th Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Red Hook, NY), 3161–3169.Google Scholar
- (2015) Discrete optimization via simulation. Fu MC, ed. Handbook of Simulation Optimization, International Series in Operations Research and Management Science, vol. 216 (Springer, New York), 9–44.Google Scholar
- (2013) Optimal sampling laws for stochastically constrained simulation optimization on finite sets. INFORMS J. Comput. 25(3):527–542.Link, Google Scholar
- (2014) Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. 48th Annual Conf. Inform. Sci. Systems (CISS) (IEEE, Piscataway, NJ), 1–6.Google Scholar
- (1982) Asymptotically optimal procedures for sequential adaptive selection of the best of several normal means. Gupta SS, Berger JO, eds. Statistical Decision Theory and Related Topics III (Academic Press, New York), 55–86.Google Scholar
- (2013) Almost optimal exploration in multi-armed bandits. Dasgupta S, McAllester D, eds. Proc. 30th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 28 (PMLR), 1238–1246.Google Scholar
- (2018) On Bayesian index policies for sequential resource allocation. Ann. Statist. 46(2):842–865.Crossref, Google Scholar
- (2013) Information complexity in bandit subset selection. J. Machine Learn. Res. 30:228–251.Google Scholar
- (2018) Mixture Martingales revisited with applications to sequential tests and confidence intervals. Preprint, submitted November 28, https://arxiv.org/abs/1811.11419. Google Scholar
- (2012) On Bayesian upper confidence bounds for bandit problems. Lawrence ND, Girolami MA, eds. Proc. 15th Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research, vol. 22 (PMLR), 592–600.Google Scholar
- (2014) On the complexity of best-arm identification in multi-armed bandit models. J. Machine Learning Res. 17.1(2016):1–42.Google Scholar
- (2012) Thompson sampling: An asymptotically optimal finite time analysis. Bshouty NH, Stoltz G, Vayatis N, Zeugmann T, eds. Internat. Conf. Algorithmic Learn. Theory, Lecture Notes in Computer Science, vol. 7568 (Springer, Berlin), 199–213.Google Scholar
- (1984) Second order efficiency in the sequential design of experiments. Ann. Statist. 12(2):510–532.Crossref, Google Scholar
- (1963) Asymptotically optimum sequential inference and design. Ann. Math. Statist. 34(3):705–750.Crossref, Google Scholar
- (2006) Selecting the best system. Henderson SG, Nelson BL, eds. Simulation, Handbooks in Operations Research and Management Science, vol. 13 (Elsevier, Amsterdam), 501–534.Google Scholar
- (2007) Recent advances in ranking and selection. Henderson S, Biller B, Hsieh M-h, Shortle J, eds. Proc. 39th Conf. Winter Simulation: 40 Years! Best Is Yet to Come (IEEE, Piscataway, NJ), 162–172.Google Scholar
- (2013) Thompson sampling for one-dimensional exponential family bandits. Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. NIPS’13 Proc. 26th Internat. Conf. Neural Inform. Processing Systems, vol. 1 (Curran Associates, Red Hook, NY), 1448–1456.Google Scholar
- (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.Crossref, Google Scholar
- (2004) The sample complexity of exploration in the multi-armed bandit problem. J. Machine Learn. Res. 5:623–648.Google Scholar
- (2013) Active sequential hypothesis testing. Ann. Statist. 41(6):2703–2738.Crossref, Google Scholar
- (2017) Efficient ranking and selection in parallel computing environments. Oper. Res. 65(3):821–836.Link, Google Scholar
- (2013) Controlled sensing for multihypothesis testing. IEEE Trans. Automat. Control. 58(10):2451–2464.Crossref, Google Scholar
- (2015) Stochastically constrained ranking and selection via score. ACM Trans. Model. Comput. Simul. (TOMACS) 25(1):Article 1.Crossref, Google Scholar
- (1964) A sequential procedure for selecting the population with the largest mean from k normal populations. Ann. Math. Statist. 35(1):174–180.Crossref, Google Scholar
- (2017) Improving the expected improvement algorithm. von Luxburg U, Guyon I, Bengio S, Wallach H, Fergus R, eds. NIPS’17 Proc. 31st Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Red Hook, NY), 5387–5397.Google Scholar
- (1978) On two-stage selection procedures and related probability-inequalities. Comm. Statist. Theory Methods 7(8):799–811.Crossref, Google Scholar
- (2017) Learning to optimize via information-directed sampling. Oper. Res. 66(1):230–252.Link, Google Scholar
- (2016) On the convergence rates of expected improvement methods. Oper. Res. 64(6):1515–1528.Link, Google Scholar
- (2012) The knowledge gradient algorithm for a general class of online learning problems. Oper. Res. 60(1):180–195.Link, Google Scholar
- (2016) Overview of content experiments: Multi-armed bandit experiments. Accessed November 9, 2016, https://support.google.com/analytics/answer/2844870?hl=en.Google Scholar
- (2012) Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Trans. Inform. Theory 58(5):3250–3265.Google Scholar
- (2013) Automatic ad format selection via contextual bandits. Proc. 22nd ACM Internat. Conf. Inform. Knowledge Management (ACM, New York), 1587–1594.Google Scholar
- (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.Crossref, Google Scholar
- (2015) Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Statist. Sci. 30(2):199–215.Crossref, Google Scholar

