Rate-Optimal Bayesian Simple Regret in Best Arm Identification

Junpei Komiyama
Corresponding Author
Junpei Komiyama
[email protected]
https://orcid.org/0000-0003-0095-6558
Stern School of Business, New York University, New York, New York 10012;
Search for more papers by this author
,
Kaito Ariu
Kaito Ariu
[email protected]
https://orcid.org/0000-0001-6286-9906
Artificial Intelligence Laboratory, CyberAgent, Inc., Shibuya, Tokyo 150-0042, Japan;School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, 114 28 Stockholm, Sweden;
Search for more papers by this author
,
Masahiro Kato
Masahiro Kato
[email protected]
Artificial Intelligence Laboratory, CyberAgent, Inc., Shibuya, Tokyo 150-0042, Japan;
Search for more papers by this author
,
Chao Qin
Chao Qin
[email protected]
Columbia Business School, Columbia University, New York, New York 10027
Search for more papers by this author

Junpei Komiyama

Corresponding Author

Junpei Komiyama

[email protected]

https://orcid.org/0000-0003-0095-6558

Stern School of Business, New York University, New York, New York 10012;

Search for more papers by this author

Kaito Ariu

[email protected]

https://orcid.org/0000-0001-6286-9906

Artificial Intelligence Laboratory, CyberAgent, Inc., Shibuya, Tokyo 150-0042, Japan;School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, 114 28 Stockholm, Sweden;

Search for more papers by this author

Masahiro Kato

[email protected]

Artificial Intelligence Laboratory, CyberAgent, Inc., Shibuya, Tokyo 150-0042, Japan;

Search for more papers by this author

Chao Qin

[email protected]

Columbia Business School, Columbia University, New York, New York 10027

Search for more papers by this author

Published Online:25 Aug 2023https://doi.org/10.1287/moor.2022.0011

References

[1] Agrawal S, Goyal N (2012) Analysis of Thompson sampling for the multi-armed bandit problem. Mannor S, Srebro N, Williamson RC, eds. Conf. Learn. Theory (JMLR.org), 39.1–39.26.Google Scholar
[2] Audibert J, Bubeck S, Munos R (2010) Best arm identification in multi-armed bandits. Kalai AT, Mohri M, eds. Conf. Learn. Theory (Omnipress Inc., Pensylvania), 41–53.Google Scholar
[3] Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47:235–256.Crossref, Google Scholar
[4] Bubeck S, Munos R, Stoltz G (2011) Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Comput. Sci. 412(19):1832–1852.Crossref, Google Scholar
[5] Bull AD (2011) Convergence rates of efficient global optimization algorithms. J. Machine Learn. Res. 12(88):2879–2904.Google Scholar
[6] Carpentier A, Locatelli A (2016) Tight (lower) bounds for the fixed budget best arm identification bandit problem. Feldman V, Rakhlin A, Shamir O, eds. Conf. Learn. Theory (JMLR.org), 590–604.Google Scholar
[7] Chen CH, Lin J, Yücesan E, Chick SE (2000) Simulation budget allocation for further enhancing theefficiency of ordinal optimization. Discrete Event Dynamic Systems 10(3):251–270.Crossref, Google Scholar
[8] Chernoff H (1959) Sequential design of experiments. Ann. Math. Statist. 30(3):755–770.Crossref, Google Scholar
[9] Even-Dar E, Mannor S, Mansour Y, Mahadevan S (2006) Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Machine Learn. Res. 7:1079–1105.Google Scholar
[10] Frazier PI (2018) Bayesian optimization. Recent advances in optimization and modeling of contemporary problems. INFORMS TutORials in Operations Research, 255–278.Google Scholar
[11] Gittins JC (1989) Multi-armed Bandit Allocation Indices (Wiley, Chichester, NY).Google Scholar
[12] Glynn PW, Juneja S (2004) A large deviations perspective on ordinal optimization. Conf. Winter Simulation (IEEE Computer Society), 577–585.Google Scholar
[13] Granmo OC (2008) A Bayesian learning automaton for solving two-armed Bernoulli bandit problems. Bramer M, Petridis M, Coenen F, eds. SGAI Internat. Conf. Artificial Intelligence (Springer, London), 23–30.Google Scholar
[14] Ho YC, Sreenivas R, Vakili P (1992) Ordinal optimization of DEDS. Discrete Event Dynamic Systems 2:61–88.Crossref, Google Scholar
[15] Honda J, Takemura A (2015) Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards. J. Machine Learn. Res. 16(113):3721–3756.Google Scholar
[16] Hong LJ, Fan W, Luo J (2021) Review on ranking and selection: A new perspective. Frontiers Engrg. Management 8(3):321–343.Crossref, Google Scholar
[17] Kaufmann E (2014) Analysis of Bayesian and frequentist strategies for sequential resource allocation. PhD thesis, Telecom ParisTech, France.Google Scholar
[18] Kaufmann E, Cappé O, Garivier A (2016) On the complexity of best-arm identification in multi-armed bandit models. J. Machine Learn. Res. 17(1):1–42.Google Scholar
[19] Kaufmann E, Korda N, Munos R (2012) Thompson sampling: An asymptotically optimal finite-time analysis. Bshouty NH, Stoltz G, Vayatis N, Zeugmann T, eds. Algorithmic Learn. Theory (Springer, Berlin, Germany), 199–213.Crossref, Google Scholar
[20] Komiyama J, Tsuchiya T, Honda J (2022) Minimax optimal algorithms for fixed-budget best arm identification. Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, eds. Adv. Conf. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY).Google Scholar
[21] Lai TL (1987) Adaptive treatment allocation and the multi-armed bandit problem. Ann. Statist. 15(3):1091–1114.Crossref, Google Scholar
[22] Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.Crossref, Google Scholar
[23] Maron O, Moore AW (1997) The racing algorithm: Model selection for lazy learners. Artificial Intelligence Rev. 11(1):193–225.Crossref, Google Scholar
[24] Paulson E (1964) A sequential procedure for selecting the population with the largest mean from k normal populations. Ann. Math. Statist. 35(1):174–180.Crossref, Google Scholar
[25] Peng Y, Chen CH, Fu MC, Hu JQ (2016) Dynamic sampling allocation and design selection. INFORMS J. Comput. 28(2):195–208.Link, Google Scholar
[26] Qin C, Russo D (2022) Adaptivity and confounding in multi-armed bandit experiments. CoRRabs/2202.09036, URL https://arxiv.org/abs/2202.09036.Google Scholar
[27] Qin C, Klabjan D, Russo D (2017) Improving the expected improvement algorithm. Guyon I, von Luxburg U, Bengio S, Wallach HM, Fergus R, Vishwanathan SVN, Garnett R, eds. Adv. Neural Inform. Processing Systems 30 (Curran Associates, Inc., Red Hook, NY) 5381–5391.Google Scholar
[28] Robbins H (1952) Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. (N.S.) 58(5):527–535.Crossref, Google Scholar
[29] Russo D (2020) Simple Bayesian algorithms for best-arm identification. Oper. Res. 68(6):1625–1647.Link, Google Scholar
[30] Russo D (2021) Technical note—A note on the equivalence of upper confidence bounds and Gittins indices for patient agents. Oper. Res. 69(1):273–278.Link, Google Scholar
[31] Russo D, Van Roy B (2018) Learning to optimize via information-directed sampling. Oper. Res. 66(1):230–252.Link, Google Scholar
[32] Ryzhov IO (2016) On the convergence rates of expected improvement methods. Oper. Res. 64(6):1515–1528.Link, Google Scholar
[33] Shahrampour S, Noshad M, Tarokh V (2017) On sequential elimination algorithms for best-arm identification in multi-armed bandits. IEEE Trans. Signal Processing 65(16):4281–4292.Crossref, Google Scholar
[34] Shang X, de Heide R, Ménard P, Kaufmann E, Valko M (2020) Fixed-confidence guarantees for Bayesian best-arm identification. Chiappa S, Calandra R, eds. Internat. Conf. Artificial Intelligence Statist. (JMLR.org), vol. 108, 1823–1832.Google Scholar
[35] Srinivas N, Krause A, Kakade SM, Seeger MW (2010) Gaussian process optimization in the bandit setting: No regret and experimental design. Furnkranz J, Joachims T, eds. Internat. Conf. Machine Learn. (JMLR.org), 1015–1022.Google Scholar
[36] Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.Crossref, Google Scholar
[37] Vakili S, Bouziani N, Jalali S, Bernacchia A, Shiu D-s (2021) Optimal order simple regret for Gaussian process bandits. Ranzato M, Beygelzimer A, Dauphin YN, Liang P, Vaughan JW, eds. Adv. Conf. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 21202–21215.Google Scholar
[38] Weber R (1992) On the Gittins index for multiarmed bandits. Ann. Appl. Probab. 2(4):1024–1033.Crossref, Google Scholar

cover image Mathematics of Operations Research

Volume 49, Issue 3

August 2024

Pages 1303-2047, C2

Article Information

Metrics

Information

Received:January 07, 2022
Accepted:July 09, 2023
Published Online:August 25, 2023

Cite as

Junpei Komiyama, Kaito Ariu, Masahiro Kato, Chao Qin (2023) Rate-Optimal Bayesian Simple Regret in Best Arm Identification. Mathematics of Operations Research 49(3):1629-1646.

https://doi.org/10.1287/moor.2022.0011

Keywords

Acknowledgments

The authors thank Po-An Wang and Kenshi Abe for several suggestions. The authors thank Assaf Zeevi for discussion on the stability of algorithms.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Rate-Optimal Bayesian Simple Regret in Best Arm Identification

References

Volume 49, Issue 3

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News