Dual-Directed Algorithm Design for Efficient Pure Exploration

Chao Qin
Chao Qin
[email protected]
https://orcid.org/0000-0002-4600-6147
Stanford Graduate School of Business, Stanford University, Stanford, California 94305
Search for more papers by this author
,
Wei You
Corresponding Author
Wei You
[email protected]
https://orcid.org/0000-0003-0844-4194
Department of Industrial Engineering and Decision Analytics, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, China
Search for more papers by this author

Stanford Graduate School of Business, Stanford University, Stanford, California 94305

Corresponding Author

Wei You

Department of Industrial Engineering and Decision Analytics, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, China

Search for more papers by this author

Published Online:15 Jul 2025https://doi.org/10.1287/opre.2023.0590

References

Agrawal S, Juneja S, Glynn P (2020) Optimal δ-correct best-arm selection for heavy-tailed distributions. Kontorovich A, Neu G, eds. Proc. 31st Internat. Conf. Algorithmic Learn. Theory, vol. 117 (PMLR, Cambridge, MA), 61–110.Google Scholar
Al Marjani A, Kocak T, Garivier A (2022) On the complexity of all ε-best arms identification. Amini M-R, Canu S, Fischer A, Guns T, Novak PK, Tsoumakas G, eds. Proc. Joint Eur. Conf. Machine Learn. Knowledge Discovery Databases (Springer, Switzerland), 317–332.Google Scholar
Apostol TM (1969) Calculus, vol. 2, 2nd ed. (John Wiley & Sons, New York).Google Scholar
Avci H, Nelson BL, Song E, Wächter A (2023) Using cache or credit for parallel ranking and selection. ACM Trans. Modeling Comput. Simulations 33(4):1–28.Crossref, Google Scholar
Bandyopadhyay A, Juneja SK, Agrawal S (2024) Optimal top two method for best arm identification and fluid analysis. Globerson A, Mackey L, Belgrave D, Fan A, Paquet U, Tomczak J, Zhang C, eds. Proc. 38th Ann. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY).Google Scholar
Bechhofer RE (1954) A single-sample multiple decision procedure for ranking means of normal populations with known variances. Ann. Math. Statist. 25(1):16–39.Crossref, Google Scholar
Bubeck S, Wang T, Viswanathan N (2013) Multiple identifications in multi-armed bandits. Proc. Internat. Conf. Machine Learn. (JMLR, Cambridge, MA), 258–265.Google Scholar
Chapelle O, Li L (2011) An empirical evaluation of Thompson sampling. Proc. Adv. Neural Inform. Processing Systems, vol. 24 (Curran Associates Inc., Red Hook, NY).Google Scholar
Chen Y, Ryzhov IO (2023) Balancing optimal large deviations in sequential selection. Management Sci. 69(6):3457–3473.Link, Google Scholar
Chen CH, He D, Fu M, Lee LH (2008) Efficient simulation budget allocation for selecting an optimal subset. INFORMS J. Comput. 20(4):579–595.Link, Google Scholar
Chen CH, Lin J, Yücesan E, Chick SE (2000) Simulation budget allocation for further enhancing the efficiency of ordinal optimization. Discrete Event Dynamic Systems 10:251–270.Crossref, Google Scholar
Chernoff H (1952) Sequential design of experiments. Ann. Math. Statist. 30(3):755–770.Crossref, Google Scholar
Cover TM, Thomas JA (2006) Elements of Information Theory (John Wiley & Sons, Inc., Hoboken, NJ).Google Scholar
Degenne R, Koolen WM (2019) Pure exploration with multiple correct answers. Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 32 (Curran Associates Inc., Red Hook, NY), 14591–14600.Google Scholar
Eckman DJ, Henderson SG (2018) Guarantees on the probability of good selection. Proc. Winter Simulation Conf. (IEEE, Piscataway, NJ), 351–365.Google Scholar
Fan W, Hong LJ, Nelson BL (2016) Indifference-zone-free selection of the best. Oper. Res. 64(6):1499–1514.Link, Google Scholar
Fan W, Hong LJ, Jiang G, Luo J (2024) Review of large-scale simulation optimization. Preprint, submitted March 23, https://arxiv.org/abs/2403.15669.Google Scholar
Ferreira KJ, Simchi-Levi D, Wang H (2018) Online network revenue management using Thompson sampling. Oper. Res. 66(6):1586–1602.Link, Google Scholar
Frazier PI, Powell WB, Dayanik S (2008) A knowledge-gradient policy for sequential information collection. SIAM J. Control Optim. 47(5):2410–2439.Crossref, Google Scholar
Fu MC, Henderson SG (2017) History of seeking better solutions, aka simulation optimization. Proc. Winter Simulation Conf. (IEEE, Piscataway, NJ), 131–157.Google Scholar
Gao S, Chen W (2015) A note on the subset selection for simulation optimization. Proc. Winter Simulation Conf. (IEEE, Piscataway, NJ), 3768–3776.Google Scholar
Garivier A, Kaufmann E (2016) Optimal best arm identification with fixed confidence. Proc. Conf. Learn. Theory (JMLR, Cambridge, MA), 998–1027.Google Scholar
Garivier A, Kaufmann E (2021) Nonasymptotic sequential tests for overlapping hypotheses applied to near-optimal arm identification in bandit models. Sequential Anal. 40(1):61–96.Crossref, Google Scholar
Glynn P, Juneja S (2004) A large deviations perspective on ordinal optimization. Proc. Winter Simulation Conf. (ACM, New York).Google Scholar
Glynn P, Juneja S (2018) Selecting the best system and multi-armed bandits. Preprint, submitted July 16, https://arxiv.org/abs/1507.04564.Google Scholar
Graepel T, Candela JQ, Borchert T, Herbrich R (2010) Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine. Proc. 27th Internat. Conf. Machine Learn. (Omnipress, Madison, WI).Google Scholar
Hong LJ, Fan W, Luo J (2021) Review on ranking and selection: A new perspective. Frontiers Engrg. Management 8(3):321–343.Crossref, Google Scholar
Hong LJ, Jiang G, Zhong Y (2022) Solving large-scale fixed-budget ranking and selection problems. INFORMS J. Comput. 34(6):2930–2949.Link, Google Scholar
Jennison C, Johnstone IM, Turnbull BW (1982) Asymptotically optimal procedures for sequential adaptive selection of the best of several normal means. Statistical Decision Theory and Related Topics III (Elsevier, Amsterdam), 55–86.Crossref, Google Scholar
Jourdan M, Degenne R, Kaufmann E (2023a) Dealing with unknown variances in best-arm identification. Proc. Internat. Conf. Algorithmic Learn. Theory (JMLR, Cambridge, MA), 776–849.Google Scholar
Jourdan M, Degenne R, Kaufmann E (2023b) An ε-best-arm identification algorithm for fixed-confidence and beyond. Proc. 37th Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY).Google Scholar
Jourdan M, Degenne R, Baudry D, de Heide R, Kaufmann E (2022) Top two algorithms revisited. Adv. Neural Inform. Processing Systems 35:26791–26803.Google Scholar
Kalyanakrishnan S, Tewari A, Auer P, Stone P (2012) PAC subset selection in stochastic multi-armed bandits. Proc. 29th Internat. Conf. Machine Learn., vol. 12 (Omnipress, Madison, WI), 655–662. Google Scholar
Kato M, Ariu K (2024) The role of contextual information in best arm identification. Preprint, submitted June 26, https://arxiv.org/abs/2106.14077.Google Scholar
Kaufmann E, Kalyanakrishnan S (2013) Information complexity in bandit subset selection. Proc. Conf. Learn. Theory (JMLR, Cambridge, MA), 228–251.Google Scholar
Kaufmann E, Koolen WM (2021) Mixture martingales revisited with applications to sequential tests and confidence intervals. J. Machine Learn. Res. 22(1):11140–11183.Google Scholar
Kaufmann E, Koolen WM, Garivier A (2018) Sequential test for the lowest mean: From Thompson to Murphy sampling. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 31. (Curran Associates Inc., Red Hook, NY).Google Scholar
Kim SH, Nelson BL (2001) A fully sequential procedure for indifference-zone selection in simulation. ACM Trans. Modeling Comput. Simulations 11(3):251–273.Crossref, Google Scholar
Komiyama J (2024) Suboptimal performance of the bayes optimal algorithm in frequentist best arm identification. Preprint, submitted February 10, https://arxiv.org/abs/2202.05193.Google Scholar
Lai TL, Liao OYW (2012) Efficient adaptive randomization and stopping rules in multi-arm clinical trials for testing a new treatment. Sequence Anal. 31(4):441–457.Crossref, Google Scholar
Li Z, Fan W, Hong LJ (2025) The (surprising) sample optimality of greedy procedures for large-scale ranking and selection. Management Sci. 71(2):1238–1259.Link, Google Scholar
Locatelli A, Gutzeit M, Carpentier A (2016) An optimal algorithm for the thresholding bandit problem. Proc. Internat. Conf. Machine Learn. (JMLR, Cambridge, MA), 1690–1698.Google Scholar
Mason B, Jain L, Tripathy A, Nowak R (2020) Finding all ϵ-good arms in stochastic bandits. Adv. Neural Inform. Processing Systems 33:20707–20718.Google Scholar
Ménard P (2019) Gradient ascent for active exploration in bandit problems. Preprint, submitted May 20, https://arxiv.org/abs/1905.08165.Google Scholar
Nelson BL, Matejcik FJ (1995) Using common random numbers for indifference-zone selection and multiple comparisons in simulation. Management Sci. 41(12):1935–1945.Link, Google Scholar
Perng SK (1969) A comparison of the asymptotic expected sample sizes of two sequential procedures for ranking problem. Ann. Math. Statist. 40(6):2198–2202.Crossref, Google Scholar
Pryzant R, Iter D, Li J, Lee YT, Zhu C, Zeng M (2023) Automatic prompt optimization with “gradient descent” and beam search. Proc. Conf. Empirical Methods Natural Language Processing (Association for Computational Linguistics, Stroudsburg, PA).Google Scholar
Qiao G, Tewari A (2024) An asymptotically optimal algorithm for the convex hull membership problem. Preprint, submitted February 3, https://arxiv.org/abs/2302.02033.Google Scholar
Qin C, Klabjan D, Russo D (2017) Improving the expected improvement algorithm. Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 30. (Curran Associates Inc., Red Hook, NY).Google Scholar
Russo D (2020) Simple Bayesian algorithms for best-arm identification. Oper. Res. 68(6):1625–1647.Link, Google Scholar
Russo D, Roy BV, Kazerouni A, Osband I, Wen Z (2018) A tutorial on Thompson sampling. Frontiers Machine Learn. 11(1):1–96.Crossref, Google Scholar
Shang X, Kaufmann E, Valko M (2019) A simple dynamic bandit algorithm for hyper-parameter tuning. Proc. 6th ICML Workshop Automated Machine Learn. (ICML, San Diego, CA).Google Scholar
Shang X, de Heide R, Ménard P, Kaufmann E, Valko M (2020) Fixed-confidence guarantees for Bayesian best-arm identification. Proc. Internat. Conf. Artificial Intelligence Statist. (JMLR, Cambridge, MA), 1823–1832.Google Scholar
Simchi-Levi D, Zheng Z, Zhu F (2024) A simple and optimal policy design with safety against heavy-tailed risk for stochastic bandits. Management Sci., ePub ahead of print October 30, https://doi.org/10.1287/mnsc.2022.03512.Google Scholar
Simchowitz M, Jamieson K, Recht B (2017) The Simulator: Understanding adaptive sampling in the moderate-confidence regime. Proc. Conf. Learn. Theory (JMLR, Cambridge, MA), 1794–1834.Google Scholar
Tirinzoni A, Degenne R (2022) On elimination strategies for bandit fixed-confidence identification. Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, eds. Adv. Neural Inform. Processing Systems, vol 35. (Curran Associates Inc., Red Hook, NY).Google Scholar
Wang PA, Tzeng RC, Proutiere A (2021) Fast pure exploration via Frank-Wolfe. Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Wortman Vaughan J, eds. Adv. Neural Inform. Processing Systems, vol. 34. (Curran Associates Inc., Red Hook, NY).Google Scholar
Wang W, Wan H, Chen X (2024) Bonferroni-free and indifference-zone-flexible sequential elimination procedures for ranking and selection. Oper. Res. 72(5):2119–2134.Link, Google Scholar
Wu D, Zhou E (2018a) Analyzing and provably improving fixed budget ranking and selection algorithms. Preprint, submitted November 26, https://arxiv.org/abs/1811.12183.Google Scholar
Wu D, Zhou E (2018b) Provably improving the optimal computing budget allocation algorithm. Proc. Winter Simulation Conf. (IEEE, Piscataway, NJ), 1921–1932.Google Scholar
You W, Qin C, Wang Z, Yang S (2023) Information-directed selection for top-two algorithms. Proc. 36th Annual Conf. Learn. Theory (JMLR, Cambridge, MA), 2850–2851.Google Scholar
Zhang G, Peng Y, Zhang J, Zhou E (2023) Asymptotically optimal sampling policy for selecting top-m alternatives. INFORMS J. Comput. 35(6):1261–1285.Link, Google Scholar
Zhong Y, Hong LJ (2022) Knockout-tournament procedures for large-scale ranking and selection in parallel computing environments. Oper. Res. 70(1):432–453.Link, Google Scholar
Zhou X, Hao B, Lattimore T, Kang J, Li L (2024) Sequential best-arm identification with application to P300 speller. Trans. Machine Learn. Res. (MIT Press, Cambridge, MA).Google Scholar

Volume 74, Issue 2

March-April 2026

Pages v-ix, 573-1152, iii-iv

Article Information

Supplemental Material

Metrics

Information

Received:October 30, 2023
Accepted:May 13, 2025
Published Online:July 15, 2025

Cite as

Chao Qin, Wei You (2025) Dual-Directed Algorithm Design for Efficient Pure Exploration. Operations Research 74(2):1104-1125.

https://doi.org/10.1287/opre.2023.0590

Keywords

Acknowledgments

The authors thank the anonymous reviewers and associate editor for valuable feedback that significantly improved the paper and Daniel Russo, Sandeep Juneja, Po-An Wang, Shane Henderson, Xiaowei Zhang, Jun Luo, the participants of the 2023 INFORMS Applied Probability Society Conference, the 2023 INFORMS Annual Meeting, and the 2024 INFORMS MSOM Conference for their insightful comments on this work. A preliminary version of this paper appeared as an extended abstract in the Proceedings of the 36th Annual Conference on Learning Theory, with the title “Information-Directed Selection for Top-Two Algorithms.”

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Dual-Directed Algorithm Design for Efficient Pure Exploration

References

Volume 74, Issue 2

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News