Multi-armed Bandit Experimental Design: Online Decision-Making and Adaptive Inference

David Simchi-Levi
David Simchi-Levi
[email protected]
https://orcid.org/0000-0002-4650-1519
Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139; and Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139; and Operations Research Center, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139
Search for more papers by this author
,
Chonghuan Wang
Corresponding Author
Chonghuan Wang
[email protected]
https://orcid.org/0000-0003-4887-6004
Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139; and Center for Computational Science and Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139
Search for more papers by this author

Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139; and Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139; and Operations Research Center, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139

Search for more papers by this author

Chonghuan Wang

Corresponding Author

Chonghuan Wang

[email protected]

https://orcid.org/0000-0003-4887-6004

Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139; and Center for Computational Science and Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139

Search for more papers by this author

Published Online:20 Sep 2024https://doi.org/10.1287/mnsc.2023.00492

References

Agarwal A, Ghuge R, Nagarajan V (2022) Batched dueling bandits. Chaudhuri K, Jegelka S, Song Le, Szepesvari C, Niu G, Sabato S, eds. Proc. 39th Internat. Conf. Machine Learn., vol. 162 (PMLR, New York), 89–110.Google Scholar
Agrawal R (1995) Sample mean based index policies by o (log n) regret for the multi-armed bandit problem. Adv. Appl. Probab. 27(4):1054–1078.Crossref, Google Scholar
Agrawal R, Hedge M, Teneketzis D (1988) Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Trans. Automatic Control 33(10):899–906.Crossref, Google Scholar
Aronow PM, Samii C (2017) Estimating average causal effects under general interference, with application to a social network experiment. Ann. Appl. Statist. 11(4):1912–1947.Crossref, Google Scholar
Atan O, Zame WR, van der Schaar M (2019) Sequential patient recruitment and allocation for adaptive clinical trials. Chaudhuri K, Sugiyama M, eds. Proc. 22nd Internat. Conf. Artificial Intelligence Statist., vol. 89 (PMLR, New York), 1891–1900.Google Scholar
Athey S, Wager S (2021) Policy learning with observational data. Econometrica 89(1):133–161.Crossref, Google Scholar
Athey S, Eckles D, Imbens GW (2018) Exact p-values for network interference. J. Amer. Statist. Assoc. 113(521):230–240.Crossref, Google Scholar
Auer P (2002) Using confidence bounds for exploitation-exploration trade-offs. J. Machine Learn. Res. 3(November):397–422.Google Scholar
Auer P, Cesa-Bianchi N, Fischer P (2002a) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2–3):235–256.Crossref, Google Scholar
Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2002b) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.Crossref, Google Scholar
Bhat N, Farias VF, Moallemi CC, Sinha D (2020) Near-optimal ab testing. Management Sci. 66(10):4477–4495.Link, Google Scholar
Bojinov I, Rambachan A, Shephard N (2021) Panel experiments and dynamic causal effects: A finite population perspective. Quant. Econom. 12(4):1171–1196.Crossref, Google Scholar
Bojinov I, Simchi-Levi D, Zhao J (2023) Design and analysis of switchback experiments. Management Sci. 69(7):3759–3777.Google Scholar
Caria S, Kasy M, Quinn S, Shami S, Teytelboym A (2024) An adaptive targeted field experiment: Job search assistance for refugees in Jordan. J. Eur. Econom. Assoc. 22(2):781–836.Google Scholar
Carpentier A, Lazaric A, Ghavamzadeh M, Munos R, Auer P (2011) Upper-confidence-bound algorithms for active learning in multi-armed bandits. Kivinen J, Szepesvári C, Ukkonen E, Zeugmann T, eds. Internat. Conf. Algorithmic Learn. Theory (Springer Berlin Heidelberg, Berlin, Heidelberg), 189–203.Google Scholar
Carranza AG, Krishnamurthy SK, Athey S (2023) Flexible and efficient contextual bandits with heterogeneous treatment effect oracles. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 7190–7212.Google Scholar
Cesa-Bianchi N, Dekel O, Shamir O (2013) Online learning with switching costs and other adaptive adversaries. Proc. 26th Internat. Conf. Neural Inform. Processing Systems, vol. 1 (Curran Associates Inc., Red Hook, NY), 1160–1168.Google Scholar
Chan HP, Lai TL (2006) Sequential generalized likelihood ratios and adaptive treatment allocation for optimal sequential selection. Sequential Anal. 25(2):179–201.Crossref, Google Scholar
Chapelle O, Li L (2011) An empirical evaluation of Thompson sampling. Proc. 24th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 2249–2257.Google Scholar
Chen N, Gao X, Xiong Y (2022) Debiasing samples from online learning using bootstrap. Proc. 25th Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 8514–8533.Google Scholar
Dimakopoulou M, Ren Z, Zhou Z (2024) Online multi-armed bandits with adaptive inference. Proc. 35th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 1939–1951.Google Scholar
Dimakopoulou M, Zhou Z, Athey S, Imbens G (2017) Estimation considerations in contextual bandits. Preprint, submitted November 19, https://arxiv.org/abs/1711.07077.Google Scholar
Dimakopoulou M, Zhou Z, Athey S, Imbens G (2019) Balanced linear contextual bandits. Proc. Thirty-Third AAAI Conf. Artificial Intelligence Thirty-First Innovative Appl. Artificial Intelligence Conf. Ninth AAAI Sympos. Educational Adv. Artificial Intelligence (AAAI Press, Palo Alto, CA).Google Scholar
Dudík M, Erhan D, Langford J, Li L (2014) Doubly robust policy evaluation and optimization. Statist. Sci. 29(4):485–511.Crossref, Google Scholar
Eckles D, Karrer B, Ugander J (2017) Design and analysis of experiments in networks: Reducing bias from interference. J. Causal Inference 5(1):20150021.Crossref, Google Scholar
Erraqabi A, Lazaric A, Valko M, Brunskill E, Liu YE (2017) Trading off rewards and errors in multi-armed bandits. Singh A, Zhu J, eds. Proc. 20th Internat. Conf. Artificial Intelligence Statist., vol. 54 (PMLR, New York), 709–717.Google Scholar
Esfandiari H, Karbasi A, Mehrabian A, Mirrokni V (2021) Regret bounds for batched bandits. Proc. Conf. AAAI Artificial Intelligence 35(8):7340–7348.Crossref, Google Scholar
Fan L, Glynn PW (2021) The fragility of optimized bandit algorithms. Preprint, submitted September 28, https://arxiv.org/abs/2109.13595.Google Scholar
Farias VF, Li AA, Peng T, Zheng AT (2022b) Markovian interference in experiments. Proc. 36th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY).Google Scholar
Farias V, Moallemi C, Peng T, Zheng A (2022a) Synthetically controlled bandits. Preprint, submitted February 14, https://arxiv.org/abs/2202.07079.Google Scholar
Fiez T, Gamez S, Chen A, Nassif H, Jain L (2022) Adaptive experimental design and counterfactual inference. Preprint, submitted October 25, https://arxiv.org/abs/2210.14369.Google Scholar
Freedman DA (1975) On tail probabilities for martingales. Ann. Probab. 3(1):100–118.Google Scholar
Gabillon V, Ghavamzadeh M, Lazaric A (2012) Best arm identification: A unified approach to fixed budget and fixed confidence. Proc. 25th Internat. Conf. Neural Inform. Processing Systems, vol. 2 (Curran Associates Inc., Red Hook, NY), 3212–3220.Google Scholar
Gao Z, Han Y, Ren Z, Zhou Z (2019) Batched multi-armed bandits problem. Proc. 33rd Internat. Conf. Neural Information Processing Systems (Curran Associates Inc., Red Hook, NY).Google Scholar
Garivier A, Cappé O (2011) The KL-UCB algorithm for bounded stochastic bandits and beyond. Proc. 24th Annual Conf. Learn. Theory (JMLR Workshop and Conference Proceedings), 359–376.Google Scholar
Garivier A, Kaufmann E (2016) Optimal best arm identification with fixed confidence. Feldman V, Rakhlin A, Shamir O, eds. 29th Annual Conf. Learn. Theory, vol. 49 (PMLR, New York), 998–1027.Google Scholar
Glynn PW, Johari R, Rasouli M (2020) Adaptive experimental design with temporal interference: A maximum likelihood approach. Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 15054–15064.Google Scholar
Greenewald K, Tewari A, Klasnja P, Murphy S (2017) Action centered contextual bandits. Proc. 31st Internat. Conf. Neural Information Processing Systems (Curran Associates Inc., Red Hook, NY), 5979–5987.Google Scholar
Hadad V, Hirshberg DA, Zhan R, Wager S, Athey S (2021) Confidence intervals for policy evaluation in adaptive experiments. Proc. Natl. Acad. Sci. USA 118(15):e2014602118.Crossref, Google Scholar
Hahn J, Hirano K, Karlan D (2011) Adaptive experimental design using the propensity score. J. Bus. Econom. Statist. 29(1):96–108.Crossref, Google Scholar
Ham DW, Bojinov I, Lindon M, Tingley M (2023) Design-based inference for multi-arm bandits. Preprint, submitted Feburary 27, https://arxiv.org/abs/2302.14136.Google Scholar
Han Q, Sun WW, Zhang Y (2022) Online statistical inference for matrix contextual bandit. Preprint, submitted December 21, https://arxiv.org/abs/2212.11385.Google Scholar
Heckman JJ, Vytlacil EJ (2001) Instrumental variables, selection models, and tight bounds on the average treatment effect. Lechner M, Pfeiffer F, eds. Econometric Evaluation of Labour Market Policies, ZEW Economic Studies, vol. 13 (Physica, Heidelberg, Germany), 1–15.Crossref, Google Scholar
Hirano K, Imbens GW, Ridder G (2003) Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71(4):1161–1189.Crossref, Google Scholar
Imbens GW, Angrist JD (1994) Identification and estimation of local average treatment effects. Econometrica 62(2):467–475.Google Scholar
Jennison C, Johnstone IM, Turnbull BW (1982) Asymptotically optimal procedures for sequential adaptive selection of the best of several normal means. Gupta SS, Berger JO, eds. Statistical Decision Theory and Related Topics III (Elsevier, Amsterdam), 55–86.Crossref, Google Scholar
Jin T, Tang J, Xu P, Huang K, Xiao X, Gu Q (2021) Almost optimal anytime algorithm for batched multi-armed bandits. Internat. Conf. Machine Learn. (PMLR, New York), 5065–5073.Google Scholar
Johari R, Pekelis L, Walsh DJ (2015) Always valid inference: Bringing sequential analysis to a/b testing. Preprint, submitted December 15, https://arxiv.org/abs/1512.04922.Google Scholar
Johari R, Li H, Liskovich I, Weintraub GY (2022) Experimental design in two-sided platforms: An analysis of bias. Management Sci. 68(10):7069–7089.Link, Google Scholar
Kallus N, Zhou A (2018) Confounding-robust policy improvement. Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 9289–9299.Google Scholar
Kalvit A, Zeevi A (2024) A closer look at the worst-case behavior of multi-armed bandit algorithms. Proc. 35th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 8807–8819.Google Scholar
Kasy M, Sautmann A (2021) Adaptive treatment assignment in experiments for policy choice. Econometrica 89(1):113–132.Crossref, Google Scholar
Kato M, Ishihara T, Honda J, Narita Y (2020) Efficient adaptive experimental design for average treatment effect estimation. Preprint, submitted Feburary 13, https://arxiv.org/abs/2002.05308.Google Scholar
Kaufmann E, Korda N, Munos R (2012) Thompson sampling: An asymptotically optimal finite-time analysis. Bshouty NH, Stoltz G, Vayatis N, Zeugmann T, eds. Internat. Conf. Algorithmic Learn. Theory (Springer Berlin Heidelberg, Berlin, Heidelberg), 199–213.Google Scholar
Kennedy EH (2020) Optimal doubly robust estimation of heterogeneous causal effects. Preprint, submitted April 29, https://arxiv.org/abs/2004.14497.Google Scholar
Khan S, Ugander J (2021) Adaptive normalization for IPW estimation. Preprint, submitted June 14, https://arxiv.org/abs/2106.07695.Google Scholar
Kim GS, Paik MC (2019) Contextual multi-armed bandit algorithm for semiparametric reward model. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learning, vol. 97 (PMLR, New York), 3389–3397.Google Scholar
Krishnamurthy A, Wu ZS, Syrgkanis V (2018) Semiparametric contextual bandits. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn. (PMLR, New York), 2776–2785.Google Scholar
Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.Crossref, Google Scholar
Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
Li L, Munos R, Szepesvári C (2015) Toward minimax off-policy value estimation. Lebanon G, Vishwanathan SVN, eds. Proc. Eighteenth Internat. Conf. Artificial Intelligence Statist., vol. 38 (PMLR, New York), 608–616.Google Scholar
Luedtke AR, Van Der Laan MJ (2016) Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Ann. Statist. 44(2):713–742.Crossref, Google Scholar
Mannor S, Tsitsiklis JN (2004) The sample complexity of exploration in the multi-armed bandit problem. J. Machine Learn. Res. 5(June):623–648.Google Scholar
Nie X, Tian X, Taylor J, Zou J (2018) Why adaptively collected data have negative bias and how to correct for it. Storkey A, Perez-Cruz F, eds. Proc. Twenty-First Internat. Conf. Artificial Intelligence Statist., vol. 84 (PMLR, New York), 1261–1269.Google Scholar
Offer-Westort M, Coppock A, Green DP (2021) Adaptive experimental design: Prospects and applications in political science. Amer. J. Political Sci. 65(4):826–844.Crossref, Google Scholar
Perchet V, Rigollet P, Chassang S, Snowberg E (2016) Batched bandit problems. Ann. Statist. 44(2):660–681.Crossref, Google Scholar
Qin C, Russo D (2022) Adaptivity and confounding in multi-armed bandit experiments. Preprint, submitted Feburary 18, https://arxiv.org/abs/2202.09036.Google Scholar
Russo D, Van Roy B (2016) An information-theoretic analysis of Thompson sampling. J. Machine Learn. Res. 17(1):2442–2471.Google Scholar
Scott SL (2015) Multi-armed bandit experiments in the online service economy. Appl. Stochastic Models Bus. Indust. 31(1):37–45.Crossref, Google Scholar
Seldin Y, Szepesvári C, Auer P, Abbasi-Yadkori Y (2013) Evaluation and analysis of the performance of the EXP3 algorithm in stochastic environments. Deisenroth MP, Szepesvári C, Peters J, eds. Proc. Tenth Eur. Workshop Reinforcement Learn., vol. 24 (PMLR, New York), 103–116.Google Scholar
Seldin Y, Cesa-Bianchi N, Auer P, Laviolette F, Shawe-Taylor J (2012) Pac-Bayes-Bernstein inequality for martingales and its application to multiarmed bandits. Proc. Workshop On-Line Trading Exploration Exploitation 2 (JMLR Workshop and Conference Proceedings), 98–111.Google Scholar
Simchi-Levi D, Xu Y (2023) Phase transitions in bandits with switching constraints. Management Sci. 69(12):7182–7201.Google Scholar
Simchi-Levi D, Zheng Z, Zhu F (2022) A simple and optimal policy design with safety against heavy-tailed risk for multi-armed bandits. Preprint, submitted June 7, https://arxiv.org/abs/2206.02969.Google Scholar
Swaminathan A, Joachims T (2015) Batch learning from logged bandit feedback through counterfactual risk minimization. J. Machine Learn. Res. 16(1):1731–1755.Google Scholar
Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.Crossref, Google Scholar
Villar SS, Bowden J, Wason J (2015) Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Statist Sci. 30(2):199–215.Crossref, Google Scholar
Wager S, Xu K (2021a) Diffusion asymptotics for sequential experiments. Preprint, submitted January 25, https://arxiv.org/abs/2101.09855.Google Scholar
Wager S, Xu K (2021b) Experimenting in equilibrium. Management Sci. 67(11):6694–6715.Link, Google Scholar
Wainwright MJ (2019) High-Dimensional Statistics: A Non-Asymptotic Viewpoint, vol. 48 (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
Wang YX, Agarwal A, Dudík M (2017) Optimal and adaptive off-policy evaluation in contextual bandits. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 3589–3597.Google Scholar
Xiong R, Athey S, Bayati M, Imbens G (2024) Optimal experimental design for staggered rollouts. Management Sci. 70(8):5317–5336.Google Scholar
Xu M, Qin T, Liu TY (2013) Estimation bias in multi-armed bandit algorithms for search advertising. Proc. 26th Internat. Conf. Neural Inform. Processing Systems, vol. 2 (Curran Associates Inc., Red Hook, NY), 2400–2408.Google Scholar
Yang F, Ramdas A, Jamieson KG, Wainwright M (2017) A framework for multi-A(rmed)/B(andit) testing with online FDR control. Proc. 31st Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 5959–5968.Google Scholar
Yao J, Brunskill E, Pan W, Murphy S, Doshi-Velez F (2021) Power constrained bandits. Jung K, Yeung S, Sendak M, Sjoding M, Ranganath R, eds. Proc. 6th Machine Learn. Healthcare Conf., vol. 149 (PMLR, New York), 209–259.Google Scholar
Zhan R, Hadad V, Hirshberg DA, Athey S (2021) Off-policy evaluation via adaptive weighting with data from contextual bandits. Proc. 27th ACM SIGKDD Conf. Knowledge Discovery Data Mining (Association for Computing Machinery, New York), 2125–2135.Google Scholar
Zhang K, Janson L, Murphy S (2020) Inference for batched bandits. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 9818–9829.Google Scholar
Zhong Z, Cheung WC, Tan VY (2021) On the pareto frontier of regret minimization and best arm identification in stochastic bandits. Preprint, submitted October 16, https://arxiv.org/abs/2110.08627.Google Scholar
Zhou Z, Athey S, Wager S (2023) Offline multi-action policy learning: Generalization and optimization. Oper. Res. 71(1):148–183.Link, Google Scholar

Volume 71, Issue 6

June 2025

Pages iv-vi, 4533-5418

Article Information

Supplemental Material

Metrics

Information

Received:February 13, 2023
Accepted:January 28, 2024
Published Online:September 20, 2024

Cite as

David Simchi-Levi; , Chonghuan Wang (2024) Multi-armed Bandit Experimental Design: Online Decision-Making and Adaptive Inference. Management Science 71(6):4828-4846.

https://doi.org/10.1287/mnsc.2023.00492

Keywords

Acknowledgments

The authors also thank the department editor George Shanthikumar, the anonymous associate editor, and three anonymous referees whose comments improved the manuscript.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Multi-armed Bandit Experimental Design: Online Decision-Making and Adaptive Inference

References

Volume 71, Issue 6

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News