Multi-armed Bandit Experimental Design: Online Decision-Making and Adaptive Inference

Published Online:https://doi.org/10.1287/mnsc.2023.00492

References

  • Agarwal A, Ghuge R, Nagarajan V (2022) Batched dueling bandits. Chaudhuri K, Jegelka S, Song Le, Szepesvari C, Niu G, Sabato S, eds. Proc. 39th Internat. Conf. Machine Learn., vol. 162 (PMLR, New York), 89–110.Google Scholar
  • Agrawal R (1995) Sample mean based index policies by o (log n) regret for the multi-armed bandit problem. Adv. Appl. Probab. 27(4):1054–1078.CrossrefGoogle Scholar
  • Agrawal R, Hedge M, Teneketzis D (1988) Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Trans. Automatic Control 33(10):899–906.CrossrefGoogle Scholar
  • Aronow PM, Samii C (2017) Estimating average causal effects under general interference, with application to a social network experiment. Ann. Appl. Statist. 11(4):1912–1947.CrossrefGoogle Scholar
  • Atan O, Zame WR, van der Schaar M (2019) Sequential patient recruitment and allocation for adaptive clinical trials. Chaudhuri K, Sugiyama M, eds. Proc. 22nd Internat. Conf. Artificial Intelligence Statist., vol. 89 (PMLR, New York), 1891–1900.Google Scholar
  • Athey S, Wager S (2021) Policy learning with observational data. Econometrica 89(1):133–161.CrossrefGoogle Scholar
  • Athey S, Eckles D, Imbens GW (2018) Exact p-values for network interference. J. Amer. Statist. Assoc. 113(521):230–240.CrossrefGoogle Scholar
  • Auer P (2002) Using confidence bounds for exploitation-exploration trade-offs. J. Machine Learn. Res. 3(November):397–422.Google Scholar
  • Auer P, Cesa-Bianchi N, Fischer P (2002a) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2–3):235–256.CrossrefGoogle Scholar
  • Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2002b) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.CrossrefGoogle Scholar
  • Bhat N, Farias VF, Moallemi CC, Sinha D (2020) Near-optimal ab testing. Management Sci. 66(10):4477–4495.LinkGoogle Scholar
  • Bojinov I, Rambachan A, Shephard N (2021) Panel experiments and dynamic causal effects: A finite population perspective. Quant. Econom. 12(4):1171–1196.CrossrefGoogle Scholar
  • Bojinov I, Simchi-Levi D, Zhao J (2023) Design and analysis of switchback experiments. Management Sci. 69(7):3759–3777.Google Scholar
  • Caria S, Kasy M, Quinn S, Shami S, Teytelboym A (2024) An adaptive targeted field experiment: Job search assistance for refugees in Jordan. J. Eur. Econom. Assoc. 22(2):781–836.Google Scholar
  • Carpentier A, Lazaric A, Ghavamzadeh M, Munos R, Auer P (2011) Upper-confidence-bound algorithms for active learning in multi-armed bandits. Kivinen J, Szepesvári C, Ukkonen E, Zeugmann T, eds. Internat. Conf. Algorithmic Learn. Theory (Springer Berlin Heidelberg, Berlin, Heidelberg), 189–203.Google Scholar
  • Carranza AG, Krishnamurthy SK, Athey S (2023) Flexible and efficient contextual bandits with heterogeneous treatment effect oracles. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 7190–7212.Google Scholar
  • Cesa-Bianchi N, Dekel O, Shamir O (2013) Online learning with switching costs and other adaptive adversaries. Proc. 26th Internat. Conf. Neural Inform. Processing Systems, vol. 1 (Curran Associates Inc., Red Hook, NY), 1160–1168.Google Scholar
  • Chan HP, Lai TL (2006) Sequential generalized likelihood ratios and adaptive treatment allocation for optimal sequential selection. Sequential Anal. 25(2):179–201.CrossrefGoogle Scholar
  • Chapelle O, Li L (2011) An empirical evaluation of Thompson sampling. Proc. 24th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 2249–2257.Google Scholar
  • Chen N, Gao X, Xiong Y (2022) Debiasing samples from online learning using bootstrap. Proc. 25th Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 8514–8533.Google Scholar
  • Dimakopoulou M, Ren Z, Zhou Z (2024) Online multi-armed bandits with adaptive inference. Proc. 35th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 1939–1951.Google Scholar
  • Dimakopoulou M, Zhou Z, Athey S, Imbens G (2017) Estimation considerations in contextual bandits. Preprint, submitted November 19, https://arxiv.org/abs/1711.07077.Google Scholar
  • Dimakopoulou M, Zhou Z, Athey S, Imbens G (2019) Balanced linear contextual bandits. Proc. Thirty-Third AAAI Conf. Artificial Intelligence Thirty-First Innovative Appl. Artificial Intelligence Conf. Ninth AAAI Sympos. Educational Adv. Artificial Intelligence (AAAI Press, Palo Alto, CA).Google Scholar
  • Dudík M, Erhan D, Langford J, Li L (2014) Doubly robust policy evaluation and optimization. Statist. Sci. 29(4):485–511.CrossrefGoogle Scholar
  • Eckles D, Karrer B, Ugander J (2017) Design and analysis of experiments in networks: Reducing bias from interference. J. Causal Inference 5(1):20150021.CrossrefGoogle Scholar
  • Erraqabi A, Lazaric A, Valko M, Brunskill E, Liu YE (2017) Trading off rewards and errors in multi-armed bandits. Singh A, Zhu J, eds. Proc. 20th Internat. Conf. Artificial Intelligence Statist., vol. 54 (PMLR, New York), 709–717.Google Scholar
  • Esfandiari H, Karbasi A, Mehrabian A, Mirrokni V (2021) Regret bounds for batched bandits. Proc. Conf. AAAI Artificial Intelligence 35(8):7340–7348.CrossrefGoogle Scholar
  • Fan L, Glynn PW (2021) The fragility of optimized bandit algorithms. Preprint, submitted September 28, https://arxiv.org/abs/2109.13595.Google Scholar
  • Farias VF, Li AA, Peng T, Zheng AT (2022b) Markovian interference in experiments. Proc. 36th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY).Google Scholar
  • Farias V, Moallemi C, Peng T, Zheng A (2022a) Synthetically controlled bandits. Preprint, submitted February 14, https://arxiv.org/abs/2202.07079.Google Scholar
  • Fiez T, Gamez S, Chen A, Nassif H, Jain L (2022) Adaptive experimental design and counterfactual inference. Preprint, submitted October 25, https://arxiv.org/abs/2210.14369.Google Scholar
  • Freedman DA (1975) On tail probabilities for martingales. Ann. Probab. 3(1):100–118.Google Scholar
  • Gabillon V, Ghavamzadeh M, Lazaric A (2012) Best arm identification: A unified approach to fixed budget and fixed confidence. Proc. 25th Internat. Conf. Neural Inform. Processing Systems, vol. 2 (Curran Associates Inc., Red Hook, NY), 3212–3220.Google Scholar
  • Gao Z, Han Y, Ren Z, Zhou Z (2019) Batched multi-armed bandits problem. Proc. 33rd Internat. Conf. Neural Information Processing Systems (Curran Associates Inc., Red Hook, NY).Google Scholar
  • Garivier A, Cappé O (2011) The KL-UCB algorithm for bounded stochastic bandits and beyond. Proc. 24th Annual Conf. Learn. Theory (JMLR Workshop and Conference Proceedings), 359–376.Google Scholar
  • Garivier A, Kaufmann E (2016) Optimal best arm identification with fixed confidence. Feldman V, Rakhlin A, Shamir O, eds. 29th Annual Conf. Learn. Theory, vol. 49 (PMLR, New York), 998–1027.Google Scholar
  • Glynn PW, Johari R, Rasouli M (2020) Adaptive experimental design with temporal interference: A maximum likelihood approach. Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 15054–15064.Google Scholar
  • Greenewald K, Tewari A, Klasnja P, Murphy S (2017) Action centered contextual bandits. Proc. 31st Internat. Conf. Neural Information Processing Systems (Curran Associates Inc., Red Hook, NY), 5979–5987.Google Scholar
  • Hadad V, Hirshberg DA, Zhan R, Wager S, Athey S (2021) Confidence intervals for policy evaluation in adaptive experiments. Proc. Natl. Acad. Sci. USA 118(15):e2014602118.CrossrefGoogle Scholar
  • Hahn J, Hirano K, Karlan D (2011) Adaptive experimental design using the propensity score. J. Bus. Econom. Statist. 29(1):96–108.CrossrefGoogle Scholar
  • Ham DW, Bojinov I, Lindon M, Tingley M (2023) Design-based inference for multi-arm bandits. Preprint, submitted Feburary 27, https://arxiv.org/abs/2302.14136.Google Scholar
  • Han Q, Sun WW, Zhang Y (2022) Online statistical inference for matrix contextual bandit. Preprint, submitted December 21, https://arxiv.org/abs/2212.11385.Google Scholar
  • Heckman JJ, Vytlacil EJ (2001) Instrumental variables, selection models, and tight bounds on the average treatment effect. Lechner M, Pfeiffer F, eds. Econometric Evaluation of Labour Market Policies, ZEW Economic Studies, vol. 13 (Physica, Heidelberg, Germany), 1–15.CrossrefGoogle Scholar
  • Hirano K, Imbens GW, Ridder G (2003) Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71(4):1161–1189.CrossrefGoogle Scholar
  • Imbens GW, Angrist JD (1994) Identification and estimation of local average treatment effects. Econometrica 62(2):467–475.Google Scholar
  • Jennison C, Johnstone IM, Turnbull BW (1982) Asymptotically optimal procedures for sequential adaptive selection of the best of several normal means. Gupta SS, Berger JO, eds. Statistical Decision Theory and Related Topics III (Elsevier, Amsterdam), 55–86.CrossrefGoogle Scholar
  • Jin T, Tang J, Xu P, Huang K, Xiao X, Gu Q (2021) Almost optimal anytime algorithm for batched multi-armed bandits. Internat. Conf. Machine Learn. (PMLR, New York), 5065–5073.Google Scholar
  • Johari R, Pekelis L, Walsh DJ (2015) Always valid inference: Bringing sequential analysis to a/b testing. Preprint, submitted December 15, https://arxiv.org/abs/1512.04922.Google Scholar
  • Johari R, Li H, Liskovich I, Weintraub GY (2022) Experimental design in two-sided platforms: An analysis of bias. Management Sci. 68(10):7069–7089.LinkGoogle Scholar
  • Kallus N, Zhou A (2018) Confounding-robust policy improvement. Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 9289–9299.Google Scholar
  • Kalvit A, Zeevi A (2024) A closer look at the worst-case behavior of multi-armed bandit algorithms. Proc. 35th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 8807–8819.Google Scholar
  • Kasy M, Sautmann A (2021) Adaptive treatment assignment in experiments for policy choice. Econometrica 89(1):113–132.CrossrefGoogle Scholar
  • Kato M, Ishihara T, Honda J, Narita Y (2020) Efficient adaptive experimental design for average treatment effect estimation. Preprint, submitted Feburary 13, https://arxiv.org/abs/2002.05308.Google Scholar
  • Kaufmann E, Korda N, Munos R (2012) Thompson sampling: An asymptotically optimal finite-time analysis. Bshouty NH, Stoltz G, Vayatis N, Zeugmann T, eds. Internat. Conf. Algorithmic Learn. Theory (Springer Berlin Heidelberg, Berlin, Heidelberg), 199–213.Google Scholar
  • Kennedy EH (2020) Optimal doubly robust estimation of heterogeneous causal effects. Preprint, submitted April 29, https://arxiv.org/abs/2004.14497.Google Scholar
  • Khan S, Ugander J (2021) Adaptive normalization for IPW estimation. Preprint, submitted June 14, https://arxiv.org/abs/2106.07695.Google Scholar
  • Kim GS, Paik MC (2019) Contextual multi-armed bandit algorithm for semiparametric reward model. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learning, vol. 97 (PMLR, New York), 3389–3397.Google Scholar
  • Krishnamurthy A, Wu ZS, Syrgkanis V (2018) Semiparametric contextual bandits. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn. (PMLR, New York), 2776–2785.Google Scholar
  • Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.CrossrefGoogle Scholar
  • Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • Li L, Munos R, Szepesvári C (2015) Toward minimax off-policy value estimation. Lebanon G, Vishwanathan SVN, eds. Proc. Eighteenth Internat. Conf. Artificial Intelligence Statist., vol. 38 (PMLR, New York), 608–616.Google Scholar
  • Luedtke AR, Van Der Laan MJ (2016) Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Ann. Statist. 44(2):713–742.CrossrefGoogle Scholar
  • Mannor S, Tsitsiklis JN (2004) The sample complexity of exploration in the multi-armed bandit problem. J. Machine Learn. Res. 5(June):623–648.Google Scholar
  • Nie X, Tian X, Taylor J, Zou J (2018) Why adaptively collected data have negative bias and how to correct for it. Storkey A, Perez-Cruz F, eds. Proc. Twenty-First Internat. Conf. Artificial Intelligence Statist., vol. 84 (PMLR, New York), 1261–1269.Google Scholar
  • Offer-Westort M, Coppock A, Green DP (2021) Adaptive experimental design: Prospects and applications in political science. Amer. J. Political Sci. 65(4):826–844.CrossrefGoogle Scholar
  • Perchet V, Rigollet P, Chassang S, Snowberg E (2016) Batched bandit problems. Ann. Statist. 44(2):660–681.CrossrefGoogle Scholar
  • Qin C, Russo D (2022) Adaptivity and confounding in multi-armed bandit experiments. Preprint, submitted Feburary 18, https://arxiv.org/abs/2202.09036.Google Scholar
  • Russo D, Van Roy B (2016) An information-theoretic analysis of Thompson sampling. J. Machine Learn. Res. 17(1):2442–2471.Google Scholar
  • Scott SL (2015) Multi-armed bandit experiments in the online service economy. Appl. Stochastic Models Bus. Indust. 31(1):37–45.CrossrefGoogle Scholar
  • Seldin Y, Szepesvári C, Auer P, Abbasi-Yadkori Y (2013) Evaluation and analysis of the performance of the EXP3 algorithm in stochastic environments. Deisenroth MP, Szepesvári C, Peters J, eds. Proc. Tenth Eur. Workshop Reinforcement Learn., vol. 24 (PMLR, New York), 103–116.Google Scholar
  • Seldin Y, Cesa-Bianchi N, Auer P, Laviolette F, Shawe-Taylor J (2012) Pac-Bayes-Bernstein inequality for martingales and its application to multiarmed bandits. Proc. Workshop On-Line Trading Exploration Exploitation 2 (JMLR Workshop and Conference Proceedings), 98–111.Google Scholar
  • Simchi-Levi D, Xu Y (2023) Phase transitions in bandits with switching constraints. Management Sci. 69(12):7182–7201.Google Scholar
  • Simchi-Levi D, Zheng Z, Zhu F (2022) A simple and optimal policy design with safety against heavy-tailed risk for multi-armed bandits. Preprint, submitted June 7, https://arxiv.org/abs/2206.02969.Google Scholar
  • Swaminathan A, Joachims T (2015) Batch learning from logged bandit feedback through counterfactual risk minimization. J. Machine Learn. Res. 16(1):1731–1755.Google Scholar
  • Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.CrossrefGoogle Scholar
  • Villar SS, Bowden J, Wason J (2015) Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Statist Sci. 30(2):199–215.CrossrefGoogle Scholar
  • Wager S, Xu K (2021a) Diffusion asymptotics for sequential experiments. Preprint, submitted January 25, https://arxiv.org/abs/2101.09855.Google Scholar
  • Wager S, Xu K (2021b) Experimenting in equilibrium. Management Sci. 67(11):6694–6715.LinkGoogle Scholar
  • Wainwright MJ (2019) High-Dimensional Statistics: A Non-Asymptotic Viewpoint, vol. 48 (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • Wang YX, Agarwal A, Dudík M (2017) Optimal and adaptive off-policy evaluation in contextual bandits. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 3589–3597.Google Scholar
  • Xiong R, Athey S, Bayati M, Imbens G (2024) Optimal experimental design for staggered rollouts. Management Sci. 70(8):5317–5336.Google Scholar
  • Xu M, Qin T, Liu TY (2013) Estimation bias in multi-armed bandit algorithms for search advertising. Proc. 26th Internat. Conf. Neural Inform. Processing Systems, vol. 2 (Curran Associates Inc., Red Hook, NY), 2400–2408.Google Scholar
  • Yang F, Ramdas A, Jamieson KG, Wainwright M (2017) A framework for multi-A(rmed)/B(andit) testing with online FDR control. Proc. 31st Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 5959–5968.Google Scholar
  • Yao J, Brunskill E, Pan W, Murphy S, Doshi-Velez F (2021) Power constrained bandits. Jung K, Yeung S, Sendak M, Sjoding M, Ranganath R, eds. Proc. 6th Machine Learn. Healthcare Conf., vol. 149 (PMLR, New York), 209–259.Google Scholar
  • Zhan R, Hadad V, Hirshberg DA, Athey S (2021) Off-policy evaluation via adaptive weighting with data from contextual bandits. Proc. 27th ACM SIGKDD Conf. Knowledge Discovery Data Mining (Association for Computing Machinery, New York), 2125–2135.Google Scholar
  • Zhang K, Janson L, Murphy S (2020) Inference for batched bandits. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 9818–9829.Google Scholar
  • Zhong Z, Cheung WC, Tan VY (2021) On the pareto frontier of regret minimization and best arm identification in stochastic bandits. Preprint, submitted October 16, https://arxiv.org/abs/2110.08627.Google Scholar
  • Zhou Z, Athey S, Wager S (2023) Offline multi-action policy learning: Generalization and optimization. Oper. Res. 71(1):148–183.LinkGoogle Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.