Multi-armed Bandit Experimental Design: Online Decision-Making and Adaptive Inference
References
- (2022) Batched dueling bandits. Chaudhuri K, Jegelka S, Song Le, Szepesvari C, Niu G, Sabato S, eds. Proc. 39th Internat. Conf. Machine Learn., vol. 162 (PMLR, New York), 89–110.Google Scholar
- (1995) Sample mean based index policies by o (log n) regret for the multi-armed bandit problem. Adv. Appl. Probab. 27(4):1054–1078.Crossref, Google Scholar
- (1988) Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Trans. Automatic Control 33(10):899–906.Crossref, Google Scholar
- (2017) Estimating average causal effects under general interference, with application to a social network experiment. Ann. Appl. Statist. 11(4):1912–1947.Crossref, Google Scholar
- (2019) Sequential patient recruitment and allocation for adaptive clinical trials. Chaudhuri K, Sugiyama M, eds. Proc. 22nd Internat. Conf. Artificial Intelligence Statist., vol. 89 (PMLR, New York), 1891–1900.Google Scholar
- (2021) Policy learning with observational data. Econometrica 89(1):133–161.Crossref, Google Scholar
- (2018) Exact p-values for network interference. J. Amer. Statist. Assoc. 113(521):230–240.Crossref, Google Scholar
- (2002) Using confidence bounds for exploitation-exploration trade-offs. J. Machine Learn. Res. 3(November):397–422.Google Scholar
- (2002a) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2–3):235–256.Crossref, Google Scholar
- (2002b) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.Crossref, Google Scholar
- (2020) Near-optimal ab testing. Management Sci. 66(10):4477–4495.Link, Google Scholar
- (2021) Panel experiments and dynamic causal effects: A finite population perspective. Quant. Econom. 12(4):1171–1196.Crossref, Google Scholar
- (2023) Design and analysis of switchback experiments. Management Sci. 69(7):3759–3777.Google Scholar
- (2024) An adaptive targeted field experiment: Job search assistance for refugees in Jordan. J. Eur. Econom. Assoc. 22(2):781–836.Google Scholar
- (2011) Upper-confidence-bound algorithms for active learning in multi-armed bandits. Kivinen J, Szepesvári C, Ukkonen E, Zeugmann T, eds. Internat. Conf. Algorithmic Learn. Theory (Springer Berlin Heidelberg, Berlin, Heidelberg), 189–203.Google Scholar
- (2023) Flexible and efficient contextual bandits with heterogeneous treatment effect oracles. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 7190–7212.Google Scholar
- (2013) Online learning with switching costs and other adaptive adversaries. Proc. 26th Internat. Conf. Neural Inform. Processing Systems, vol. 1 (Curran Associates Inc., Red Hook, NY), 1160–1168.Google Scholar
- (2006) Sequential generalized likelihood ratios and adaptive treatment allocation for optimal sequential selection. Sequential Anal. 25(2):179–201.Crossref, Google Scholar
- (2011) An empirical evaluation of Thompson sampling. Proc. 24th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 2249–2257.Google Scholar
- (2022) Debiasing samples from online learning using bootstrap. Proc. 25th Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 8514–8533.Google Scholar
- (2024) Online multi-armed bandits with adaptive inference. Proc. 35th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 1939–1951.Google Scholar
- (2017) Estimation considerations in contextual bandits. Preprint, submitted November 19, https://arxiv.org/abs/1711.07077.Google Scholar
- (2019) Balanced linear contextual bandits. Proc. Thirty-Third AAAI Conf. Artificial Intelligence Thirty-First Innovative Appl. Artificial Intelligence Conf. Ninth AAAI Sympos. Educational Adv. Artificial Intelligence (AAAI Press, Palo Alto, CA).Google Scholar
- (2014) Doubly robust policy evaluation and optimization. Statist. Sci. 29(4):485–511.Crossref, Google Scholar
- (2017) Design and analysis of experiments in networks: Reducing bias from interference. J. Causal Inference 5(1):20150021.Crossref, Google Scholar
- (2017) Trading off rewards and errors in multi-armed bandits. Singh A, Zhu J, eds. Proc. 20th Internat. Conf. Artificial Intelligence Statist., vol. 54 (PMLR, New York), 709–717.Google Scholar
- (2021) Regret bounds for batched bandits. Proc. Conf. AAAI Artificial Intelligence 35(8):7340–7348.Crossref, Google Scholar
- (2021) The fragility of optimized bandit algorithms. Preprint, submitted September 28, https://arxiv.org/abs/2109.13595.Google Scholar
- (2022b) Markovian interference in experiments. Proc. 36th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY).Google Scholar
- (2022a) Synthetically controlled bandits. Preprint, submitted February 14, https://arxiv.org/abs/2202.07079.Google Scholar
- (2022) Adaptive experimental design and counterfactual inference. Preprint, submitted October 25, https://arxiv.org/abs/2210.14369.Google Scholar
- (1975) On tail probabilities for martingales. Ann. Probab. 3(1):100–118.Google Scholar
- (2012) Best arm identification: A unified approach to fixed budget and fixed confidence. Proc. 25th Internat. Conf. Neural Inform. Processing Systems, vol. 2 (Curran Associates Inc., Red Hook, NY), 3212–3220.Google Scholar
- (2019) Batched multi-armed bandits problem. Proc. 33rd Internat. Conf. Neural Information Processing Systems (Curran Associates Inc., Red Hook, NY).Google Scholar
- (2011) The KL-UCB algorithm for bounded stochastic bandits and beyond. Proc. 24th Annual Conf. Learn. Theory (JMLR Workshop and Conference Proceedings), 359–376.Google Scholar
- (2016) Optimal best arm identification with fixed confidence. Feldman V, Rakhlin A, Shamir O, eds. 29th Annual Conf. Learn. Theory, vol. 49 (PMLR, New York), 998–1027.Google Scholar
- (2020) Adaptive experimental design with temporal interference: A maximum likelihood approach. Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 15054–15064.Google Scholar
- (2017) Action centered contextual bandits. Proc. 31st Internat. Conf. Neural Information Processing Systems (Curran Associates Inc., Red Hook, NY), 5979–5987.Google Scholar
- (2021) Confidence intervals for policy evaluation in adaptive experiments. Proc. Natl. Acad. Sci. USA 118(15):e2014602118.Crossref, Google Scholar
- (2011) Adaptive experimental design using the propensity score. J. Bus. Econom. Statist. 29(1):96–108.Crossref, Google Scholar
- (2023) Design-based inference for multi-arm bandits. Preprint, submitted Feburary 27, https://arxiv.org/abs/2302.14136.Google Scholar
- (2022) Online statistical inference for matrix contextual bandit. Preprint, submitted December 21, https://arxiv.org/abs/2212.11385.Google Scholar
- (2001) Instrumental variables, selection models, and tight bounds on the average treatment effect. Lechner M, Pfeiffer F, eds. Econometric Evaluation of Labour Market Policies, ZEW Economic Studies, vol. 13 (Physica, Heidelberg, Germany), 1–15.Crossref, Google Scholar
- (2003) Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71(4):1161–1189.Crossref, Google Scholar
- Imbens GW, Angrist JD (1994) Identification and estimation of local average treatment effects. Econometrica 62(2):467–475.Google Scholar
- (1982) Asymptotically optimal procedures for sequential adaptive selection of the best of several normal means. Gupta SS, Berger JO, eds. Statistical Decision Theory and Related Topics III (Elsevier, Amsterdam), 55–86.Crossref, Google Scholar
- (2021) Almost optimal anytime algorithm for batched multi-armed bandits. Internat. Conf. Machine Learn. (PMLR, New York), 5065–5073.Google Scholar
- (2015) Always valid inference: Bringing sequential analysis to a/b testing. Preprint, submitted December 15, https://arxiv.org/abs/1512.04922.Google Scholar
- (2022) Experimental design in two-sided platforms: An analysis of bias. Management Sci. 68(10):7069–7089.Link, Google Scholar
- (2018) Confounding-robust policy improvement. Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 9289–9299.Google Scholar
- (2024) A closer look at the worst-case behavior of multi-armed bandit algorithms. Proc. 35th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 8807–8819.Google Scholar
- (2021) Adaptive treatment assignment in experiments for policy choice. Econometrica 89(1):113–132.Crossref, Google Scholar
- (2020) Efficient adaptive experimental design for average treatment effect estimation. Preprint, submitted Feburary 13, https://arxiv.org/abs/2002.05308.Google Scholar
- (2012) Thompson sampling: An asymptotically optimal finite-time analysis. Bshouty NH, Stoltz G, Vayatis N, Zeugmann T, eds. Internat. Conf. Algorithmic Learn. Theory (Springer Berlin Heidelberg, Berlin, Heidelberg), 199–213.Google Scholar
- (2020) Optimal doubly robust estimation of heterogeneous causal effects. Preprint, submitted April 29, https://arxiv.org/abs/2004.14497.Google Scholar
- (2021) Adaptive normalization for IPW estimation. Preprint, submitted June 14, https://arxiv.org/abs/2106.07695.Google Scholar
- (2019) Contextual multi-armed bandit algorithm for semiparametric reward model. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learning, vol. 97 (PMLR, New York), 3389–3397.Google Scholar
- (2018) Semiparametric contextual bandits. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn. (PMLR, New York), 2776–2785.Google Scholar
- (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.Crossref, Google Scholar
- (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
- (2015) Toward minimax off-policy value estimation. Lebanon G, Vishwanathan SVN, eds. Proc. Eighteenth Internat. Conf. Artificial Intelligence Statist., vol. 38 (PMLR, New York), 608–616.Google Scholar
- (2016) Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Ann. Statist. 44(2):713–742.Crossref, Google Scholar
- (2004) The sample complexity of exploration in the multi-armed bandit problem. J. Machine Learn. Res. 5(June):623–648.Google Scholar
- (2018) Why adaptively collected data have negative bias and how to correct for it. Storkey A, Perez-Cruz F, eds. Proc. Twenty-First Internat. Conf. Artificial Intelligence Statist., vol. 84 (PMLR, New York), 1261–1269.Google Scholar
- (2021) Adaptive experimental design: Prospects and applications in political science. Amer. J. Political Sci. 65(4):826–844.Crossref, Google Scholar
- (2016) Batched bandit problems. Ann. Statist. 44(2):660–681.Crossref, Google Scholar
- (2022) Adaptivity and confounding in multi-armed bandit experiments. Preprint, submitted Feburary 18, https://arxiv.org/abs/2202.09036.Google Scholar
- (2016) An information-theoretic analysis of Thompson sampling. J. Machine Learn. Res. 17(1):2442–2471.Google Scholar
- (2015) Multi-armed bandit experiments in the online service economy. Appl. Stochastic Models Bus. Indust. 31(1):37–45.Crossref, Google Scholar
- (2013) Evaluation and analysis of the performance of the EXP3 algorithm in stochastic environments. Deisenroth MP, Szepesvári C, Peters J, eds. Proc. Tenth Eur. Workshop Reinforcement Learn., vol. 24 (PMLR, New York), 103–116.Google Scholar
- (2012) Pac-Bayes-Bernstein inequality for martingales and its application to multiarmed bandits. Proc. Workshop On-Line Trading Exploration Exploitation 2 (JMLR Workshop and Conference Proceedings), 98–111.Google Scholar
- (2023) Phase transitions in bandits with switching constraints. Management Sci. 69(12):7182–7201.Google Scholar
- (2022) A simple and optimal policy design with safety against heavy-tailed risk for multi-armed bandits. Preprint, submitted June 7, https://arxiv.org/abs/2206.02969.Google Scholar
- (2015) Batch learning from logged bandit feedback through counterfactual risk minimization. J. Machine Learn. Res. 16(1):1731–1755.Google Scholar
- (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.Crossref, Google Scholar
- (2015) Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Statist Sci. 30(2):199–215.Crossref, Google Scholar
- (2021a) Diffusion asymptotics for sequential experiments. Preprint, submitted January 25, https://arxiv.org/abs/2101.09855.Google Scholar
- (2021b) Experimenting in equilibrium. Management Sci. 67(11):6694–6715.Link, Google Scholar
- (2019) High-Dimensional Statistics: A Non-Asymptotic Viewpoint, vol. 48 (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
- (2017) Optimal and adaptive off-policy evaluation in contextual bandits. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 3589–3597.Google Scholar
- (2024) Optimal experimental design for staggered rollouts. Management Sci. 70(8):5317–5336.Google Scholar
- (2013) Estimation bias in multi-armed bandit algorithms for search advertising. Proc. 26th Internat. Conf. Neural Inform. Processing Systems, vol. 2 (Curran Associates Inc., Red Hook, NY), 2400–2408.Google Scholar
- (2017) A framework for multi-A(rmed)/B(andit) testing with online FDR control. Proc. 31st Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 5959–5968.Google Scholar
- (2021) Power constrained bandits. Jung K, Yeung S, Sendak M, Sjoding M, Ranganath R, eds. Proc. 6th Machine Learn. Healthcare Conf., vol. 149 (PMLR, New York), 209–259.Google Scholar
- (2021) Off-policy evaluation via adaptive weighting with data from contextual bandits. Proc. 27th ACM SIGKDD Conf. Knowledge Discovery Data Mining (Association for Computing Machinery, New York), 2125–2135.Google Scholar
- (2020) Inference for batched bandits. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 9818–9829.Google Scholar
- (2021) On the pareto frontier of regret minimization and best arm identification in stochastic bandits. Preprint, submitted October 16, https://arxiv.org/abs/2110.08627.Google Scholar
- (2023) Offline multi-action policy learning: Generalization and optimization. Oper. Res. 71(1):148–183.Link, Google Scholar

