Abbasi-Yadkori Y (2013) Online learning for linearly parametrized control problems. Unpublished PhD thesis, University of Alberta, Edmonton.
Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Adv. Neural Inform. Processing Systems 24.
Agrawal K, Athey S, Kanodia A, Palikot E (2022) Personalized recommendations in edtech: Evidence from a randomized controlled trial. Preprint, submitted August 30, https://arxiv.org/abs/2208.13940.
Agrawal S, Goyal N (2013) Thompson sampling for contextual bandits with linear payoffs. Internat. Conf. Machine Learn. (PMLR, New York), 127–135.
Armitage P (1960) Sequential Medical Trials (Blackwell Scientific Publication, Oxford, UK).
Athey S, Wager S (2021) Policy learning with observational data. Econometrica 89(1):133–161.
Bassen J, Balaji B, Schaarschmidt M, Thille C, Painter J, Zimmaro D, Games A, Fast E, Mitchell JC (2020) Reinforcement learning for the adaptive scheduling of educational activities. Proc. 2020 CHI Conf. Human Factors Comput. Systems (ACM, New York), 1–12.
Bastani H (2021) Predicting with proxies: Transfer learning in high dimension. Management Sci. 67(5):2964–2984.
Bastani H, Bayati M (2020) Online decision making with high-dimensional covariates. Oper. Res. 68(1):276–294.
Bembom O, van der Laan MJ (2008) Data-adaptive selection of the truncation level for inverse-probability-of-treatment-weighted estimators. Working paper, University of California, Berkeley, Division of Biostatistics, Berkeley, CA.
Bennett A, Kallus N (2020) Efficient policy learning from surrogate-loss classification reductions. Internat. Conf. Machine Learn. (PMLR, New York), 788–798.
Bercu B, Delyon B, Rio E (2015) Concentration Inequalities for Sums and Martingales (Springer, New York).
Bertsimas D, Kallus N, Weinstein AM, Zhuo YD (2017) Personalized diabetes management using electronic medical records. Diabetes Care 40(2):210–217.
Besbes O, Zeevi A (2009) Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Oper. Res. 57(6):1407–1420.
Bibaut A, Kallus N, Dimakopoulou M, Chambaz A, van der Laan M (2021) Risk minimization from adaptively collected data: Guarantees for supervised and policy learning. Adv. Neural Inform. Processing Systems, vol. 34.
Bottou L, Peters J, Quiñonero-Candela J, Charles DX, Chickering DM, Portugaly E, Ray D, Simard P, Snelson E (2013) Counterfactual reasoning and learning systems: The example of computational advertising. J. Machine Learn. Res. 14(11).
Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Preprint, submitted April 25, https://arxiv.org/abs/1204.5721.
Cai D, He X (2011) Manifold adaptive experimental design for text categorization. IEEE Trans. Knowledge Data Engrg. 24(4):707–719.
Chernozhukov V, Demirer M, Lewis G, Syrgkanis V (2019) Semi-parametric efficient policy learning with continuous actions. Adv. Neural Inform. Processing Systems, vol. 32.
Chu W, Li L, Reyzin L, Schapire R (2011) Contextual bandits with linear payoff functions. Proc. 14th Internat. Conf. Artificial Intelligence Statist. (JMLR Workshop and Conference Proceedings, New York), 208–214.
Collins LM, Murphy SA, Strecher V (2007) The multiphase optimization strategy (MOST) and the sequential multiple assignment randomized trial (SMART): New methods for more potent eHealth interventions. Amer. J. Preventive Medicine 32(5):S112–S118.
Dani V, Hayes TP, Kakade SM (2008) Stochastic linear optimization under bandit feedback. 21st Annual Conf. Learn. Theory (PMLR, New York).
Dimakopoulou M, Zhou Z, Athey S, Imbens G (2017) Estimation considerations in contextual bandits. Preprint, submitted November 19, https://arxiv.org/abs/1711.07077.
Duchi J (2016) Lecture notes for Statistics 311/Electrical Engineering 377. Accessed February 2023, https://stanford.edu/class/stats311/Lectures/full_notes.pdf.
Dudík M, Langford J, Li L (2011) Doubly robust policy evaluation and learning. Preprint, submitted March 23, https://arxiv.org/abs/1103.4601.
Farias VF, Li AA (2019) Learning preferences with side information. Management Sci. 65(7):3131–3149.
Fukuoka Y, Zhou M, Vittinghoff E, Haskell W, Goldberg K, Aswani A (2018) Objectively measured baseline physical activity patterns in women in the MPED trial: Cluster analysis. JMIR Public Health Surveillance 4(1):e10.
Goldenshluger A, Zeevi A (2013) A linear response bandit problem. Stochastic Systems 3(1):230–261.
Hadad V, Hirshberg DA, Zhan R, Wager S, Athey S (2021) Confidence intervals for policy evaluation in adaptive experiments. Proc. Natl. Acad. Sci. USA 118(15):e2014602118.
Hoiles W, van der Schaar M (2016) Bounded off-policy evaluation with missing data for course recommendation and curriculum design. Internat. Conf. Machine Learn. (PMLR, New York), 1596–1604.
Horvitz DG, Thompson DJ (1952) A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc. 47(260):663–685.
Imbens GW (2004) Nonparametric estimation of average treatment effects under exogeneity: A review. Rev. Econom. Statist. 86(1):4–29.
Jin Y (2022) Upper bounds on the Natarajan dimensions of some function classes. Preprint, submitted September 15, https://arxiv.org/abs/2209.07015.
Jin Y, Yang Z, Wang Z (2020) Is pessimism provably efficient for offline RL? Preprint, submitted December 30, https://arxiv.org/abs/2012.15085.
Joachims T, Swaminathan A, de Rijke M (2018) Deep learning with logged bandit feedback. Internat. Conf. Learn. Representations.
Kallus N (2018) Balanced policy evaluation and learning. Adv. Neural Inform. Processing Systems, vol. 31.
Kallus N, Udell M (2016) Dynamic assortment personalization in high dimensions. Preprint, submitted October 18, https://arxiv.org/abs/1610.05604.
Kallus N, Zhou A (2018) Confounding-robust policy improvement. Preprint, submitted May 22, https://arxiv.org/abs/1805.08593.
Karimi M, Jannach D, Jugovac M (2018) News recommender systems – Survey and roads ahead. Inform. Processing Management 54(6):1203–1227.
Kim ES, Herbst RS, Wistuba II, Lee JJ, Blumenschein GR, Tsao A, Stewart DJ, et al. (2011) The BATTLE trial: Personalizing therapy for lung cancer. Cancer Discovery 1(1):44–53.
Kitagawa T, Tetenov A (2018) Who should be treated? Empirical welfare maximization methods for treatment choice. Econometrica 86(2):591–616.
Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.
Lan AS, Baraniuk RG (2016) A contextual bandits framework for personalized learning action selection. EDM, 424–429.
Lee D, Oh B, Seo S, Lee KH (2020) News recommendation with topic-enriched knowledge graphs. Proc. 29th ACM Internat. Conf. Inform. Knowledge Management, 695–704.
Levine S, Kumar A, Tucker G, Fu J (2020) Offline reinforcement learning: Tutorial, review, and perspectives on open problems. Preprint, submitted May 4, https://arxiv.org/abs/2005.01643.
Li L, Lu Y, Zhou D (2017) Provably optimal algorithms for generalized linear contextual bandits. Proc. Internat. Conf. Machine Learning (PMLR, New York), 2071–2080.
Li L, Chu W, Langford J, Schapire RE (2010) A contextual-bandit approach to personalized news article recommendation. Proc. 19th Internat. Conf. World Wide Web (ACM, New York), 661–670.
Li L, Chu W, Langford J, Wang X (2011) Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. Proc. Fourth ACM Internat. Conf. Web Search Data Mining (ACM, New York), 297–306.
Luedtke AR, van der Laan MJ (2016) Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Ann. Statist. 44(2):713–742.
Mandel T, Liu YE, Levine S, Brunskill E, Popovic Z (2014) Offline policy evaluation across representations with applications to educational games. AAMAS, 1077–1084.
Murphy SA (2003) Optimal dynamic treatment regimes. J. Roy. Statist. Soc. Ser. B Statist. Methodology 65(2):331–355.
Murphy SA (2005) An experimental design for the development of adaptive treatment strategies. Statist. Medicine 24(10):1455–1481.
Nie X, Tian X, Taylor J, Zou J (2018) Why adaptively collected data have negative bias and how to correct for it. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 1261–1269.
Offer-Westort M, Coppock A, Green DP (2019) Adaptive experimental design: Prospects and applications in political science. Preprint, submitted June 5, https://dx.doi.org/10.2139/ssrn.3364402.
Offer-Westort M, Rosenzweig LR, Athey S (2021) Optimal policies to battle the coronavirus “infodemic” among social media users in Sub-Saharan Africa. OSF Registered Study, Open Science.
Rakhlin A, Sridharan K, Tewari A (2015) Sequential complexities and uniform martingale laws of large numbers. Probab. Theory Related Fields 161(1–2):111–153.
Rigollet P, Zeevi A (2010) Nonparametric bandits with covariates. Preprint, submitted March 8, https://arxiv.org/abs/1003.1630.
Robins JM, Rotnitzky A, Zhao LP (1994) Estimation of regression coefficients when some regressors are not always observed. J. Amer. Statist. Assoc. 89(427):846–866.
Russo D, Van Roy B (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.
Russo D, Van Roy B, Kazerouni A, Osband I, Wen Z (2017) A tutorial on Thompson sampling. Preprint, submitted July 7, https://arxiv.org/abs/1707.02038.
Sachdeva N, Su Y, Joachims T (2020) Off-policy bandits with deficient support. Proc. 26th ACM SIGKDD Internat. Conf. Knowledge Discovery Data Mining, 965–975.
Schnabel T, Bennett PN, Joachims T (2019) Shaping feedback data in recommender systems with interventions based on information foraging theory. Proc. 12th ACM Internat. Conf. Web Search Data Mining, 546–554.
Schnabel T, Amershi S, Bennett PN, Bailey P, Joachims T (2020) The impact of more transparent interfaces on behavior in personalized recommendation. Proc. 43rd Internat. ACM SIGIR Conf. Res. Development Inform. Retrieval, 991–1000.
Schnabel T, Swaminathan A, Singh A, Chandak N, Joachims T (2016) Recommendations as treatments: Debiasing learning and evaluation. Internat. Conf. Machine Learn. (PMLR, New York), 1670–1679.
Shin J, Ramdas A, Rinaldo A (2019) Are sample means in multi-armed bandits positively or negatively biased? Preprint, submitted May 27, https://arxiv.org/abs/1905.11397.
Simon R (1977) Adaptive treatment assignment methods and clinical trials. Biometrics 33(4):743–749.
Su Y, Dimakopoulou M, Krishnamurthy A, Dudík M (2020) Doubly robust off-policy evaluation with shrinkage. Internat. Conf. Machine Learn. (PMLR, New York), 9167–9176.
Su Y, Wang L, Santacatterina M, Joachims T (2019) CAB: Continuous adaptive blending for policy evaluation and learning. Internat. Conf. Machine Learn. (PMLR, New York), 6005–6014.
Sverdrup E, Kanodia A, Zhou Z, Athey S, Wager S (2020) policytree: Policy learning via doubly robust empirical welfare maximization over trees. J. Open Source Software 5(50):2232.
Swaminathan A, Joachims T (2015a) Batch learning from logged bandit feedback through counterfactual risk minimization. J. Machine Learn. Res. 16(1):1731–1755.
Swaminathan A, Joachims T (2015b) Counterfactual risk minimization: Learning from logged bandit feedback. Internat. Conf. Machine Learn. (PMLR, New York), 814–823.
Swaminathan A, Joachims T (2015c) The self-normalized estimator for counterfactual learning. Adv. Neural Inform. Processing Systems, 3231–3239.
Swaminathan A, Krishnamurthy A, Agarwal A, Dudík M, Langford J, Jose D, Zitouni I (2016) Off-policy evaluation for slate recommendation. Preprint, submitted May 16, https://arxiv.org/abs/1605.04812.
Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294.
Tsybakov AB (2008) Introduction to Nonparametric Estimation (Springer Science & Business Media, Berlin).
van de Geer SA (2000) Empirical Processes in M-estimation, vol. 6 (Cambridge University Press, Cambridge, UK).
Vanschoren J, van Rijn JN, Bischl B, Torgo L (2013) OpenML: Networked science in machine learning. SIGKDD Explorations 15(2):49–60.
de la Peña VH, Giné E (1999) Decoupling: From Dependence to Independence (Springer, New York).
Wang YX, Agarwal A, Dudík M (2017) Optimal and adaptive off-policy evaluation in contextual bandits. Internat. Conf. Machine Learn. (PMLR, New York), 3589–3597.
Zeng C, Wang Q, Mokhtari S, Li T (2016) Online context-aware recommendation with time varying multi-armed bandit. Proc. 22nd ACM SIGKDD Internat. Conf. Knowledge Discovery Data Mining (ACM, New York), 2025–2034.
Zhan R, Hadad V, Hirshberg DA, Athey S (2021) Off-policy evaluation via adaptive weighting with data from contextual bandits. Proc. 27th ACM SIGKDD Conf. Knowledge Discovery Data Mining (ACM, New York), 2125–2135.
Zhang B, Tsiatis AA, Davidian M, Zhang M, Laber E (2012) Estimating optimal treatment regimes from a classification perspective. Statist. 1(1):103–114.
Zhao YQ, Zeng D, Laber EB, Song R, Yuan M, Kosorok MR (2015) Doubly robust learning for estimating individualized treatment with censored data. Biometrika 102(1):151–168.
Zhou X, Mayer-Hamblett N, Khan U, Kosorok MR (2017) Residual weighted learning for estimating individualized treatment rules. J. Amer. Statist. Assoc. 112(517):169–187.
Zhou Z, Athey S, Wager S (2023) Offline multi-action policy learning: Generalization and optimization. Oper. Res. 71(1):148–183.

Volume 70, Issue 8

August 2024

Pages v-vii, 4953-5625, iii-v

Information

Received: June 09, 2021
Accepted: January 21, 2023
Published Online: October 16, 2023

Cite as

Ruohan Zhan, Zhimei Ren, Susan Athey, Zhengyuan Zhou (2023) Policy Learning with Adaptively Collected Data. Management Science 70(8):5270-5297.

https://doi.org/10.1287/mnsc.2023.4921

Acknowledgments

The authors are grateful for helpful discussions with Vitor Hadad, David A. Hirshberg, Sanath Kumar Krishnamurthy, Stefan Wager, and Ruoxuan Xiong and for constructive feedback from the editors and referees. Authors R. Zhan and Z. Ren contributed equally.
