Policy Learning with Adaptively Collected Data

Published Online: https://doi.org/10.1287/mnsc.2023.4921
