Policy Learning with Adaptively Collected Data

