Dynamic Batch Learning in High-Dimensional Sparse Linear Contextual Bandits

Zhimei Ren
Zhimei Ren
[email protected]
https://orcid.org/0000-0002-2872-5842
Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104;
Search for more papers by this author
,
Zhengyuan Zhou
Corresponding Author
Zhengyuan Zhou
[email protected]
https://orcid.org/0000-0002-0005-9411
Department of Technology, Operation and Statistics, Stern School of Business, New York University, New York, New York 10012
Search for more papers by this author

Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104;

Search for more papers by this author

Zhengyuan Zhou

Corresponding Author

Zhengyuan Zhou

[email protected]

https://orcid.org/0000-0002-0005-9411

Department of Technology, Operation and Statistics, Stern School of Business, New York University, New York, New York 10012

Search for more papers by this author

Published Online:19 Oct 2023https://doi.org/10.1287/mnsc.2023.4895

References

Abbasi-Yadkori Y (2012) Online learning for linearly parametrized control problems. Ph.D. Dissertation. University of Alberta, CAN. Advisor(s) Csaba Szepesvari.Google Scholar
Agrawal S, Goyal N (2013a) Further optimal regret bounds for Thompson sampling. Carvalho CM, Ravikumar P, eds. Artificial Intelligence and Statistics (PMLR, Cambridge, MA), 99–107.Google Scholar
Agrawal S, Goyal N (2013b) Thompson sampling for contextual bandits with linear payoffs. Dasgupta S, McAllester D, eds. Proc. Internat. Conf. on Machine Learn. (PMLR, Cambridge, MA), 127–135.Google Scholar
Auer P (2002) Using confidence bounds for exploitation-exploration trade-offs. J. Machine Learn. Res. 3(Nov):397–422.Google Scholar
Ban GY, Rudin C (2019) The big data newsvendor: Practical insights from machine learning. Oper. Res. 67(1):90–108.Link, Google Scholar
Bastani H, Bayati M (2020) Online decision making with high-dimensional covariates. Oper. Res. 68(1):276–294.Link, Google Scholar
Bastani H, Bastani O, Kim C (2017) Interpreting predictive models for human-in-the-loop analytics. Preprint, submitted May 23, https://arxiv.org/abs/ 1705.08504.Google Scholar
Bastani H, Bayati M, Khosravi K (2021) Mostly exploration-free algorithms for contextual bandits. Management Sci. 67(3):1329–1349.Link, Google Scholar
Bayati M, Braverman M, Gillam M, Mack KM, Ruiz G, Smith MS, Horvitz E (2014) Data-driven decisions for reducing readmissions for heart failure: General methodology and case study. PLoS One 9(10):e109264.Crossref, Google Scholar
Belloni A, Chernozhukov V (2011) High dimensional sparse econometric models: An introduction. Inverse Problems and High-Dimensional Estimation (Springer, Berlin), 121–156.Crossref, Google Scholar
Belloni A, Chernozhukov V, Hansen C (2014) Inference on treatment effects after selection among high-dimensional controls. Rev. Econom. Stud. 81(2):608–650.Crossref, Google Scholar
Bertsimas D, Mersereau AJ (2007) A learning approach for interactive marketing to a customer segment. Oper. Res. 55(6):1120–1135.Link, Google Scholar
Bickel PJ, Ritov Y, Tsybakov AB (2009) Simultaneous analysis of lasso and dantzig selector. Ann. Statist. 37(4):1705–1732.Crossref, Google Scholar
Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.Google Scholar
Carpentier A, Munos R (2012) Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. Lawrence ND, Girolami M, eds. Artificial Intelligence and Statistics (PMLR, Cambridge, MA), 190–198.Google Scholar
Chow SC, Chang M (2008) Adaptive design methods in clinical trials—A review. Orphanet J. Rare Diseases 3(1):1–13.Crossref, Google Scholar
Chu W, Li L, Reyzin L, Schapire R (2011) Contextual bandits with linear payoff functions. Gordon G, Dunson D, Dudík M, eds. Proc. 14th Internat. Conf. on Artificial Intelligence and Statist. (PMLR, Cambridge, MA), 208–214.Google Scholar
Cover TM, Thomas JA (2006) Elements of Information Theory, 2nd ed. (Wiley, New York).Google Scholar
Ferreira KJ, Simchi-Levi D, Wang H (2018) Online network revenue management using thompson sampling. Oper. Res. 66(6):1586–1602.Link, Google Scholar
Filippi S, Cappe O, Garivier A, Szepesvári C (2010) Parametric bandits: The generalized linear case. Lafferty J, Williams C, Shawe-Taylor J, Zemel R, Culotta A, eds. Adv. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 586–594.Google Scholar
Gao Z, Han Y, Ren Z, Zhou Z (2019) Batched multi-armed bandits problem. Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, eds. Adv. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 501–511.Google Scholar
Goldenshluger A, Zeevi A (2013) A linear response bandit problem. Stochastic Systems 3(1):230–261.Link, Google Scholar
Han Y, Zhou Z, Zhou Z, Blanchet J, Glynn PW, Ye Y (2020) Sequential batch learning in finite-action linear contextual bandits. Preprint, submitted April 14, https://arxiv.org/abs/ 2004.06321.Google Scholar
Hastie T, Tibshirani R, Wainwright M (2015) Statistical Learning with Sparsity: The Lasso and Generalizations (CRC Press, Boca Raton, FL).Crossref, Google Scholar
Hopp WJ, Li J, Wang G (2018) Big data and the precision medicine revolution. Production Oper. Management 27(9):1647–1664.Crossref, Google Scholar
Joachims T, Swaminathan A, Rijke MD (2018) Deep learning with logged bandit feedback. Bengio Y, LeCun Y, eds. Proc. Internat. Conf. on Learn. Representations (OpenReview, Amherst, MA).Google Scholar
Kallus N, Zhou A (2018) Confounding-robust policy improvement. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in neural information processing systems 31 (Curran Associates, Inc, Red Hook, NY).Google Scholar
Kim ES, Herbst RS, Wistuba II, Lee JJ, Blumenschein GR, Tsao A, Stewart DJ, et al. (2011) The battle trial: Personalizing therapy for lung cancer. Cancer Discovery 1(1):44–53.Crossref, Google Scholar
Kim GS, Paik MC (2019) Doubly-robust lasso bandit. Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, eds. Adv. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 5877–5887.Google Scholar
Kitagawa T, Tetenov A (2018) Who should be treated? Empirical welfare maximization methods for treatment choice. Econometrica 86(2):591–616.Crossref, Google Scholar
Lattimore T, Szepesvári C (2020) Bandit algorithms (Cambridge University Press, Cambridge, UK).Google Scholar
Li K, Yang Y, Narisetty NN (2021) Regret lower bound and optimal algorithm for high-dimensional contextual linear bandit. Electronic J. Statist. 15(2):5652–5695.Google Scholar
Miao S, Chao X (2019) Fast algorithms for online personalized assortment optimization in a big data regime. Preprint, submitted August 5, https://dx.doi.org/10.2139/ssrn.3432574.Google Scholar
Mintz Y, Aswani A, Kaminsky P, Flowers E, Fukuoka Y (2017) Behavioral analytics for myopic agents. Preprint, submitted February 17, https://arxiv.org/abs/ 1702.05496.Google Scholar
Mintz Y, Aswani A, Kaminsky P, Flowers E, Fukuoka Y (2020) Nonstationary bandits with habituation and recovery dynamics. Oper. Res.Link, Google Scholar
Naik P, Wedel M, Bacon L, Bodapati A, Bradlow E, Kamakura W, Kreulen J, et al. (2008) Challenges and opportunities in high-dimensional choice data analyses. Marketing Lett. 19(3–4):201.Crossref, Google Scholar
Oh Mh, Iyengar G, Zeevi A (2021) Sparsity-agnostic lasso bandit. Meila M, Zhang Tong, eds. Proc. Internat. Conf. on Machine Learn. (PMLR, Cambridge, MA), 8271–8280.Google Scholar
Pallmann P, Bedding AW, Choodari-Oskooei B, Dimairo M, Flight L, Hampson LV, Holmes J, et al. (2018) Adaptive designs in clinical trials: Why use them, and how to run and report them. BMC Medicine 16(1):1–15.Crossref, Google Scholar
Perchet V, Rigollet P, Chassang S, Snowberg E (2016) Batched bandit problems. Ann. Statist. 44(2):660–681.Crossref, Google Scholar
Razavian N, Blecker S, Schmidt AM, Smith-McLallen A, Nigam S, Sontag D (2015) Population-level prediction of type 2 diabetes from claims data and analysis of risk factors. Big Data 3(4):277–287.Crossref, Google Scholar
Rigollet P (2015) 18. s997: High Dimensional Statistics (MIT Open-CourseWare, Cambridge, MA).Google Scholar
Rigollet P, Zeevi A (2010) Nonparametric bandits with covariates. Kalai AT, Mohri M, eds. Conference On Learning Theory (Omnipress, Norristown, PA), 54–66.Google Scholar
Robbins H (1952) Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. (New Series) 58(5):527–535.Crossref, Google Scholar
Rudelson M, Vershynin R (2015) Small ball probabilities for linear images of high-dimensional distributions. Internat. Math. Res. Not. IMRN 2015(19):9594–9617.Crossref, Google Scholar
Russo D, Van Roy B (2016) An information-theoretic analysis of Thompson sampling. J. Machine Learn. Res. 17(1):2442–2471.Google Scholar
Schwartz EM, Bradlow ET, Fader PS (2017) Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Sci. 36(4):500–522.Link, Google Scholar
Slivkins A (2019) Introduction to multi-armed bandits. Foundations Trends Machine Learn. 12(1–2):1–286.Crossref, Google Scholar
Swaminathan A, Joachims T (2015) Batch learning from logged bandit feedback through counterfactual risk minimization. J. Machine Learn. Res. 16(52):1731–1755.Google Scholar
Wainwright MJ (2019) High-Dimensional Statistics: A Non-Asymptotic Viewpoint, vol. 48 (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
Wang X, Wei M, Yao T (2018) Minimax concave penalized multi-armed bandit model with high-dimensional covariates. Proc. Internat. Conf. on Machine Learn., 5200–5208.Google Scholar
Zhao YQ, Zeng D, Laber EB, Song R, Yuan M, Kosorok MR (2014) Doubly robust learning for estimating individualized treatment with censored data. Biometrika 102(1):151–168.Crossref, Google Scholar
Zhou M, Fukuoka Y, Mintz Y, Goldberg K, Kaminsky P, Flowers E, Aswani A (2018) Evaluating machine learning–based automated personalized daily step goals delivered through a mobile phone app: Randomized controlled trial. JMIR Mhealth Uhealth 6(1):e28.Crossref, Google Scholar

Volume 70, Issue 2

February 2024

Pages v-vi, 671-1342, iii-iv

Article Information

Metrics

Information

Received:May 10, 2021
Accepted:October 10, 2022
Published Online:October 19, 2023

Cite as

Zhimei Ren, Zhengyuan Zhou (2023) Dynamic Batch Learning in High-Dimensional Sparse Linear Contextual Bandits. Management Science 70(2):1315-1342.

https://doi.org/10.1287/mnsc.2023.4895

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Dynamic Batch Learning in High-Dimensional Sparse Linear Contextual Bandits

References

Volume 70, Issue 2

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News