Delay-Adaptive Learning in Generalized Linear Contextual Bandits

Jose Blanchet
Jose Blanchet
[email protected]
Department of Management Science and Engineering, Stanford University, Stanford, California 94305;
Search for more papers by this author
,
Renyuan Xu
Renyuan Xu
[email protected]
https://orcid.org/0000-0003-4293-3450
Epstein Department of Industrial and Systems Engineering, University of Southern California, Los Angeles, California 90089;Mathematical Institute, University of Oxford, Oxford OX2 6GG, United Kingdom;
Search for more papers by this author
,
Zhengyuan Zhou
Corresponding Author
Zhengyuan Zhou
[email protected]
https://orcid.org/0000-0002-0005-9411
Stern School of Business, New York University, New York, New York 10012
Search for more papers by this author

Department of Management Science and Engineering, Stanford University, Stanford, California 94305;

Epstein Department of Industrial and Systems Engineering, University of Southern California, Los Angeles, California 90089;Mathematical Institute, University of Oxford, Oxford OX2 6GG, United Kingdom;

Search for more papers by this author

Zhengyuan Zhou

Corresponding Author

Zhengyuan Zhou

[email protected]

https://orcid.org/0000-0002-0005-9411

Stern School of Business, New York University, New York, New York 10012

Search for more papers by this author

Published Online:8 Mar 2023https://doi.org/10.1287/moor.2023.1358

References

[1] Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Shawe-Taylor J, Zemel R, Bartlett P, Pereira F, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 24 (Curran Associates, Inc., Red Hook, NY), 2312–2320.Google Scholar
[2] Abeille M, Lazaric A (2017) Linear Thompson sampling revisited. Singh A, Zhu J, eds. Proc. 20th Internat. Conf. Artificial Intelligence Statist. (ML Research Press), 176–184.Google Scholar
[3] Agrawal S, Goyal N (2013) Further optimal regret bounds for Thompson sampling. Carvalho CM, Ravikumar P, eds. Proc. 16th Internat. Conf. Artificial Intelligence Statist. (ML Research Press), 99–107.Google Scholar
[4] Agrawal S, Goyal N (2013) Thompson sampling for contextual bandits with linear payoffs. Dasgupta S, McAllester D, eds. Proc. 30th Internat. Conf. Machine Learn. ( JMLR.org), 127–135.Google Scholar
[5] Agrawal S, Devanur NR, Li L (2016) An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. Feldman V, Rakhlin A, Shamir O, eds. Proc. 29th Annual Conf. Learn. Theory (ML Research Press), 4–18.Google Scholar
[6] Agrawal S, Avadhanula V, Goyal V, Zeevi A (2017) Thompson sampling for the MNL-bandit. Kale S, Shamir O, eds. Proc. Conf. Learn. Theory (ML Research Press), 76–78.Google Scholar
[7] Agarwal A, Hsu D, Kale S, Langford J, Li L, Schapire R (2014) Taming the monster: A fast and simple algorithm for contextual bandits. Xing EP, Jebara T, eds. Proc. 31st Internat. Conf. Machine Learn. ( JMLR.org), 1638–1646.Google Scholar
[8] Athey S, Wager S (2021) Policy learning with observational data. Econometrica 89(1):133–161.Crossref, Google Scholar
[9] Bastani H, Bayati M (2020) Online decision making with high-dimensional covariates. Oper. Res. 68(1):276–294.Link, Google Scholar
[10] Besbes O, Zeevi A (2009) Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Oper. Res. 57(6):1407–1420.Link, Google Scholar
[11] Bubeck S, Cesa-Bianchi N (2012) Regret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems. Foundations and Trends in Machine Learning, vol. 5 (Now Publishers, Norwell, MA).Crossref, Google Scholar
[12] Cesa-Bianchi N, Gentile C, Mansour Y (2019) Delay and cooperation in nonstochastic bandits. J. Machine Learn. Res. 20(1):613–650.Google Scholar
[13] Chapelle O (2014) Modeling delayed feedback in display advertising. Proc. 20th ACM SIGKDD Internat. Conf. Knowledge Discovery Data Mining (Association for Computing Machinery, New York), 1097–1105.Google Scholar
[14] Chen K, Hu I, Ying Z (1999) Strong consistency of maximum quasi-likelihood estimators in generalized linear models with fixed and adaptive designs. Ann. Statist. 27(4):1155–1163.Crossref, Google Scholar
[15] Chow SC, Chang M (2011) Adaptive Design Methods in Clinical Trials (Chapman and Hall/CRC, Boca Raton, FL).Crossref, Google Scholar
[16] Chu W, Li L, Reyzin L, Schapire R (2011) Contextual bandits with linear payoff functions. Gordon G, Dunson D, Dudík M, eds. Proc. 14th Internat. Conf. Artificial Intelligence Statist. (ML Research Press), 208–214.Google Scholar
[17] Desautels T, Krause A, Burdick JW (2014) Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. J. Machine Learn. Res. 15(1):3873–3923.Google Scholar
[18] Diaconis P, Ylvisaker D (1979) Conjugate priors for exponential families. Ann. Statist. 7(2):269–281.Crossref, Google Scholar
[19] Dudík M, Langford J, Li L (2011) Doubly robust policy evaluation and learning. Getoor L, Scheffer T, eds. Proc. 28th Internat. Conf. Machine Learn. (Omnipress, Madison, WI), 1097–1104.Google Scholar
[20] Dudik M, Hsu D, Kale S, Karampatziakis N, Langford J, Reyzin L, Zhang T (2011) Efficient optimal learning for contextual bandits. Proc. 27th Conf. Uncertainty Artificial Intelligence (AUAI Press, Arlington, VA), 169–178.Google Scholar
[21] Faury L, Abeille M, Calauzènes C, Fercoq O (2020) Improved optimistic algorithms for logistic bandits. Internat. Conf. Machine Learn. ( JMLR.org), 3052–3060.Google Scholar
[22] Filippi S, Cappe O, Garivier A, Szepesvári C (2010) Parametric bandits: The generalized linear case. Advances in Neural Information Processing Systems, vol. 23 (Curran Associates, Inc., Red Hook, NY), 586–594.Google Scholar
[23] Garrabrant S, Soares N, Taylor J (2016) Asymptotic convergence in online learning with unbounded delays. Preprint, submitted April 18, https://arxiv.org/abs/1604.05280.Google Scholar
[24] Goldenshluger A, Zeevi A (2011) A note on performance limitations in bandit problems with side information. IEEE Trans. Inform. Theory 57(3):1707–1713.Crossref, Google Scholar
[25] Hopp WJ, Li J, Wang G (2018) Big data and the precision medicine revolution. Production Oper. Management 27(9):1647–1664.Crossref, Google Scholar
[26] Jiang B, Sun Q, Fan J (2018) Bernstein’s inequality for general Markov chains. Preprint, submitted May 28, https://arxiv.org/abs/1805.10721.Google Scholar
[27] Joachims T, Swaminathan A, Rijke MD (2018) Deep learning with logged bandit feedback. Proc. Sixth Internat. Conf. Learning Representations.Google Scholar
[28] Joulani P, Gyorgy A, Szepesvári C (2013) Online learning under delayed feedback. Dasgupta S, McAllester D, eds. Proc. 30th Internat. Conf. Machine Learn. ( JMLR.org), 1453–1461.Google Scholar
[29] Jun KS, Bhargava A, Nowak R, Willett R (2017) Scalable generalized linear bandits: Online computation and hashing. Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 30 (Curran Associates, Inc., Red Hook, NY), 99–109.Google Scholar
[30] Kannan P, Chang A-M, Whinston AB (2001) Wireless commerce: marketing issues and possibilities. Proc. 34th Annual Hawaii Internat. Conf. System Sci. (Institute of Electrical and Electronics Engineers, Piscataway, NJ).Google Scholar
[31] Kitagawa T, Tetenov A (2018) Who should be treated? empirical welfare maximization methods for treatment choice. Econometrica 86(2):591–616.Crossref, Google Scholar
[32] Li L, Lu Y, Zhou D (2017) Provably optimal algorithms for generalized linear contextual bandits. Precup D, The YW, eds. Proc. 34th Internat. Conf. Machine Learn. ( JMLR.org), 2071–2080.Google Scholar
[33] Li L, Chu W, Langford J, Schapire RE (2010) A contextual-bandit approach to personalized news article recommendation. Proc. 19th Internat. Conf. World Wide Web (Association for Computing Machinery, New York), 661–670.Google Scholar
[34] Mandel T, Liu YE, Brunskill E, Popović Z (2015) The queue method: Handling delay, heuristics, prior data, and evaluation in bandits. Proc. 29th AAAI Conf. Artificial Intelligence (Association for the Advancement of Artificial Intelligence, Palo Alto, CA).Google Scholar
[35] Manegueu AG, Vernade C, Carpentier A, Valko M (2020) Stochastic bandits with arm-dependent delays. Daumé III H, Singh A, eds. Proc. 37th Internat. Conf. Machine Learn. ( JMLR.org), 3348–3356.Google Scholar
[36] Mesterharm C (2005) On-line learning with delayed label feedback. Jain S, Simon HU, Tomita E, eds. Algorithmic Learning Theory (Springer, Berlin), 399–413.Crossref, Google Scholar
[37] Neu G, Antos A, György A, Szepesvári C (2010) Online Markov decision processes under bandit feedback. Lafferty J, Williams C, Shawe-Taylor J, Zemel R, Culotta A, eds. Advances in Neural Information Processing Systems, vol. 23 (Curran Associates, Inc., Red Hook, NY), 1804–1812.Google Scholar
[38] Pike-Burke C, Agrawal S, Szepesvari C, Grunewalder S (2018) Bandits with delayed, aggregated anonymous feedback. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn. ( JMLR.org), 4105–4113.Google Scholar
[39] Quanrud K, Khashabi D (2015) Online learning with adversarial delays. Cortes C, Lawrence N, Lee D, Sugiyama M, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 28 (Curran Associates, Inc., Red Hook, NY), 1270–1278.Google Scholar
[40] Rigollet P, Zeevi A (2010) Nonparametric bandits with covariates. Proc. ACM Conf. Comput. Learn. Theory (Association for Computing Machinery, New York).Google Scholar
[41] Russo D, Van Roy B (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.Link, Google Scholar
[42] Russo D, Van Roy B (2016) An information-theoretic analysis of Thompson sampling. J. Machine Learn. Res. 17(1):2442–2471.Google Scholar
[43] Schwartz EM, Bradlow ET, Fader PS (2017) Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Sci. 36(4):500–522.Link, Google Scholar
[44] Simchi-Levi D, Sun R, Wu MX, Zhu R (2020) Calibrating sales forecast in a pandemic using online non-parametric regression model. Preprint, submitted August 13, http://dx.doi.org/10.2139/ssrn.3670264.Google Scholar
[45] Swaminathan A, Joachims T (2015) Batch learning from logged bandit feedback through counterfactual risk minimization. J. Machine Learn. Res. 16(52):1731–1755.Google Scholar
[46] Swaminathan A, Krishnamurthy A, Agarwal A, Dudík M, Langford J, Jose D, Zitouni I (2017) Off-policy evaluation for slate recommendation. Proc. 31st Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 3635–3645.Google Scholar
[47] Vernade C, Carpentier A, Lattimore T, Zappella G, Ermis B, Brueckner M (2020) Linear bandits with stochastic delayed feedback. Daumé H, Singh A, eds. Proc. 37th Internat. Conf. Machine Lear. ( JMLR.org), 9712–9721.Google Scholar
[48] Wang G, Li J, Hopp WJ (2017) Personalized healthcare outcome analysis of cardiovascular surgical procedures. Working Paper 1336, Ross School of Business, University of Michigan, Ann Arbor.Google Scholar
[49] Weinberger MJ, Ordentlich E (2002) On delayed prediction of individual sequences. IEEE Trans. Inform. Theory 48(7):1959–1976.Crossref, Google Scholar
[50] Zhou Z, Athey S, Wager S (2022) Offline multi-action policy learning: Generalization and optimization. Oper. Res., ePub ahead of print June 7, https://doi.org/10.1287/opre.2022.2271.Google Scholar
[51] Zhou Z, Xu R, Blanchet J (2019) Learning in generalized linear contextual bandits with stochastic delays. Advances in Neural Information Processing Systems, vol. 32 (Curran Associates, Inc., Red Hook, NY), 5197–5208.Google Scholar

cover image Mathematics of Operations Research

Volume 49, Issue 1

February 2024

Pages 1-651, C2

Article Information

Metrics

Information

Received:April 08, 2021
Accepted:March 06, 2022
Published Online:March 08, 2023

Cite as

Jose Blanchet, Renyuan Xu, Zhengyuan Zhou (2023) Delay-Adaptive Learning in Generalized Linear Contextual Bandits. Mathematics of Operations Research 49(1):326-345.

https://doi.org/10.1287/moor.2023.1358

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Delay-Adaptive Learning in Generalized Linear Contextual Bandits

References

Volume 49, Issue 1

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News