Delay-Adaptive Learning in Generalized Linear Contextual Bandits

Published Online:https://doi.org/10.1287/moor.2023.1358

References

  • [1] Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Shawe-Taylor J, Zemel R, Bartlett P, Pereira F, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 24 (Curran Associates, Inc., Red Hook, NY), 2312–2320.Google Scholar
  • [2] Abeille M, Lazaric A (2017) Linear Thompson sampling revisited. Singh A, Zhu J, eds. Proc. 20th Internat. Conf. Artificial Intelligence Statist. (ML Research Press), 176–184.Google Scholar
  • [3] Agrawal S, Goyal N (2013) Further optimal regret bounds for Thompson sampling. Carvalho CM, Ravikumar P, eds. Proc. 16th Internat. Conf. Artificial Intelligence Statist. (ML Research Press), 99–107.Google Scholar
  • [4] Agrawal S, Goyal N (2013) Thompson sampling for contextual bandits with linear payoffs. Dasgupta S, McAllester D, eds. Proc. 30th Internat. Conf. Machine Learn. ( JMLR.org), 127–135.Google Scholar
  • [5] Agrawal S, Devanur NR, Li L (2016) An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. Feldman V, Rakhlin A, Shamir O, eds. Proc. 29th Annual Conf. Learn. Theory (ML Research Press), 4–18.Google Scholar
  • [6] Agrawal S, Avadhanula V, Goyal V, Zeevi A (2017) Thompson sampling for the MNL-bandit. Kale S, Shamir O, eds. Proc. Conf. Learn. Theory (ML Research Press), 76–78.Google Scholar
  • [7] Agarwal A, Hsu D, Kale S, Langford J, Li L, Schapire R (2014) Taming the monster: A fast and simple algorithm for contextual bandits. Xing EP, Jebara T, eds. Proc. 31st Internat. Conf. Machine Learn. ( JMLR.org), 1638–1646.Google Scholar
  • [8] Athey S, Wager S (2021) Policy learning with observational data. Econometrica 89(1):133–161.CrossrefGoogle Scholar
  • [9] Bastani H, Bayati M (2020) Online decision making with high-dimensional covariates. Oper. Res. 68(1):276–294.LinkGoogle Scholar
  • [10] Besbes O, Zeevi A (2009) Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Oper. Res. 57(6):1407–1420.LinkGoogle Scholar
  • [11] Bubeck S, Cesa-Bianchi N (2012) Regret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems. Foundations and Trends in Machine Learning, vol. 5 (Now Publishers, Norwell, MA).CrossrefGoogle Scholar
  • [12] Cesa-Bianchi N, Gentile C, Mansour Y (2019) Delay and cooperation in nonstochastic bandits. J. Machine Learn. Res. 20(1):613–650.Google Scholar
  • [13] Chapelle O (2014) Modeling delayed feedback in display advertising. Proc. 20th ACM SIGKDD Internat. Conf. Knowledge Discovery Data Mining (Association for Computing Machinery, New York), 1097–1105.Google Scholar
  • [14] Chen K, Hu I, Ying Z (1999) Strong consistency of maximum quasi-likelihood estimators in generalized linear models with fixed and adaptive designs. Ann. Statist. 27(4):1155–1163.CrossrefGoogle Scholar
  • [15] Chow SC, Chang M (2011) Adaptive Design Methods in Clinical Trials (Chapman and Hall/CRC, Boca Raton, FL).CrossrefGoogle Scholar
  • [16] Chu W, Li L, Reyzin L, Schapire R (2011) Contextual bandits with linear payoff functions. Gordon G, Dunson D, Dudík M, eds. Proc. 14th Internat. Conf. Artificial Intelligence Statist. (ML Research Press), 208–214.Google Scholar
  • [17] Desautels T, Krause A, Burdick JW (2014) Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. J. Machine Learn. Res. 15(1):3873–3923.Google Scholar
  • [18] Diaconis P, Ylvisaker D (1979) Conjugate priors for exponential families. Ann. Statist. 7(2):269–281.CrossrefGoogle Scholar
  • [19] Dudík M, Langford J, Li L (2011) Doubly robust policy evaluation and learning. Getoor L, Scheffer T, eds. Proc. 28th Internat. Conf. Machine Learn. (Omnipress, Madison, WI), 1097–1104.Google Scholar
  • [20] Dudik M, Hsu D, Kale S, Karampatziakis N, Langford J, Reyzin L, Zhang T (2011) Efficient optimal learning for contextual bandits. Proc. 27th Conf. Uncertainty Artificial Intelligence (AUAI Press, Arlington, VA), 169–178.Google Scholar
  • [21] Faury L, Abeille M, Calauzènes C, Fercoq O (2020) Improved optimistic algorithms for logistic bandits. Internat. Conf. Machine Learn. ( JMLR.org), 3052–3060.Google Scholar
  • [22] Filippi S, Cappe O, Garivier A, Szepesvári C (2010) Parametric bandits: The generalized linear case. Advances in Neural Information Processing Systems, vol. 23 (Curran Associates, Inc., Red Hook, NY), 586–594.Google Scholar
  • [23] Garrabrant S, Soares N, Taylor J (2016) Asymptotic convergence in online learning with unbounded delays. Preprint, submitted April 18, https://arxiv.org/abs/1604.05280.Google Scholar
  • [24] Goldenshluger A, Zeevi A (2011) A note on performance limitations in bandit problems with side information. IEEE Trans. Inform. Theory 57(3):1707–1713.CrossrefGoogle Scholar
  • [25] Hopp WJ, Li J, Wang G (2018) Big data and the precision medicine revolution. Production Oper. Management 27(9):1647–1664.CrossrefGoogle Scholar
  • [26] Jiang B, Sun Q, Fan J (2018) Bernstein’s inequality for general Markov chains. Preprint, submitted May 28, https://arxiv.org/abs/1805.10721.Google Scholar
  • [27] Joachims T, Swaminathan A, Rijke MD (2018) Deep learning with logged bandit feedback. Proc. Sixth Internat. Conf. Learning Representations.Google Scholar
  • [28] Joulani P, Gyorgy A, Szepesvári C (2013) Online learning under delayed feedback. Dasgupta S, McAllester D, eds. Proc. 30th Internat. Conf. Machine Learn. ( JMLR.org), 1453–1461.Google Scholar
  • [29] Jun KS, Bhargava A, Nowak R, Willett R (2017) Scalable generalized linear bandits: Online computation and hashing. Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 30 (Curran Associates, Inc., Red Hook, NY), 99–109.Google Scholar
  • [30] Kannan P, Chang A-M, Whinston AB (2001) Wireless commerce: marketing issues and possibilities. Proc. 34th Annual Hawaii Internat. Conf. System Sci. (Institute of Electrical and Electronics Engineers, Piscataway, NJ).Google Scholar
  • [31] Kitagawa T, Tetenov A (2018) Who should be treated? empirical welfare maximization methods for treatment choice. Econometrica 86(2):591–616.CrossrefGoogle Scholar
  • [32] Li L, Lu Y, Zhou D (2017) Provably optimal algorithms for generalized linear contextual bandits. Precup D, The YW, eds. Proc. 34th Internat. Conf. Machine Learn. ( JMLR.org), 2071–2080.Google Scholar
  • [33] Li L, Chu W, Langford J, Schapire RE (2010) A contextual-bandit approach to personalized news article recommendation. Proc. 19th Internat. Conf. World Wide Web (Association for Computing Machinery, New York), 661–670.Google Scholar
  • [34] Mandel T, Liu YE, Brunskill E, Popović Z (2015) The queue method: Handling delay, heuristics, prior data, and evaluation in bandits. Proc. 29th AAAI Conf. Artificial Intelligence (Association for the Advancement of Artificial Intelligence, Palo Alto, CA).Google Scholar
  • [35] Manegueu AG, Vernade C, Carpentier A, Valko M (2020) Stochastic bandits with arm-dependent delays. Daumé III H, Singh A, eds. Proc. 37th Internat. Conf. Machine Learn. ( JMLR.org), 3348–3356.Google Scholar
  • [36] Mesterharm C (2005) On-line learning with delayed label feedback. Jain S, Simon HU, Tomita E, eds. Algorithmic Learning Theory (Springer, Berlin), 399–413.CrossrefGoogle Scholar
  • [37] Neu G, Antos A, György A, Szepesvári C (2010) Online Markov decision processes under bandit feedback. Lafferty J, Williams C, Shawe-Taylor J, Zemel R, Culotta A, eds. Advances in Neural Information Processing Systems, vol. 23 (Curran Associates, Inc., Red Hook, NY), 1804–1812.Google Scholar
  • [38] Pike-Burke C, Agrawal S, Szepesvari C, Grunewalder S (2018) Bandits with delayed, aggregated anonymous feedback. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn. ( JMLR.org), 4105–4113.Google Scholar
  • [39] Quanrud K, Khashabi D (2015) Online learning with adversarial delays. Cortes C, Lawrence N, Lee D, Sugiyama M, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 28 (Curran Associates, Inc., Red Hook, NY), 1270–1278.Google Scholar
  • [40] Rigollet P, Zeevi A (2010) Nonparametric bandits with covariates. Proc. ACM Conf. Comput. Learn. Theory (Association for Computing Machinery, New York).Google Scholar
  • [41] Russo D, Van Roy B (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.LinkGoogle Scholar
  • [42] Russo D, Van Roy B (2016) An information-theoretic analysis of Thompson sampling. J. Machine Learn. Res. 17(1):2442–2471.Google Scholar
  • [43] Schwartz EM, Bradlow ET, Fader PS (2017) Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Sci. 36(4):500–522.LinkGoogle Scholar
  • [44] Simchi-Levi D, Sun R, Wu MX, Zhu R (2020) Calibrating sales forecast in a pandemic using online non-parametric regression model. Preprint, submitted August 13, http://dx.doi.org/10.2139/ssrn.3670264.Google Scholar
  • [45] Swaminathan A, Joachims T (2015) Batch learning from logged bandit feedback through counterfactual risk minimization. J. Machine Learn. Res. 16(52):1731–1755.Google Scholar
  • [46] Swaminathan A, Krishnamurthy A, Agarwal A, Dudík M, Langford J, Jose D, Zitouni I (2017) Off-policy evaluation for slate recommendation. Proc. 31st Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 3635–3645.Google Scholar
  • [47] Vernade C, Carpentier A, Lattimore T, Zappella G, Ermis B, Brueckner M (2020) Linear bandits with stochastic delayed feedback. Daumé H, Singh A, eds. Proc. 37th Internat. Conf. Machine Lear. ( JMLR.org), 9712–9721.Google Scholar
  • [48] Wang G, Li J, Hopp WJ (2017) Personalized healthcare outcome analysis of cardiovascular surgical procedures. Working Paper 1336, Ross School of Business, University of Michigan, Ann Arbor.Google Scholar
  • [49] Weinberger MJ, Ordentlich E (2002) On delayed prediction of individual sequences. IEEE Trans. Inform. Theory 48(7):1959–1976.CrossrefGoogle Scholar
  • [50] Zhou Z, Athey S, Wager S (2022) Offline multi-action policy learning: Generalization and optimization. Oper. Res., ePub ahead of print June 7, https://doi.org/10.1287/opre.2022.2271.Google Scholar
  • [51] Zhou Z, Xu R, Blanchet J (2019) Learning in generalized linear contextual bandits with stochastic delays. Advances in Neural Information Processing Systems, vol. 32 (Curran Associates, Inc., Red Hook, NY), 5197–5208.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.