Delay-Adaptive Learning in Generalized Linear Contextual Bandits
Published Online:8 Mar 2023https://doi.org/10.1287/moor.2023.1358
References
- [1] (2011) Improved algorithms for linear stochastic bandits. Shawe-Taylor J, Zemel R, Bartlett P, Pereira F, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 24 (Curran Associates, Inc., Red Hook, NY), 2312–2320.Google Scholar
- [2] (2017) Linear Thompson sampling revisited. Singh A, Zhu J, eds. Proc. 20th Internat. Conf. Artificial Intelligence Statist. (ML Research Press), 176–184.Google Scholar
- [3] (2013) Further optimal regret bounds for Thompson sampling. Carvalho CM, Ravikumar P, eds. Proc. 16th Internat. Conf. Artificial Intelligence Statist. (ML Research Press), 99–107.Google Scholar
- [4] (2013) Thompson sampling for contextual bandits with linear payoffs. Dasgupta S, McAllester D, eds. Proc. 30th Internat. Conf. Machine Learn. ( JMLR.org), 127–135.Google Scholar
- [5] (2016) An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. Feldman V, Rakhlin A, Shamir O, eds. Proc. 29th Annual Conf. Learn. Theory (ML Research Press), 4–18.Google Scholar
- [6] (2017) Thompson sampling for the MNL-bandit. Kale S, Shamir O, eds. Proc. Conf. Learn. Theory (ML Research Press), 76–78.Google Scholar
- [7] (2014) Taming the monster: A fast and simple algorithm for contextual bandits. Xing EP, Jebara T, eds. Proc. 31st Internat. Conf. Machine Learn. ( JMLR.org), 1638–1646.Google Scholar
- [8] (2021) Policy learning with observational data. Econometrica 89(1):133–161.Crossref, Google Scholar
- [9] (2020) Online decision making with high-dimensional covariates. Oper. Res. 68(1):276–294.Link, Google Scholar
- [10] (2009) Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Oper. Res. 57(6):1407–1420.Link, Google Scholar
- [11] (2012) Regret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems. Foundations and Trends in Machine Learning, vol. 5 (Now Publishers, Norwell, MA).Crossref, Google Scholar
- [12] (2019) Delay and cooperation in nonstochastic bandits. J. Machine Learn. Res. 20(1):613–650.Google Scholar
- [13] (2014) Modeling delayed feedback in display advertising. Proc. 20th ACM SIGKDD Internat. Conf. Knowledge Discovery Data Mining (Association for Computing Machinery, New York), 1097–1105.Google Scholar
- [14] (1999) Strong consistency of maximum quasi-likelihood estimators in generalized linear models with fixed and adaptive designs. Ann. Statist. 27(4):1155–1163.Crossref, Google Scholar
- [15] (2011) Adaptive Design Methods in Clinical Trials (Chapman and Hall/CRC, Boca Raton, FL).Crossref, Google Scholar
- [16] (2011) Contextual bandits with linear payoff functions. Gordon G, Dunson D, Dudík M, eds. Proc. 14th Internat. Conf. Artificial Intelligence Statist. (ML Research Press), 208–214.Google Scholar
- [17] (2014) Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. J. Machine Learn. Res. 15(1):3873–3923.Google Scholar
- [18] (1979) Conjugate priors for exponential families. Ann. Statist. 7(2):269–281.Crossref, Google Scholar
- [19] (2011) Doubly robust policy evaluation and learning. Getoor L, Scheffer T, eds. Proc. 28th Internat. Conf. Machine Learn. (Omnipress, Madison, WI), 1097–1104.Google Scholar
- [20] (2011) Efficient optimal learning for contextual bandits. Proc. 27th Conf. Uncertainty Artificial Intelligence (AUAI Press, Arlington, VA), 169–178.Google Scholar
- [21] (2020) Improved optimistic algorithms for logistic bandits. Internat. Conf. Machine Learn. ( JMLR.org), 3052–3060.Google Scholar
- [22] (2010) Parametric bandits: The generalized linear case. Advances in Neural Information Processing Systems, vol. 23 (Curran Associates, Inc., Red Hook, NY), 586–594.Google Scholar
- [23] (2016) Asymptotic convergence in online learning with unbounded delays. Preprint, submitted April 18, https://arxiv.org/abs/1604.05280.Google Scholar
- [24] (2011) A note on performance limitations in bandit problems with side information. IEEE Trans. Inform. Theory 57(3):1707–1713.Crossref, Google Scholar
- [25] (2018) Big data and the precision medicine revolution. Production Oper. Management 27(9):1647–1664.Crossref, Google Scholar
- [26] (2018) Bernstein’s inequality for general Markov chains. Preprint, submitted May 28, https://arxiv.org/abs/1805.10721.Google Scholar
- [27] (2018) Deep learning with logged bandit feedback. Proc. Sixth Internat. Conf. Learning Representations.Google Scholar
- [28] (2013) Online learning under delayed feedback. Dasgupta S, McAllester D, eds. Proc. 30th Internat. Conf. Machine Learn. ( JMLR.org), 1453–1461.Google Scholar
- [29] (2017) Scalable generalized linear bandits: Online computation and hashing. Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 30 (Curran Associates, Inc., Red Hook, NY), 99–109.Google Scholar
- [30] (2001) Wireless commerce: marketing issues and possibilities. Proc. 34th Annual Hawaii Internat. Conf. System Sci. (Institute of Electrical and Electronics Engineers, Piscataway, NJ).Google Scholar
- [31] (2018) Who should be treated? empirical welfare maximization methods for treatment choice. Econometrica 86(2):591–616.Crossref, Google Scholar
- [32] (2017) Provably optimal algorithms for generalized linear contextual bandits. Precup D, The YW, eds. Proc. 34th Internat. Conf. Machine Learn. ( JMLR.org), 2071–2080.Google Scholar
- [33] (2010) A contextual-bandit approach to personalized news article recommendation. Proc. 19th Internat. Conf. World Wide Web (Association for Computing Machinery, New York), 661–670.Google Scholar
- [34] (2015) The queue method: Handling delay, heuristics, prior data, and evaluation in bandits. Proc. 29th AAAI Conf. Artificial Intelligence (Association for the Advancement of Artificial Intelligence, Palo Alto, CA).Google Scholar
- [35] (2020) Stochastic bandits with arm-dependent delays. Daumé III H, Singh A, eds. Proc. 37th Internat. Conf. Machine Learn. ( JMLR.org), 3348–3356.Google Scholar
- [36] (2005) On-line learning with delayed label feedback. Jain S, Simon HU, Tomita E, eds. Algorithmic Learning Theory (Springer, Berlin), 399–413.Crossref, Google Scholar
- [37] (2010) Online Markov decision processes under bandit feedback. Lafferty J, Williams C, Shawe-Taylor J, Zemel R, Culotta A, eds. Advances in Neural Information Processing Systems, vol. 23 (Curran Associates, Inc., Red Hook, NY), 1804–1812.Google Scholar
- [38] (2018) Bandits with delayed, aggregated anonymous feedback. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn. ( JMLR.org), 4105–4113.Google Scholar
- [39] (2015) Online learning with adversarial delays. Cortes C, Lawrence N, Lee D, Sugiyama M, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 28 (Curran Associates, Inc., Red Hook, NY), 1270–1278.Google Scholar
- [40] (2010) Nonparametric bandits with covariates. Proc. ACM Conf. Comput. Learn. Theory (Association for Computing Machinery, New York).Google Scholar
- [41] (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39(4):1221–1243.Link, Google Scholar
- [42] (2016) An information-theoretic analysis of Thompson sampling. J. Machine Learn. Res. 17(1):2442–2471.Google Scholar
- [43] (2017) Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Sci. 36(4):500–522.Link, Google Scholar
- [44] (2020) Calibrating sales forecast in a pandemic using online non-parametric regression model. Preprint, submitted August 13, http://dx.doi.org/10.2139/ssrn.3670264.Google Scholar
- [45] (2015) Batch learning from logged bandit feedback through counterfactual risk minimization. J. Machine Learn. Res. 16(52):1731–1755.Google Scholar
- [46] (2017) Off-policy evaluation for slate recommendation. Proc. 31st Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 3635–3645.Google Scholar
- [47] (2020) Linear bandits with stochastic delayed feedback. Daumé H, Singh A, eds. Proc. 37th Internat. Conf. Machine Lear. ( JMLR.org), 9712–9721.Google Scholar
- [48] (2017) Personalized healthcare outcome analysis of cardiovascular surgical procedures. Working Paper 1336, Ross School of Business, University of Michigan, Ann Arbor.Google Scholar
- [49] (2002) On delayed prediction of individual sequences. IEEE Trans. Inform. Theory 48(7):1959–1976.Crossref, Google Scholar
- [50] (2022) Offline multi-action policy learning: Generalization and optimization. Oper. Res., ePub ahead of print June 7, https://doi.org/10.1287/opre.2022.2271.Google Scholar
- [51] (2019) Learning in generalized linear contextual bandits with stochastic delays. Advances in Neural Information Processing Systems, vol. 32 (Curran Associates, Inc., Red Hook, NY), 5197–5208.Google Scholar

