Bypassing the Monster: A Faster and Simpler Optimal Algorithm for Contextual Bandits Under Realizability

David Simchi-Levi
David Simchi-Levi
[email protected]
https://orcid.org/0000-0002-4650-1519
Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139;Department of Civil and Environmental Engineering and Operations Research Center, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139;
Search for more papers by this author
,
Yunzong Xu
Yunzong Xu
[email protected]
https://orcid.org/0000-0002-1682-419X
Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139;Statistics and Data Science Center, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139
Search for more papers by this author

Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139;Department of Civil and Environmental Engineering and Operations Research Center, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139;

Search for more papers by this author

Yunzong Xu

[email protected]

https://orcid.org/0000-0002-1682-419X

Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139;Statistics and Data Science Center, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139

Search for more papers by this author

Published Online:9 Dec 2021https://doi.org/10.1287/moor.2021.1193

References

[1] Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Adv. Neural Inform. Processing Systems 24:2312–2320.Google Scholar
[2] Abe N, Long P (1999) Associative reinforcement learning using linear probabilistic concepts. Proc. Internat. Conf. Machine Learn., 3–11.Google Scholar
[3] Abe N, Biermann A, Long P (2003) Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica 37(4):263–293.Crossref, Google Scholar
[4] Agrawal S, Devanur NR (2016) Linear contextual bandits with knapsacks. Adv. Neural Inform. Processing Systems 30:3458–3467.Google Scholar
[5] Agrawal S, Goyal N (2013) Thompson sampling for contextual bandits with linear payoffs. Proc. Internat. Conf. Machine Learning, 127–135.Google Scholar
[6] Agarwal A, Dudík M, Kale S, Langford J, Schapire R (2012) Contextual bandit learning with predictable rewards. Proc. Internat. Conf. Artificial Intelligence and Statistics, 19–26.Google Scholar
[7] Agarwal A, Hsu D, Kale S, Langford J, Li L, Schapire R (2014) Taming the monster: A fast and simple algorithm for contextual bandits. Proc. Internat. Conf. Machine Learning, 1638–1646.Google Scholar
[8] Agarwal A, Bird S, Cozowicz M, Hoang L, Langford J, Lee S, Li J, et al. (2016) Making contextual decisions with low technical debt. Preprint, submitted June 13, https://arxiv.org/abs/1606.03966.Google Scholar
[9] Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2002) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.Crossref, Google Scholar
[10] Bastani H, Bayati M (2020) Online decision making with high-dimensional covariates. Oper. Res. 68(1):276–294.Link, Google Scholar
[11] Beygelzimer A, Langford J, Li L, Reyzin L, Schapire R (2011) Contextual bandit algorithms with supervised learning guarantees. Proc. Internat. Conf. Artificial Intelligence and Statistics, 19–26.Google Scholar
[12] Bietti A, Agarwal A, Langford J (2018) A contextual bandit bake-off. Preprint, submitted February 12, https://arxiv.org/abs/1802.04064.Google Scholar
[13] Cesa-Bianchi N, Gentile C, Mansour Y (2014) Regret minimization for reserve prices in second-price auctions. IEEE Trans. Inform. Theory 61(1):549–564.Crossref, Google Scholar
[14] Chapelle O, Li L (2011) An empirical evaluation of thompson sampling. Adv. Neural Inform. Processing Systems 24:2249–2257.Google Scholar
[15] Chu W, Li L, Reyzin L, Schapire R (2011) Contextual bandits with linear payoff functions. Proc. Internat. Conf. Artificial Intelligence Statistics, 208–214.Google Scholar
[16] Dudik M, Hsu D, Kale S, Karampatziakis N, Langford J, Reyzin L, Zhang T (2011) Efficient optimal learning for contextual bandits. Proc. Conf. Uncertainty in Artificial Intelligence, 169–178.Google Scholar
[17] Farrell MH, Liang T, Misra S (2021) Deep neural networks for estimation and inference. Econometrica 89(1):181–213.Crossref, Google Scholar
[18] Filippi S, Cappe O, Garivier A, Szepesvári C (2010) Parametric bandits: The generalized linear case. Adv. Neural Inform. Processing Systems 23:586–594.Google Scholar
[19] Foster D, Rakhlin A (2020) Beyond ucb: Optimal and efficient contextual bandits with regression oracles. Proc. Internat. Conf. Machine Learning, 3199–3210.Google Scholar
[20] Foster D, Rakhlin A, Simchi-Levi D, Xu Y (2020) Instance-dependent complexity of contextual bandits and reinforcement learning: A disagreement-based perspective. Proc. Conf. Learning Theory, 2059.Google Scholar
[21] Foster D, Agarwal A, Dudik M, Luo H, Schapire R (2018) Practical contextual bandits with regression oracles. Proc. Internat. Conf. Machine Learning, 1539–1548.Google Scholar
[22] Gao Z, Han Y, Ren Z, Zhou Z (2019) Batched multi-armed bandits problem. Adv. Neural Inform. Processing Systems 33:503–513.Google Scholar
[23] Hazan E, Koren T (2016) The computational power of optimization in online learning. Proc. Annual ACM Sympos. Theory of Computing, 128–141.Google Scholar
[24] Klivans AR, Sherstov AA (2009) Cryptographic hardness for learning intersections of halfspaces. J. Comput. System Sci. 75(1):2–12.Crossref, Google Scholar
[25] Krishnamurthy SK, Hadad V, Athey S (2021) Adapting to misspecification in contextual bandits with offline regression oracles. Proc. Internat. Conf. Machine Learning, 5805–5814.Google Scholar
[26] Krishnamurthy A, Agarwal A, Huang TK, Daumé H III, Langford J (2019) Active learning for cost-sensitive classification. J. Machine Learning Res. 20(65):1–50.Google Scholar
[27] Langford J, Zhang T (2008) The epoch-greedy algorithm for multi-armed bandits with side information. Adv. Neural Inform. Processing Systems 21:817–824.Google Scholar
[28] Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
[29] Lattimore T, Szepesvari C, Weisz G (2020) Learning with good feature representations in bandits and in rl with a generative model. Proc. Internat. Conf. Machine Learning, 5662–5670.Google Scholar
[30] Li L, Lu Y, Zhou D (2017) Provably optimal algorithms for generalized linear contextual bandits. Proc. Internat. Conf. Machine Learning, 2071–2080.Google Scholar
[31] Li Y, Wang Y, Zhou Y (2019) Nearly minimax-optimal regret for linearly parameterized bandits. Proc. Conf. Learning Theory, 2173–2174.Google Scholar
[32] Li L, Chu W, Langford J, Schapire RE (2010) A contextual-bandit approach to personalized news article recommendation. Proc. 19th Internat. Conf. World Wide Web, 661–670.Google Scholar
[33] McMahan HB, Streeter M (2009) Tighter bounds for multi-armed bandits with expert advice. Proc. Conf. Learning Theory.Google Scholar
[34] Mendelson S (2014) Learning without concentration. Proc. Conf. Learning Theory, 25–39.Google Scholar
[35] Perchet V, Rigollet P, Chassang S, Snowberg E (2016) Batched bandit problems. Ann. Statist. 44(2):660–681.Crossref, Google Scholar
[36] Rakhlin A, Sridharan K (2014) Online non-parametric regression. Proc. Conf. Learning Theory, 1232–1264.Google Scholar
[37] Rakhlin A, Sridharan K, Tsybakov AB (2017) Empirical entropy, minimax regret and minimax risk. Bernoulli 23(2):789–824.Crossref, Google Scholar
[38] Russo D, Van Roy B, Kazerouni A, Osband I, Wen Z (2018) A tutorial on thompson sampling. Foundations Trends Machine Learning 11(1):1–96.Crossref, Google Scholar
[39] Sen R, Rakhlin A, Ying L, Kidambi R, Foster D, Hill D, Dhillon I (2021) Top-k extreme contextual bandits with arm hierarchy. Proc. Internat. Conf. Machine Learning, 9422–9433.Google Scholar
[40] Simchi-Levi D, Xu Y (2020) Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Preprint, submitted March 28, https://arxiv.org/abs/2003.12699.Google Scholar
[41] Slivkins A (2019) Introduction to multi-armed bandits. Foundations Trends Machine Learning 12(1-2):1–286.Crossref, Google Scholar
[42] Tao T (2011) An Introduction to Measure Theory, vol. 126 (American Mathematical Society, Providence, RI).Crossref, Google Scholar
[43] Tewari A, Murphy SA (2017) From ads to interventions: Contextual bandits in mobile health. Mobile Health (Springer, Berlin), 495–517.Google Scholar
[44] Van Roy B, Dong S (2019) Comments on the du-kakade-wang-yang lower bounds. Preprint, submitted November 18, https://arxiv.org/abs/1911.07910.Google Scholar
[45] Wei CY, Luo H (2021) Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach. Proc. Conf. Learning Theory, 4300–4354.Google Scholar
[46] Xu Y, Zeevi A (2020a) Toward optimal problem dependent generalization error bounds in statistical learning theory. Preprint, submitted November 12, https://arxiv.org/abs/2011.06186.Google Scholar
[47] Xu Y, Zeevi A (2020b) Upper counterfactual confidence bounds: a new optimism principle for contextual bandits. Preprint, submitted July 15, https://arxiv.org/abs/2007.07876.Google Scholar
[48] Yang Y, Barron A (1999) Information-theoretic determination of minimax rates of convergence. Ann. Statist. 27(5):1564–1599.Google Scholar
[49] Zhou D, Li L, Gu Q (2020) Neural contextual bandits with ucb-based exploration. Proc. Internat. Conf. Machine Learning, 11492–11502.Google Scholar

cover image Mathematics of Operations Research

Volume 47, Issue 3

August 2022

Pages 1707-2545, C2

Article Information

Metrics

Information

Received:April 11, 2020
Accepted:June 11, 2021
Published Online:December 09, 2021

Cite as

David Simchi-Levi, Yunzong Xu (2021) Bypassing the Monster: A Faster and Simpler Optimal Algorithm for Contextual Bandits Under Realizability. Mathematics of Operations Research 47(3):1904-1931.

https://doi.org/10.1287/moor.2021.1193

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Bypassing the Monster: A Faster and Simpler Optimal Algorithm for Contextual Bandits Under Realizability

References

Volume 47, Issue 3

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News