Bypassing the Monster: A Faster and Simpler Optimal Algorithm for Contextual Bandits Under Realizability

Published Online:https://doi.org/10.1287/moor.2021.1193

References

  • [1] Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Adv. Neural Inform. Processing Systems 24:2312–2320.Google Scholar
  • [2] Abe N, Long P (1999) Associative reinforcement learning using linear probabilistic concepts. Proc. Internat. Conf. Machine Learn., 3–11.Google Scholar
  • [3] Abe N, Biermann A, Long P (2003) Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica 37(4):263–293.CrossrefGoogle Scholar
  • [4] Agrawal S, Devanur NR (2016) Linear contextual bandits with knapsacks. Adv. Neural Inform. Processing Systems 30:3458–3467.Google Scholar
  • [5] Agrawal S, Goyal N (2013) Thompson sampling for contextual bandits with linear payoffs. Proc. Internat. Conf. Machine Learning, 127–135.Google Scholar
  • [6] Agarwal A, Dudík M, Kale S, Langford J, Schapire R (2012) Contextual bandit learning with predictable rewards. Proc. Internat. Conf. Artificial Intelligence and Statistics, 19–26.Google Scholar
  • [7] Agarwal A, Hsu D, Kale S, Langford J, Li L, Schapire R (2014) Taming the monster: A fast and simple algorithm for contextual bandits. Proc. Internat. Conf. Machine Learning, 1638–1646.Google Scholar
  • [8] Agarwal A, Bird S, Cozowicz M, Hoang L, Langford J, Lee S, Li J, et al. (2016) Making contextual decisions with low technical debt. Preprint, submitted June 13, https://arxiv.org/abs/1606.03966.Google Scholar
  • [9] Auer P, Cesa-Bianchi N, Freund Y, Schapire RE (2002) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.CrossrefGoogle Scholar
  • [10] Bastani H, Bayati M (2020) Online decision making with high-dimensional covariates. Oper. Res. 68(1):276–294.LinkGoogle Scholar
  • [11] Beygelzimer A, Langford J, Li L, Reyzin L, Schapire R (2011) Contextual bandit algorithms with supervised learning guarantees. Proc. Internat. Conf. Artificial Intelligence and Statistics, 19–26.Google Scholar
  • [12] Bietti A, Agarwal A, Langford J (2018) A contextual bandit bake-off. Preprint, submitted February 12, https://arxiv.org/abs/1802.04064.Google Scholar
  • [13] Cesa-Bianchi N, Gentile C, Mansour Y (2014) Regret minimization for reserve prices in second-price auctions. IEEE Trans. Inform. Theory 61(1):549–564.CrossrefGoogle Scholar
  • [14] Chapelle O, Li L (2011) An empirical evaluation of thompson sampling. Adv. Neural Inform. Processing Systems 24:2249–2257.Google Scholar
  • [15] Chu W, Li L, Reyzin L, Schapire R (2011) Contextual bandits with linear payoff functions. Proc. Internat. Conf. Artificial Intelligence Statistics, 208–214.Google Scholar
  • [16] Dudik M, Hsu D, Kale S, Karampatziakis N, Langford J, Reyzin L, Zhang T (2011) Efficient optimal learning for contextual bandits. Proc. Conf. Uncertainty in Artificial Intelligence, 169–178.Google Scholar
  • [17] Farrell MH, Liang T, Misra S (2021) Deep neural networks for estimation and inference. Econometrica 89(1):181–213.CrossrefGoogle Scholar
  • [18] Filippi S, Cappe O, Garivier A, Szepesvári C (2010) Parametric bandits: The generalized linear case. Adv. Neural Inform. Processing Systems 23:586–594.Google Scholar
  • [19] Foster D, Rakhlin A (2020) Beyond ucb: Optimal and efficient contextual bandits with regression oracles. Proc. Internat. Conf. Machine Learning, 3199–3210.Google Scholar
  • [20] Foster D, Rakhlin A, Simchi-Levi D, Xu Y (2020) Instance-dependent complexity of contextual bandits and reinforcement learning: A disagreement-based perspective. Proc. Conf. Learning Theory, 2059.Google Scholar
  • [21] Foster D, Agarwal A, Dudik M, Luo H, Schapire R (2018) Practical contextual bandits with regression oracles. Proc. Internat. Conf. Machine Learning, 1539–1548.Google Scholar
  • [22] Gao Z, Han Y, Ren Z, Zhou Z (2019) Batched multi-armed bandits problem. Adv. Neural Inform. Processing Systems 33:503–513.Google Scholar
  • [23] Hazan E, Koren T (2016) The computational power of optimization in online learning. Proc. Annual ACM Sympos. Theory of Computing, 128–141.Google Scholar
  • [24] Klivans AR, Sherstov AA (2009) Cryptographic hardness for learning intersections of halfspaces. J. Comput. System Sci. 75(1):2–12.CrossrefGoogle Scholar
  • [25] Krishnamurthy SK, Hadad V, Athey S (2021) Adapting to misspecification in contextual bandits with offline regression oracles. Proc. Internat. Conf. Machine Learning, 5805–5814.Google Scholar
  • [26] Krishnamurthy A, Agarwal A, Huang TK, Daumé H III, Langford J (2019) Active learning for cost-sensitive classification. J. Machine Learning Res. 20(65):1–50.Google Scholar
  • [27] Langford J, Zhang T (2008) The epoch-greedy algorithm for multi-armed bandits with side information. Adv. Neural Inform. Processing Systems 21:817–824.Google Scholar
  • [28] Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • [29] Lattimore T, Szepesvari C, Weisz G (2020) Learning with good feature representations in bandits and in rl with a generative model. Proc. Internat. Conf. Machine Learning, 5662–5670.Google Scholar
  • [30] Li L, Lu Y, Zhou D (2017) Provably optimal algorithms for generalized linear contextual bandits. Proc. Internat. Conf. Machine Learning, 2071–2080.Google Scholar
  • [31] Li Y, Wang Y, Zhou Y (2019) Nearly minimax-optimal regret for linearly parameterized bandits. Proc. Conf. Learning Theory, 2173–2174.Google Scholar
  • [32] Li L, Chu W, Langford J, Schapire RE (2010) A contextual-bandit approach to personalized news article recommendation. Proc. 19th Internat. Conf. World Wide Web, 661–670.Google Scholar
  • [33] McMahan HB, Streeter M (2009) Tighter bounds for multi-armed bandits with expert advice. Proc. Conf. Learning Theory.Google Scholar
  • [34] Mendelson S (2014) Learning without concentration. Proc. Conf. Learning Theory, 25–39.Google Scholar
  • [35] Perchet V, Rigollet P, Chassang S, Snowberg E (2016) Batched bandit problems. Ann. Statist. 44(2):660–681.CrossrefGoogle Scholar
  • [36] Rakhlin A, Sridharan K (2014) Online non-parametric regression. Proc. Conf. Learning Theory, 1232–1264.Google Scholar
  • [37] Rakhlin A, Sridharan K, Tsybakov AB (2017) Empirical entropy, minimax regret and minimax risk. Bernoulli 23(2):789–824.CrossrefGoogle Scholar
  • [38] Russo D, Van Roy B, Kazerouni A, Osband I, Wen Z (2018) A tutorial on thompson sampling. Foundations Trends Machine Learning 11(1):1–96.CrossrefGoogle Scholar
  • [39] Sen R, Rakhlin A, Ying L, Kidambi R, Foster D, Hill D, Dhillon I (2021) Top-k extreme contextual bandits with arm hierarchy. Proc. Internat. Conf. Machine Learning, 9422–9433.Google Scholar
  • [40] Simchi-Levi D, Xu Y (2020) Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Preprint, submitted March 28, https://arxiv.org/abs/2003.12699.Google Scholar
  • [41] Slivkins A (2019) Introduction to multi-armed bandits. Foundations Trends Machine Learning 12(1-2):1–286.CrossrefGoogle Scholar
  • [42] Tao T (2011) An Introduction to Measure Theory, vol. 126 (American Mathematical Society, Providence, RI).CrossrefGoogle Scholar
  • [43] Tewari A, Murphy SA (2017) From ads to interventions: Contextual bandits in mobile health. Mobile Health (Springer, Berlin), 495–517.Google Scholar
  • [44] Van Roy B, Dong S (2019) Comments on the du-kakade-wang-yang lower bounds. Preprint, submitted November 18, https://arxiv.org/abs/1911.07910.Google Scholar
  • [45] Wei CY, Luo H (2021) Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach. Proc. Conf. Learning Theory, 4300–4354.Google Scholar
  • [46] Xu Y, Zeevi A (2020a) Toward optimal problem dependent generalization error bounds in statistical learning theory. Preprint, submitted November 12, https://arxiv.org/abs/2011.06186.Google Scholar
  • [47] Xu Y, Zeevi A (2020b) Upper counterfactual confidence bounds: a new optimism principle for contextual bandits. Preprint, submitted July 15, https://arxiv.org/abs/2007.07876.Google Scholar
  • [48] Yang Y, Barron A (1999) Information-theoretic determination of minimax rates of convergence. Ann. Statist. 27(5):1564–1599.Google Scholar
  • [49] Zhou D, Li L, Gu Q (2020) Neural contextual bandits with ucb-based exploration. Proc. Internat. Conf. Machine Learning, 11492–11502.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.