Bypassing the Monster: A Faster and Simpler Optimal Algorithm for Contextual Bandits Under Realizability
Published Online:9 Dec 2021https://doi.org/10.1287/moor.2021.1193
References
- [1] (2011) Improved algorithms for linear stochastic bandits. Adv. Neural Inform. Processing Systems 24:2312–2320.Google Scholar
- [2] (1999) Associative reinforcement learning using linear probabilistic concepts. Proc. Internat. Conf. Machine Learn., 3–11.Google Scholar
- [3] (2003) Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica 37(4):263–293.Crossref, Google Scholar
- [4] (2016) Linear contextual bandits with knapsacks. Adv. Neural Inform. Processing Systems 30:3458–3467.Google Scholar
- [5] (2013) Thompson sampling for contextual bandits with linear payoffs. Proc. Internat. Conf. Machine Learning, 127–135.Google Scholar
- [6] (2012) Contextual bandit learning with predictable rewards. Proc. Internat. Conf. Artificial Intelligence and Statistics, 19–26.Google Scholar
- [7] (2014) Taming the monster: A fast and simple algorithm for contextual bandits. Proc. Internat. Conf. Machine Learning, 1638–1646.Google Scholar
- [8] (2016) Making contextual decisions with low technical debt. Preprint, submitted June 13, https://arxiv.org/abs/1606.03966.Google Scholar
- [9] (2002) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.Crossref, Google Scholar
- [10] (2020) Online decision making with high-dimensional covariates. Oper. Res. 68(1):276–294.Link, Google Scholar
- [11] (2011) Contextual bandit algorithms with supervised learning guarantees. Proc. Internat. Conf. Artificial Intelligence and Statistics, 19–26.Google Scholar
- [12] (2018) A contextual bandit bake-off. Preprint, submitted February 12, https://arxiv.org/abs/1802.04064.Google Scholar
- [13] (2014) Regret minimization for reserve prices in second-price auctions. IEEE Trans. Inform. Theory 61(1):549–564.Crossref, Google Scholar
- [14] (2011) An empirical evaluation of thompson sampling. Adv. Neural Inform. Processing Systems 24:2249–2257.Google Scholar
- [15] (2011) Contextual bandits with linear payoff functions. Proc. Internat. Conf. Artificial Intelligence Statistics, 208–214.Google Scholar
- [16] (2011) Efficient optimal learning for contextual bandits. Proc. Conf. Uncertainty in Artificial Intelligence, 169–178.Google Scholar
- [17] (2021) Deep neural networks for estimation and inference. Econometrica 89(1):181–213.Crossref, Google Scholar
- [18] (2010) Parametric bandits: The generalized linear case. Adv. Neural Inform. Processing Systems 23:586–594.Google Scholar
- [19] (2020) Beyond ucb: Optimal and efficient contextual bandits with regression oracles. Proc. Internat. Conf. Machine Learning, 3199–3210.Google Scholar
- [20] (2020) Instance-dependent complexity of contextual bandits and reinforcement learning: A disagreement-based perspective. Proc. Conf. Learning Theory, 2059.Google Scholar
- [21] (2018) Practical contextual bandits with regression oracles. Proc. Internat. Conf. Machine Learning, 1539–1548.Google Scholar
- [22] (2019) Batched multi-armed bandits problem. Adv. Neural Inform. Processing Systems 33:503–513.Google Scholar
- [23] (2016) The computational power of optimization in online learning. Proc. Annual ACM Sympos. Theory of Computing, 128–141.Google Scholar
- [24] (2009) Cryptographic hardness for learning intersections of halfspaces. J. Comput. System Sci. 75(1):2–12.Crossref, Google Scholar
- [25] (2021) Adapting to misspecification in contextual bandits with offline regression oracles. Proc. Internat. Conf. Machine Learning, 5805–5814.Google Scholar
- [26] (2019) Active learning for cost-sensitive classification. J. Machine Learning Res. 20(65):1–50.Google Scholar
- [27] (2008) The epoch-greedy algorithm for multi-armed bandits with side information. Adv. Neural Inform. Processing Systems 21:817–824.Google Scholar
- [28] (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
- [29] (2020) Learning with good feature representations in bandits and in rl with a generative model. Proc. Internat. Conf. Machine Learning, 5662–5670.Google Scholar
- [30] (2017) Provably optimal algorithms for generalized linear contextual bandits. Proc. Internat. Conf. Machine Learning, 2071–2080.Google Scholar
- [31] (2019) Nearly minimax-optimal regret for linearly parameterized bandits. Proc. Conf. Learning Theory, 2173–2174.Google Scholar
- [32] (2010) A contextual-bandit approach to personalized news article recommendation. Proc. 19th Internat. Conf. World Wide Web, 661–670.Google Scholar
- [33] (2009) Tighter bounds for multi-armed bandits with expert advice. Proc. Conf. Learning Theory.Google Scholar
- [34] (2014) Learning without concentration. Proc. Conf. Learning Theory, 25–39.Google Scholar
- [35] (2016) Batched bandit problems. Ann. Statist. 44(2):660–681.Crossref, Google Scholar
- [36] (2014) Online non-parametric regression. Proc. Conf. Learning Theory, 1232–1264.Google Scholar
- [37] (2017) Empirical entropy, minimax regret and minimax risk. Bernoulli 23(2):789–824.Crossref, Google Scholar
- [38] (2018) A tutorial on thompson sampling. Foundations Trends Machine Learning 11(1):1–96.Crossref, Google Scholar
- [39] (2021) Top-k extreme contextual bandits with arm hierarchy. Proc. Internat. Conf. Machine Learning, 9422–9433.Google Scholar
- [40] (2020) Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Preprint, submitted March 28, https://arxiv.org/abs/2003.12699.Google Scholar
- [41] (2019) Introduction to multi-armed bandits. Foundations Trends Machine Learning 12(1-2):1–286.Crossref, Google Scholar
- [42] (2011) An Introduction to Measure Theory, vol. 126 (American Mathematical Society, Providence, RI).Crossref, Google Scholar
- [43] (2017) From ads to interventions: Contextual bandits in mobile health. Mobile Health (Springer, Berlin), 495–517.Google Scholar
- [44] (2019) Comments on the du-kakade-wang-yang lower bounds. Preprint, submitted November 18, https://arxiv.org/abs/1911.07910.Google Scholar
- [45] (2021) Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach. Proc. Conf. Learning Theory, 4300–4354.Google Scholar
- [46] (2020a) Toward optimal problem dependent generalization error bounds in statistical learning theory. Preprint, submitted November 12, https://arxiv.org/abs/2011.06186.Google Scholar
- [47] (2020b) Upper counterfactual confidence bounds: a new optimism principle for contextual bandits. Preprint, submitted July 15, https://arxiv.org/abs/2007.07876.Google Scholar
- [48] (1999) Information-theoretic determination of minimax rates of convergence. Ann. Statist. 27(5):1564–1599.Google Scholar
- [49] (2020) Neural contextual bandits with ucb-based exploration. Proc. Internat. Conf. Machine Learning, 11492–11502.Google Scholar

