Provably Efficient Reinforcement Learning with Linear Function Approximation
References
- [1] (2011) Regret bounds for the adaptive control of linear quadratic systems. Proc. 24th Annual Conf. Learn. Theory (JMLR.org, Cambridge, MA), 1–26.Google Scholar
- [2] (2019) Model-free linear quadratic control via reduction to expert prediction. Chaudhuri K, Sugiyama M, eds. Proc. 22nd Internat. Conf. Artificial Intelligence Statist. (Society for Artificial Intelligence and Statistics, New Jersey), 3108–3117.Google Scholar
- [3] (2011) Improved algorithms for linear stochastic bandits. Shawe-Taylor J, R. Zemel R, Bartlett P, Pereira F, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 24 (Curran Associates, Red Hook, NY), 2312–2320.Google Scholar
- [4] (2018) Improved regret bounds for Thompson sampling in linear quadratic control problems. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn. (JMLR.org, Cambridge, MA), 1–9.Google Scholar
- [5] (2020) FLAMBE: Structural complexity and representation learning of low rank MDPs. Preprint, submitted June 18, https://arxiv.org/abs/2006.10814.Google Scholar
- [6] (2002) Using confidence bounds for exploitation-exploration trade-offs. J. Machine Learn. Res. 3(November):397–422.Google Scholar
- [7] (2012) On the sample complexity of reinforcement learning with a generative model. Preprint, submitted June 27, https://arxiv.org/abs/1206.6461.Google Scholar
- [8] (2017) Minimax regret bounds for reinforcement learning. Precup D, Teh YW, eds. Proc. 34th Internat. Conf. Machine Learn. (JMLR.org, Cambridge, MA), 263–272.Google Scholar
- [9] (2011) Speedy Q-learning. Shawe-Taylor J, R. Zemel R, Bartlett P, Pereira F, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 24 (MIT Press, Cambridge, MA), 2411–2419.Google Scholar
- [10] (2018) Efficient exploration through Bayesian deep q-networks. Proc. 2018 Inform. Theory Appl. Workshop (Institute of Electrical and Electronics Engineers, Piscataway, NJ), 1–9.Google Scholar
- [11] (1995) Residual algorithms: Reinforcement learning with function approximation. Proc. 12th Internat. Conf. Machine Learn. (Morgan Kaufmann, Burlington, MA), 30–37.Google Scholar
- [12] (1994) Generalization in reinforcement learning: Safely approximating the value function. Tesauro G, Touretzky D, Leen T, eds. Advances in Neural Information Processing Systems, vol. 7 (MIT Press, Cambridge, MA), 369–376.Google Scholar
- [13] (1996) Linear least-squares algorithms for temporal difference learning. Machine Learn. 22(1–3):33–57.Crossref, Google Scholar
- [14] (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.Crossref, Google Scholar
- [15] (2011) Contextual bandits with linear payoff functions. Gordon G, Dunson D, Dudík M, eds. Proc. 14th Internat. Conf. Artificial Intelligence Statist. (AISTATS) (Society for Artificial Intelligence and Statistics, New Jersey), 208–214.Google Scholar
- [16] (2019) Learning linear-quadratic regulators efficiently with only T regret. Preprint, submitted February 17, https://arxiv.org/abs/1902.06223.Google Scholar
- [17] (2008) Stochastic linear optimization under bandit feedback. Servedio R, Zhang T, eds. Proc. 21st Annual Conf. Learn. Theory (Association for Computational Learning, Mountain View, CA), 355–366.Google Scholar
- [18] (2015) Sample complexity of episodic fixed-horizon reinforcement learning. Cortes C, Lawrence N, Lee D, Sugiyama M, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 28 (MIT Press, Cambridge, MA), 2818–2826.Google Scholar
- [19] (2017) Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 30 (Curran Associates, Red Hook, NY), 5713–5723.Google Scholar
- [20] (2019) Policy certificates: Toward accountable reinforcement learning. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn (Microtome, Brookline, MA), 1507–1516.Google Scholar
- [21] (2018) On oracle-efficient PAC RL with rich observations. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Red Hook, NY), 1422–1432.Google Scholar
- [22] (2018) Regret bounds for robust adaptive control of the linear quadratic regulator. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Red Hook, NY), 4188–4197.Google Scholar
- [23] (2020) n-regret for learning in Markov decision processes with function approximation and low Bellman rank. Abernethy J, Agarwal S, eds. Proc. 33rd Conf. Learn. Theory (Proceedings of Machine Learning Research), 1554–1557.Google Scholar
- [24] (2019a) Is a good representation sufficient for sample efficient reinforcement learning? Preprint, submitted October 7, https://arxiv.org/abs/1910.03016.Google Scholar
- [25] (2019b) Provably efficient Q-learning with function approximation via distribution shift error checking oracle. Preprint, submitted June 14, https://arxiv.org/abs/1906.06321.Google Scholar
- [26] (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(4):1563–1600.Google Scholar
- [27] (2017) Contextual decision processes with low Bellman rank are PAC-learnable. Precup D, Teh YW, eds. Proc. 34th Internat. Conf. Machine Learn. (JMLR.org, Cambridge, MA), 1704–1713.Google Scholar
- [28] (2018) Is Q-learning provably efficient? Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Red Hook, NY), 4863–4873.Google Scholar
- [29] (2002) Near-optimal reinforcement learning in polynomial time. Machine Learn. 49(2–3):209–232.Crossref, Google Scholar
- [30] Kober J, Peters J (2012) Reinforcement learning in robotics: A survey. Wiering M, van Otterlo M, eds. Reinforcement Learning: State-of-the-Art. Adaptation, Learning, and Optimization, vol. 12 (Springer, Berlin), 579–610.Google Scholar
- [31] (1993) Complexity analysis of real-time reinforcement learning. Fikes R, Lehnert WG, eds. Proc. 11th Natl. Conf. Artificial Intelligence (Association for the Advancement of Artificial Intelligence, Menlo Park, CA), 99–107.Google Scholar
- [32] (2012) PAC bounds for discounted MDPs. Bshouty NH, Stoltz G, Vayatis N, Zeugmann T, eds. Proc. 23rd Internat. Conf. Algorithmic Learn. Theory (Springer, Berlin), 320–334.Google Scholar
- [33] (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).Google Scholar
- [34] (2016) Deep reinforcement learning for dialogue generation. Preprint, submitted June 5, https://arxiv.org/abs/1606.01541.Google Scholar
- [35] (2010) A contextual-bandit approach to personalized news article recommendation. Proc. 19th Internat. Conf. World Wide Web (Association for Computing Machinery, New York), 661–670.Google Scholar
- [36] (2007) Q-learning with linear function approximation. Bshouty NH, Gentile C, eds. Proc. 20th Annual Conf. Learn. Theory (Springer, Berlin), 308–322.Google Scholar
- [37] (2013) Playing Atari with deep reinforcement learning. Preprint, submitted December 19, https://arxiv.org/abs/1312.5602.Google Scholar
- [38] (2016) Asynchronous methods for deep reinforcement learning. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn. (Curran Associates, Red Hook, NY), 1928–1937.Google Scholar
- [39] (2016) On lower bounds for regret in reinforcement learning. Preprint, submitted August 9, https://arxiv.org/abs/1608.02732.Google Scholar
- [40] (2016) Generalization and exploration via randomized value functions. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn. (Curran Associates, Red Hook, NY), 2377–2386.Google Scholar
- [41] (2019) Deep exploration via randomized value functions. J. Machine Learn. Res. 20(124):1–62.Google Scholar
- [42] (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, Hoboken, NJ).Google Scholar
- [43] (2010) Linearly parameterized bandits. Math. Oper. Res. 35(2):395–411.Link, Google Scholar
- [44] (2019) Worst-case regret bounds for exploration via randomized value functions. Chaudhuri K, Salakhutdinov R, eds. Proc. 33rd Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Red Hook, NY), 14433–14443.Google Scholar
- [45] (2015) Trust region policy optimization. Bach F, Blei D, eds. Proc. 32nd Internat. Conf. Machine Learn. (JMLR.org, Cambridge, MA), 1889–1897.Google Scholar
- [46] (2017) Proximal policy optimization algorithms. Preprint, submitted July 20, https://arxiv.org/abs/1707.06347.Google Scholar
- [47] (2018) Variance reduced value iteration and faster algorithms for solving Markov decision processes. Proc. 29th ACM-SIAM Sympos. Discrete Algorithms (Society for Industrial and Applied Mathematics, Philadelphia), 770–787.Google Scholar
- [48] (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.Crossref, Google Scholar
- [49] (2006) PAC model-free reinforcement learning. Cohen W, Moore A, eds. Proc. 23rd Internat. Conf. Machine Learn. (JMLR.org, Cambridge, MA), 881–888.Google Scholar
- [50] (2019) Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. Beygelzimer A, Hsu D, eds. Proc. 32nd Annual Conf. Learn. Theory (JMLR.org, Cambridge, MA), 2898–2933.Google Scholar
- [51] (1988) Learning to predict by the methods of temporal differences. Machine Learn. 3(1):9–44.Crossref, Google Scholar
- [52] (2011) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
- [53] (2010) Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning (Springer, Cham, Switzerland).Crossref, Google Scholar
- [54] (1997) Analysis of temporal-diffference learning with function approximation. Jordan M, Kearns M, Solla S, eds. Advances in Neural Information Processing Systems, vol. 9 (MIT Press, Cambridge, MA), 1075–1081.Google Scholar
- [55] (2010) Introduction to the non-asymptotic analysis of random matrices. Preprint, submitted November 12, https://arxiv.org/abs/1011.3027.Google Scholar
- [56] (2019) Variance-reduced Q-learning is minimax optimal. Preprint, submitted June 11, https://arxiv.org/abs/1906.04697.Google Scholar
- [57] (2019) Toward practical Lipschitz stochastic bandits. Preprint, submitted January 26, https://arxiv.org/abs/1901.09277.Google Scholar
- [58] (2021) Exponential lower bounds for planning in MDPs with linearly-realizable optimal action-value functions. Feldman V, Ligett K, Sabato S, eds. Proc. 32nd Internat. Conf. Algorithmic Learn. Theory (Microtome, Brookline, MA), 1237–1264.Google Scholar
- [59] (2013) Efficient exploration and value function generalization in deterministic systems. Burges CJ, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 26 (Curran Associates, Red Hook, NY), 3021–3029.Google Scholar
- [60] (2017) Efficient reinforcement learning in deterministic systems with value function generalization. Math. Oper. Res. 42(3):762–782.Link, Google Scholar
- [61] (2019) Sample-optimal parametric Q-learning using linearly additive features. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn. (Microtome, Brookline, MA), 6995–7004.Google Scholar
- [62] (2020) Reinforcement leaning in feature space: Matrix bandit, kernels, and regret bound. Daumé III H, Singh A, eds. Proc. 37th Internat. Conf. Machine Learn. (Microtome, Brookline, MA), 10746–10756.Google Scholar
- [63] (2019) Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn. (Microtome, Brookline, MA), 7304–7312.Google Scholar
- [64] (2020) Nonstationary reinforcement learning with linear function approximation. Preprint, submitted October 8, https://arxiv.org/abs/2010.04244.Google Scholar
- [65] (2019) Lipschitz bandit optimization with improved efficiency. Preprint, submitted April 25, https://arxiv.org/abs/1904.11131.Google Scholar

