Provably Efficient Reinforcement Learning with Linear Function Approximation

Published Online:https://doi.org/10.1287/moor.2022.1309

References

  • [1] Abbasi-Yadkori Y, Szepesvári C (2011) Regret bounds for the adaptive control of linear quadratic systems. Proc. 24th Annual Conf. Learn. Theory (JMLR.org, Cambridge, MA), 1–26.Google Scholar
  • [2] Abbasi-Yadkori Y, Lazic N, Szepesvári C (2019) Model-free linear quadratic control via reduction to expert prediction. Chaudhuri K, Sugiyama M, eds. Proc. 22nd Internat. Conf. Artificial Intelligence Statist. (Society for Artificial Intelligence and Statistics, New Jersey), 3108–3117.Google Scholar
  • [3] Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Shawe-Taylor J, R. Zemel R, Bartlett P, Pereira F, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 24 (Curran Associates, Red Hook, NY), 2312–2320.Google Scholar
  • [4] Abeille M, Lazaric A (2018) Improved regret bounds for Thompson sampling in linear quadratic control problems. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn. (JMLR.org, Cambridge, MA), 1–9.Google Scholar
  • [5] Agarwal A, Kakade S, Krishnamurthy A, Sun W (2020) FLAMBE: Structural complexity and representation learning of low rank MDPs. Preprint, submitted June 18, https://arxiv.org/abs/2006.10814.Google Scholar
  • [6] Auer P (2002) Using confidence bounds for exploitation-exploration trade-offs. J. Machine Learn. Res. 3(November):397–422.Google Scholar
  • [7] Azar MG, Munos R, Kappen B (2012) On the sample complexity of reinforcement learning with a generative model. Preprint, submitted June 27, https://arxiv.org/abs/1206.6461.Google Scholar
  • [8] Azar MG, Osband I, Munos R (2017) Minimax regret bounds for reinforcement learning. Precup D, Teh YW, eds. Proc. 34th Internat. Conf. Machine Learn. (JMLR.org, Cambridge, MA), 263–272.Google Scholar
  • [9] Azar MG, Munos R, Ghavamzadaeh M, Kappen HJ (2011) Speedy Q-learning. Shawe-Taylor J, R. Zemel R, Bartlett P, Pereira F, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 24 (MIT Press, Cambridge, MA), 2411–2419.Google Scholar
  • [10] Azizzadenesheli K, Brunskill E, Anandkumar A (2018) Efficient exploration through Bayesian deep q-networks. Proc. 2018 Inform. Theory Appl. Workshop (Institute of Electrical and Electronics Engineers, Piscataway, NJ), 1–9.Google Scholar
  • [11] Baird L (1995) Residual algorithms: Reinforcement learning with function approximation. Proc. 12th Internat. Conf. Machine Learn. (Morgan Kaufmann, Burlington, MA), 30–37.Google Scholar
  • [12] Boyan JA, Moore AW (1994) Generalization in reinforcement learning: Safely approximating the value function. Tesauro G, Touretzky D, Leen T, eds. Advances in Neural Information Processing Systems, vol. 7 (MIT Press, Cambridge, MA), 369–376.Google Scholar
  • [13] Bradtke SJ, Barto AG (1996) Linear least-squares algorithms for temporal difference learning. Machine Learn. 22(1–3):33–57.CrossrefGoogle Scholar
  • [14] Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1–122.CrossrefGoogle Scholar
  • [15] Chu W, Li L, Reyzin L, Schapire R (2011) Contextual bandits with linear payoff functions. Gordon G, Dunson D, Dudík M, eds. Proc. 14th Internat. Conf. Artificial Intelligence Statist. (AISTATS) (Society for Artificial Intelligence and Statistics, New Jersey), 208–214.Google Scholar
  • [16] Cohen A, Koren T, Mansour Y (2019) Learning linear-quadratic regulators efficiently with only T regret. Preprint, submitted February 17, https://arxiv.org/abs/1902.06223.Google Scholar
  • [17] Dani V, Hayes TP, Kakade SM (2008) Stochastic linear optimization under bandit feedback. Servedio R, Zhang T, eds. Proc. 21st Annual Conf. Learn. Theory (Association for Computational Learning, Mountain View, CA), 355–366.Google Scholar
  • [18] Dann C, Brunskill E (2015) Sample complexity of episodic fixed-horizon reinforcement learning. Cortes C, Lawrence N, Lee D, Sugiyama M, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 28 (MIT Press, Cambridge, MA), 2818–2826.Google Scholar
  • [19] Dann C, Lattimore T, Brunskill E (2017) Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 30 (Curran Associates, Red Hook, NY), 5713–5723.Google Scholar
  • [20] Dann C, Li L, Wei W, Brunskill E (2019) Policy certificates: Toward accountable reinforcement learning. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn (Microtome, Brookline, MA), 1507–1516.Google Scholar
  • [21] Dann C, Jiang N, Krishnamurthy A, Agarwal A, Langford J, Schapire RE (2018) On oracle-efficient PAC RL with rich observations. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Red Hook, NY), 1422–1432.Google Scholar
  • [22] Dean S, Mania H, Matni N, Recht B, Tu S (2018) Regret bounds for robust adaptive control of the linear quadratic regulator. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Red Hook, NY), 4188–4197.Google Scholar
  • [23] Dong K, Peng J, Wang Y, Zhou Y (2020) n-regret for learning in Markov decision processes with function approximation and low Bellman rank. Abernethy J, Agarwal S, eds. Proc. 33rd Conf. Learn. Theory (Proceedings of Machine Learning Research), 1554–1557.Google Scholar
  • [24] Du SS, Kakade SM, Wang R, Yang LF (2019a) Is a good representation sufficient for sample efficient reinforcement learning? Preprint, submitted October 7, https://arxiv.org/abs/1910.03016.Google Scholar
  • [25] Du SS, Luo Y, Wang R, Zhang H (2019b) Provably efficient Q-learning with function approximation via distribution shift error checking oracle. Preprint, submitted June 14, https://arxiv.org/abs/1906.06321.Google Scholar
  • [26] Jaksch T, Ortner R, Auer P (2010) Near-optimal regret bounds for reinforcement learning. J. Machine Learn. Res. 11(4):1563–1600.Google Scholar
  • [27] Jiang N, Krishnamurthy A, Agarwal A, Langford J, Schapire RE (2017) Contextual decision processes with low Bellman rank are PAC-learnable. Precup D, Teh YW, eds. Proc. 34th Internat. Conf. Machine Learn. (JMLR.org, Cambridge, MA), 1704–1713.Google Scholar
  • [28] Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Red Hook, NY), 4863–4873.Google Scholar
  • [29] Kearns M, Singh S (2002) Near-optimal reinforcement learning in polynomial time. Machine Learn. 49(2–3):209–232.CrossrefGoogle Scholar
  • [30] Kober J, Peters J (2012) Reinforcement learning in robotics: A survey. Wiering M, van Otterlo M, eds. Reinforcement Learning: State-of-the-Art. Adaptation, Learning, and Optimization, vol. 12 (Springer, Berlin), 579–610.Google Scholar
  • [31] Koenig S, Simmons RG (1993) Complexity analysis of real-time reinforcement learning. Fikes R, Lehnert WG, eds. Proc. 11th Natl. Conf. Artificial Intelligence (Association for the Advancement of Artificial Intelligence, Menlo Park, CA), 99–107.Google Scholar
  • [32] Lattimore T, Hutter M (2012) PAC bounds for discounted MDPs. Bshouty NH, Stoltz G, Vayatis N, Zeugmann T, eds. Proc. 23rd Internat. Conf. Algorithmic Learn. Theory (Springer, Berlin), 320–334.Google Scholar
  • [33] Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press, Cambridge, UK).Google Scholar
  • [34] Li J, Monroe W, Ritter A, Galley M, Gao J, Jurafsky D (2016) Deep reinforcement learning for dialogue generation. Preprint, submitted June 5, https://arxiv.org/abs/1606.01541.Google Scholar
  • [35] Li L, Chu W, Langford J, Schapire RE (2010) A contextual-bandit approach to personalized news article recommendation. Proc. 19th Internat. Conf. World Wide Web (Association for Computing Machinery, New York), 661–670.Google Scholar
  • [36] Melo FS, Ribeiro MI (2007) Q-learning with linear function approximation. Bshouty NH, Gentile C, eds. Proc. 20th Annual Conf. Learn. Theory (Springer, Berlin), 308–322.Google Scholar
  • [37] Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M (2013) Playing Atari with deep reinforcement learning. Preprint, submitted December 19, https://arxiv.org/abs/1312.5602.Google Scholar
  • [38] Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn. (Curran Associates, Red Hook, NY), 1928–1937.Google Scholar
  • [39] Osband I, Van Roy B (2016) On lower bounds for regret in reinforcement learning. Preprint, submitted August 9, https://arxiv.org/abs/1608.02732.Google Scholar
  • [40] Osband I, Van Roy B, Wen Z (2016) Generalization and exploration via randomized value functions. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn. (Curran Associates, Red Hook, NY), 2377–2386.Google Scholar
  • [41] Osband I, Van Roy B, Russo D, Wen Z (2019) Deep exploration via randomized value functions. J. Machine Learn. Res. 20(124):1–62.Google Scholar
  • [42] Puterman ML (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming (John Wiley & Sons, Hoboken, NJ).Google Scholar
  • [43] Rusmevichientong P, Tsitsiklis JN (2010) Linearly parameterized bandits. Math. Oper. Res. 35(2):395–411.LinkGoogle Scholar
  • [44] Russo D (2019) Worst-case regret bounds for exploration via randomized value functions. Chaudhuri K, Salakhutdinov R, eds. Proc. 33rd Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Red Hook, NY), 14433–14443.Google Scholar
  • [45] Schulman J, Levine S, Abbeel P, Jordan M, Moritz P (2015) Trust region policy optimization. Bach F, Blei D, eds. Proc. 32nd Internat. Conf. Machine Learn. (JMLR.org, Cambridge, MA), 1889–1897.Google Scholar
  • [46] Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. Preprint, submitted July 20, https://arxiv.org/abs/1707.06347.Google Scholar
  • [47] Sidford A, Wang M, Wu X, Ye Y (2018) Variance reduced value iteration and faster algorithms for solving Markov decision processes. Proc. 29th ACM-SIAM Sympos. Discrete Algorithms (Society for Industrial and Applied Mathematics, Philadelphia), 770–787.Google Scholar
  • [48] Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.CrossrefGoogle Scholar
  • [49] Strehl AL, Li L, Wiewiora E, Langford J, Littman ML (2006) PAC model-free reinforcement learning. Cohen W, Moore A, eds. Proc. 23rd Internat. Conf. Machine Learn. (JMLR.org, Cambridge, MA), 881–888.Google Scholar
  • [50] Sun W, Jiang N, Krishnamurthy A, Agarwal A, Langford J (2019) Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. Beygelzimer A, Hsu D, eds. Proc. 32nd Annual Conf. Learn. Theory (JMLR.org, Cambridge, MA), 2898–2933.Google Scholar
  • [51] Sutton RS (1988) Learning to predict by the methods of temporal differences. Machine Learn. 3(1):9–44.CrossrefGoogle Scholar
  • [52] Sutton RS, Barto AG (2011) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
  • [53] Szepesvári C (2010) Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning (Springer, Cham, Switzerland).CrossrefGoogle Scholar
  • [54] Tsitsiklis JN, Van Roy B (1997) Analysis of temporal-diffference learning with function approximation. Jordan M, Kearns M, Solla S, eds. Advances in Neural Information Processing Systems, vol. 9 (MIT Press, Cambridge, MA), 1075–1081.Google Scholar
  • [55] Vershynin R (2010) Introduction to the non-asymptotic analysis of random matrices. Preprint, submitted November 12, https://arxiv.org/abs/1011.3027.Google Scholar
  • [56] Wainwright MJ (2019) Variance-reduced Q-learning is minimax optimal. Preprint, submitted June 11, https://arxiv.org/abs/1906.04697.Google Scholar
  • [57] Wang T, Ye W, Geng D, Rudin C (2019) Toward practical Lipschitz stochastic bandits. Preprint, submitted January 26, https://arxiv.org/abs/1901.09277.Google Scholar
  • [58] Weisz G, Amortila P, Szepesvári C (2021) Exponential lower bounds for planning in MDPs with linearly-realizable optimal action-value functions. Feldman V, Ligett K, Sabato S, eds. Proc. 32nd Internat. Conf. Algorithmic Learn. Theory (Microtome, Brookline, MA), 1237–1264.Google Scholar
  • [59] Wen Z, Van Roy B (2013) Efficient exploration and value function generalization in deterministic systems. Burges CJ, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 26 (Curran Associates, Red Hook, NY), 3021–3029.Google Scholar
  • [60] Wen Z, Van Roy B (2017) Efficient reinforcement learning in deterministic systems with value function generalization. Math. Oper. Res. 42(3):762–782.LinkGoogle Scholar
  • [61] Yang L, Wang M (2019) Sample-optimal parametric Q-learning using linearly additive features. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn. (Microtome, Brookline, MA), 6995–7004.Google Scholar
  • [62] Yang LF, Wang M (2020) Reinforcement leaning in feature space: Matrix bandit, kernels, and regret bound. Daumé III H, Singh A, eds. Proc. 37th Internat. Conf. Machine Learn. (Microtome, Brookline, MA), 10746–10756.Google Scholar
  • [63] Zanette A, Brunskill E (2019) Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn. (Microtome, Brookline, MA), 7304–7312.Google Scholar
  • [64] Zhou H, Chen J, Varshney LR, Jagmohan A (2020) Nonstationary reinforcement learning with linear function approximation. Preprint, submitted October 8, https://arxiv.org/abs/2010.04244.Google Scholar
  • [65] Zhu X, Dunson DB (2019) Lipschitz bandit optimization with improved efficiency. Preprint, submitted April 25, https://arxiv.org/abs/1904.11131.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.