Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model

Published Online:https://doi.org/10.1287/opre.2023.2451

References

  • Agarwal A, Kakade S, Yang LF (2020) Model-based reinforcement learning with a generative model is minimax optimal. Conf. Learn. Theory (COLT) (PMLR, New York), 67–83.Google Scholar
  • Azar MG, Munos R, Kappen B (2012) On the sample complexity of reinforcement learning with a generative model. Preprint, submitted June 27, https://arxiv.org/abs/1206.6461.Google Scholar
  • Azar MG, Munos R, Kappen HJ (2013) Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learn. 91(3):325–349.CrossrefGoogle Scholar
  • Azar MG, Osband I, Munos R (2017) Minimax regret bounds for reinforcement learning. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (PMLR, New York), 263–272.Google Scholar
  • Bai Y, Xie T, Jiang N, Wang YX (2019) Provably efficient Q-learning with low switching cost. Adv. Neural Inform. Processing Systems 32 (NeurIPS, San Diego).Google Scholar
  • Beck CL, Srikant R (2012) Error bounds for constant step-size Q-learning. Systems Control Lett. 61(12):1203–1208.CrossrefGoogle Scholar
  • Bellman R (1952) On the theory of dynamic programming. Proc. Natl. Acad. Sci. USA 38(8):716–719.CrossrefGoogle Scholar
  • Bertsekas DP (2017) Dynamic Programming and Optimal Control, 4th ed. (Athena Scientific, Nashua, NH).Google Scholar
  • Bhandari J, Russo D, Singal R (2018) A finite time analysis of temporal difference learning with linear function approximation. Conf. Learn. Theory (COLT) (PMLR, New York), 1691–1692.Google Scholar
  • Bradtke SJ, Barto AG (1996) Linear least-squares algorithms for temporal difference learning. Machine Learn. 22(1–3):33–57.CrossrefGoogle Scholar
  • Cai Q, Yang Z, Lee JD, Wang Z (2019) Neural temporal-difference learning converges to global optima. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 11312–11322.Google Scholar
  • Chen Y, Chi Y, Fan J, Ma C (2021) Spectral methods for data science: A statistical perspective. Foundations Trends Machine Learn. 14(5):566–806.CrossrefGoogle Scholar
  • Chen Y, Fan J, Ma C, Wang K (2019a) Spectral method and regularized MLE are both optimal for top-K ranking. Ann. Statist. 47(4):2204–2235.CrossrefGoogle Scholar
  • Chen Y, Fan J, Ma C, Yan Y (2019b) Inference and uncertainty quantification for noisy matrix completion. Proc. Natl. Acad. Sci. USA 116(46):22931–22937.CrossrefGoogle Scholar
  • Chen Z, Maguluri ST, Shakkottai S, Shanmugam K (2020) Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes. Preprint, submitted October 21, https://arxiv.org/abs/2002.00874v4.Google Scholar
  • Cui Q, Yang LF (2021) Minimax sample complexity for turn-based stochastic game. Proc. Thirty-Seventh Conf. Uncertainty Artificial Intelligence, vol. 161 (PMLR, New York), 1496–1504.Google Scholar
  • Dalal G, Szörényi B, Thoppe G, Mannor S (2018) Finite sample analyses for TD(0) with function approximation. Thirty-Second AAAI Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA).Google Scholar
  • Domingues OD, Ménard P, Kaufmann E, Valko M (2021) Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited. Algorithmic Learn. Theory (PMLR, New York), 578–598.Google Scholar
  • El Karoui N (2018) On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probab. Theory Related Fields 170:95–175.CrossrefGoogle Scholar
  • Even-Dar E, Mansour Y (2003) Learning rates for Q-learning. J. Mach. Learn. Res. 5(December):1–25.Google Scholar
  • Fan J, Wang Z, Xie Y, Yang Z (2019) A theoretical analysis of deep Q-learning. Preprint, submitted May 29, https://arxiv.org/abs/1901.00137v2.Google Scholar
  • Gupta H, Srikant R, Ying L (2019) Finite-time performance bounds and adaptive learning rate selection for two time-scale reinforcement learning. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 4706–4715.Google Scholar
  • Jaakkola T, Jordan MI, Singh SP (1994) Convergence of stochastic iterative dynamic programming algorithms. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 703–710.Google Scholar
  • Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 4863–4873.Google Scholar
  • Jin C, Yang Z, Wang Z, Jordan MI (2020) Provably efficient reinforcement learning with linear function approximation. Conf. Learn. Theory (PMLR, New York), 2137–2143.Google Scholar
  • Kakade S (2003) On the sample complexity of reinforcement learning. PhD thesis, University of London, London.Google Scholar
  • Kaledin M, Moulines E, Naumov A, Tadic V, Wai HT (2020) Finite time analysis of linear two-timescale stochastic approximation with Markovian noise. Preprint, submitted February 4, https://arxiv.org/abs/2002.01268.Google Scholar
  • Kearns MJ, Singh SP (1999) Finite-sample convergence rates for Q-learning and indirect algorithms. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 996–1002.Google Scholar
  • Kearns M, Mansour Y, Ng AY (2002) A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learn. 49(2–3):193–208.CrossrefGoogle Scholar
  • Khamaru K, Pananjady A, Ruan F, Wainwright MJ, Jordan MI (2020) Is temporal difference learning optimal? An instance-dependent analysis. Preprint, submitted March 16, https://arxiv.org/abs/2003.07337.Google Scholar
  • Lakshminarayanan C, Szepesvari C (2018) Linear stochastic approximation: How far does constant step-size and iterate averaging go? Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 1347–1355.Google Scholar
  • Lattimore T, Hutter M (2012) PAC bounds for discounted MDPs. Internat. Conf. Algorithmic Learn. Theory (Springer, Berlin), 320–334.Google Scholar
  • Li G, Chi Y, Wei Y, Chen Y (2022a) Minimax-optimal multi-agent RL in zero-sum Markov games with a generative model. Preprint, submitted October 12, https://arxiv.org/abs/2208.10458.Google Scholar
  • Li G, Cai C, Chen Y, Wei Y, Chi Y (2023) Is Q-learning minimax optimal? A tight sample complexity analysis. Oper. Res. Forthcoming.Google Scholar
  • Li G, Chen Y, Chi Y, Gu Y, Wei Y (2021a) Sample-efficient reinforcement learning is feasible for linearly realizable MDPs with limited revisiting. Adv. Neural Inform. Processing Systems 34 (NeurIPS, San Diego), 16671–16685.Google Scholar
  • Li G, Shi L, Chen Y, Chi Y, Wei Y (2022b) Settling the sample complexity of model-based offline reinforcement learning. Preprint, submitted April 11, https://arxiv.org/abs/2204.05275v1.Google Scholar
  • Li G, Shi L, Chen Y, Gu Y, Chi Y (2021b) Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning. Adv. Neural Inform. Processing Systems 34 (NeurIPS, San Diego), 17762–17776.Google Scholar
  • Li G, Wei Y, Chi Y, Gu Y, Chen Y (2020) Breaking the sample size barrier in model-based reinforcement learning with a generative model. Adv. Neural Inform. Processing Systems 33 (NeurIPS, San Diego), 12861–12872.Google Scholar
  • Li G, Wei Y, Chi Y, Gu Y, Chen Y (2022c) Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. IEEE Trans. Inform. Theory 68(1):448–473.CrossrefGoogle Scholar
  • Li G, Cai C, Chen Y, Gu Y, Wei Y, Chi Y (2021c) Tightening the dependence on horizon in the sample complexity of Q-learning. Internat. Conf. Machine Learn. (PMLR, New York), 6296–6306.Google Scholar
  • Ma C, Wang K, Chi Y, Chen Y (2020) Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution. Foundations Comput. Math. 20(3):451–632.CrossrefGoogle Scholar
  • Mou W, Li CJ, Wainwright MJ, Bartlett PL, Jordan MI (2020) On linear stochastic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration. Preprint, submitted April 9, https://arxiv.org/abs/2004.04719.Google Scholar
  • Pananjady A, Wainwright MJ (2019) Value function estimation in Markov reward processes: Instance-dependent ℓ∞-bounds for policy evaluation. Preprint, submitted September 19, https://arxiv.org/abs/1909.08749v1.Google Scholar
  • Qu G, Wierman A (2020) Finite-time analysis of asynchronous stochastic approximation and Q-learning. Conf. Learn. Theory, 3185–3205.Google Scholar
  • Shah D, Xie Q (2018) Q-learning with nearest neighbors. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 3111–3121.Google Scholar
  • Shi L, Li G, Wei Y, Chen Y, Chi Y (2022) Pessimistic Q-learning for offline reinforcement learning: Toward optimal sample complexity. Internat. Conf. Machine Learn. (PMLR, New York).Google Scholar
  • Sidford A, Wang M, Wu X, Ye Y (2018a) Variance reduced value iteration and faster algorithms for solving Markov decision processes. Proc. Twenty-Ninth Annual ACM-SIAM Sympos. Discrete Algorithms, 770–787.Google Scholar
  • Sidford A, Wang M, Wu X, Yang L, Ye Y (2018b) Near-optimal time and sample complexities for solving Markov decision processes with a generative model. Adv. Neural Inform. Processing Systems, 5186–5196.Google Scholar
  • Srikant R, Ying L (2019) Finite-time error bounds for linear stochastic approximation and TD learning. Conf. Learn. Theory (PMLR, New York), 2803–2830.Google Scholar
  • Strehl AL, Li L, Wiewiora E, Langford J, Littman ML (2006) PAC model-free reinforcement learning. Internat. Conf. Machine Learn. (PMLR, New York), 881–888.Google Scholar
  • Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
  • Szepesvári C (1998) The asymptotic convergence-rate of Q-learning. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 1064–1070.Google Scholar
  • Szepesvári C (2010) Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning (Springer, Cham, Switzerland).CrossrefGoogle Scholar
  • Tsitsiklis JN (1994) Asynchronous stochastic approximation and Q-learning. Machine Learn. 16(3):185–202.CrossrefGoogle Scholar
  • Tsitsiklis J, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control 42(5):674–690.CrossrefGoogle Scholar
  • Tu S, Recht B (2019) The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. Conf. Learn. Theory (PMLR, New York), 3036–3083.Google Scholar
  • Vershynin R (2018) High-Dimensional Probability: An Introduction with Applications in Data Science, vol. 47 (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • Wainwright MJ (2019a) Stochastic approximation with cone-contractive operators: Sharp ℓ∞-bounds for Q-learning. Preprint, submitted June 24, https://arxiv.org/abs/1905.06265v2.Google Scholar
  • Wainwright MJ (2019b) Variance-reduced Q-learning is minimax optimal. Preprint, submitted August 8, https://arxiv.org/abs/1906.04697.Google Scholar
  • Wang M (2020) Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time. Math. Oper. Res. 45(2):517–546.LinkGoogle Scholar
  • Wang B, Yan Y, Fan J (2021) Sample-efficient reinforcement learning for linearly-parameterized MDPs with a generative model. Neural Inform. Processing Systems (NeurIPS, San Diego), 23009–23022.Google Scholar
  • Xu P, Gu Q (2020) A finite-time analysis of Q-learning with neural network function approximation. Internat. Conf. Machine Learn. (PMLR, New York), 10555–10565.Google Scholar
  • Xu T, Zou S, Liang Y (2019) Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 10633–10643.Google Scholar
  • Yan Y, Chen Y, Fan J (2021) Inference for heteroskedastic PCA with missing data. Preprint, submitted July 26, https://arxiv.org/abs/2107.12365.Google Scholar
  • Yan Y, Li G, Chen Y, Fan J (2022a) Model-based reinforcement learning is minimax-optimal for offline zero-sum Markov games. Preprint, submitted June 8, https://arxiv.org/abs/2206.04044.Google Scholar
  • Yan Y, Li G, Chen Y, Fan J (2022b) The efficacy of pessimism in asynchronous Q-learning. Preprint, submitted March 14, https://arxiv.org/abs/2203.07368.Google Scholar
  • Yang L, Wang M (2019) Sample-optimal parametric Q-learning using linearly additive features. Internat. Conf. Machine Learn. (PMLR, New York), 6995–7004.Google Scholar
  • Yin M, Bai Y, Wang YX (2021) Near-optimal provable uniform convergence in offline policy evaluation for reinforcement learning. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 1567–1575.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.