Open Access

Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model

Gen Li
Gen Li
[email protected]
Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104;
Search for more papers by this author
,
Yuting Wei
Corresponding Author
Yuting Wei
[email protected]
https://orcid.org/0000-0003-1488-4647
Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104;
Search for more papers by this author
,
Yuejie Chi
Yuejie Chi
[email protected]
https://orcid.org/0000-0002-6766-5459
Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213;
Search for more papers by this author
,
Yuxin Chen
Yuxin Chen
[email protected]
https://orcid.org/0000-0001-9256-5815
Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104;Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, Pennsylvania 19104
Search for more papers by this author

Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104;

Search for more papers by this author

Yuting Wei

Corresponding Author

Yuting Wei

[email protected]

https://orcid.org/0000-0003-1488-4647

Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104;

Search for more papers by this author

Yuejie Chi

[email protected]

https://orcid.org/0000-0002-6766-5459

Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213;

Search for more papers by this author

Yuxin Chen

[email protected]

https://orcid.org/0000-0001-9256-5815

Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104;Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, Pennsylvania 19104

Search for more papers by this author

Published Online:21 Apr 2023https://doi.org/10.1287/opre.2023.2451

References

Agarwal A, Kakade S, Yang LF (2020) Model-based reinforcement learning with a generative model is minimax optimal. Conf. Learn. Theory (COLT) (PMLR, New York), 67–83.Google Scholar
Azar MG, Munos R, Kappen B (2012) On the sample complexity of reinforcement learning with a generative model. Preprint, submitted June 27, https://arxiv.org/abs/1206.6461.Google Scholar
Azar MG, Munos R, Kappen HJ (2013) Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learn. 91(3):325–349.Crossref, Google Scholar
Azar MG, Osband I, Munos R (2017) Minimax regret bounds for reinforcement learning. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (PMLR, New York), 263–272.Google Scholar
Bai Y, Xie T, Jiang N, Wang YX (2019) Provably efficient Q-learning with low switching cost. Adv. Neural Inform. Processing Systems 32 (NeurIPS, San Diego).Google Scholar
Beck CL, Srikant R (2012) Error bounds for constant step-size Q-learning. Systems Control Lett. 61(12):1203–1208.Crossref, Google Scholar
Bellman R (1952) On the theory of dynamic programming. Proc. Natl. Acad. Sci. USA 38(8):716–719.Crossref, Google Scholar
Bertsekas DP (2017) Dynamic Programming and Optimal Control, 4th ed. (Athena Scientific, Nashua, NH).Google Scholar
Bhandari J, Russo D, Singal R (2018) A finite time analysis of temporal difference learning with linear function approximation. Conf. Learn. Theory (COLT) (PMLR, New York), 1691–1692.Google Scholar
Bradtke SJ, Barto AG (1996) Linear least-squares algorithms for temporal difference learning. Machine Learn. 22(1–3):33–57.Crossref, Google Scholar
Cai Q, Yang Z, Lee JD, Wang Z (2019) Neural temporal-difference learning converges to global optima. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 11312–11322.Google Scholar
Chen Y, Chi Y, Fan J, Ma C (2021) Spectral methods for data science: A statistical perspective. Foundations Trends Machine Learn. 14(5):566–806.Crossref, Google Scholar
Chen Y, Fan J, Ma C, Wang K (2019a) Spectral method and regularized MLE are both optimal for top-K ranking. Ann. Statist. 47(4):2204–2235.Crossref, Google Scholar
Chen Y, Fan J, Ma C, Yan Y (2019b) Inference and uncertainty quantification for noisy matrix completion. Proc. Natl. Acad. Sci. USA 116(46):22931–22937.Crossref, Google Scholar
Chen Z, Maguluri ST, Shakkottai S, Shanmugam K (2020) Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes. Preprint, submitted October 21, https://arxiv.org/abs/2002.00874v4.Google Scholar
Cui Q, Yang LF (2021) Minimax sample complexity for turn-based stochastic game. Proc. Thirty-Seventh Conf. Uncertainty Artificial Intelligence, vol. 161 (PMLR, New York), 1496–1504.Google Scholar
Dalal G, Szörényi B, Thoppe G, Mannor S (2018) Finite sample analyses for TD(0) with function approximation. Thirty-Second AAAI Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA).Google Scholar
Domingues OD, Ménard P, Kaufmann E, Valko M (2021) Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited. Algorithmic Learn. Theory (PMLR, New York), 578–598.Google Scholar
El Karoui N (2018) On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probab. Theory Related Fields 170:95–175.Crossref, Google Scholar
Even-Dar E, Mansour Y (2003) Learning rates for Q-learning. J. Mach. Learn. Res. 5(December):1–25.Google Scholar
Fan J, Wang Z, Xie Y, Yang Z (2019) A theoretical analysis of deep Q-learning. Preprint, submitted May 29, https://arxiv.org/abs/1901.00137v2.Google Scholar
Gupta H, Srikant R, Ying L (2019) Finite-time performance bounds and adaptive learning rate selection for two time-scale reinforcement learning. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 4706–4715.Google Scholar
Jaakkola T, Jordan MI, Singh SP (1994) Convergence of stochastic iterative dynamic programming algorithms. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 703–710.Google Scholar
Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 4863–4873.Google Scholar
Jin C, Yang Z, Wang Z, Jordan MI (2020) Provably efficient reinforcement learning with linear function approximation. Conf. Learn. Theory (PMLR, New York), 2137–2143.Google Scholar
Kakade S (2003) On the sample complexity of reinforcement learning. PhD thesis, University of London, London.Google Scholar
Kaledin M, Moulines E, Naumov A, Tadic V, Wai HT (2020) Finite time analysis of linear two-timescale stochastic approximation with Markovian noise. Preprint, submitted February 4, https://arxiv.org/abs/2002.01268.Google Scholar
Kearns MJ, Singh SP (1999) Finite-sample convergence rates for Q-learning and indirect algorithms. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 996–1002.Google Scholar
Kearns M, Mansour Y, Ng AY (2002) A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learn. 49(2–3):193–208.Crossref, Google Scholar
Khamaru K, Pananjady A, Ruan F, Wainwright MJ, Jordan MI (2020) Is temporal difference learning optimal? An instance-dependent analysis. Preprint, submitted March 16, https://arxiv.org/abs/2003.07337.Google Scholar
Lakshminarayanan C, Szepesvari C (2018) Linear stochastic approximation: How far does constant step-size and iterate averaging go? Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 1347–1355.Google Scholar
Lattimore T, Hutter M (2012) PAC bounds for discounted MDPs. Internat. Conf. Algorithmic Learn. Theory (Springer, Berlin), 320–334.Google Scholar
Li G, Chi Y, Wei Y, Chen Y (2022a) Minimax-optimal multi-agent RL in zero-sum Markov games with a generative model. Preprint, submitted October 12, https://arxiv.org/abs/2208.10458.Google Scholar
Li G, Cai C, Chen Y, Wei Y, Chi Y (2023) Is Q-learning minimax optimal? A tight sample complexity analysis. Oper. Res. Forthcoming.Google Scholar
Li G, Chen Y, Chi Y, Gu Y, Wei Y (2021a) Sample-efficient reinforcement learning is feasible for linearly realizable MDPs with limited revisiting. Adv. Neural Inform. Processing Systems 34 (NeurIPS, San Diego), 16671–16685.Google Scholar
Li G, Shi L, Chen Y, Chi Y, Wei Y (2022b) Settling the sample complexity of model-based offline reinforcement learning. Preprint, submitted April 11, https://arxiv.org/abs/2204.05275v1.Google Scholar
Li G, Shi L, Chen Y, Gu Y, Chi Y (2021b) Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning. Adv. Neural Inform. Processing Systems 34 (NeurIPS, San Diego), 17762–17776.Google Scholar
Li G, Wei Y, Chi Y, Gu Y, Chen Y (2020) Breaking the sample size barrier in model-based reinforcement learning with a generative model. Adv. Neural Inform. Processing Systems 33 (NeurIPS, San Diego), 12861–12872.Google Scholar
Li G, Wei Y, Chi Y, Gu Y, Chen Y (2022c) Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. IEEE Trans. Inform. Theory 68(1):448–473.Crossref, Google Scholar
Li G, Cai C, Chen Y, Gu Y, Wei Y, Chi Y (2021c) Tightening the dependence on horizon in the sample complexity of Q-learning. Internat. Conf. Machine Learn. (PMLR, New York), 6296–6306.Google Scholar
Ma C, Wang K, Chi Y, Chen Y (2020) Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution. Foundations Comput. Math. 20(3):451–632.Crossref, Google Scholar
Mou W, Li CJ, Wainwright MJ, Bartlett PL, Jordan MI (2020) On linear stochastic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration. Preprint, submitted April 9, https://arxiv.org/abs/2004.04719.Google Scholar
Pananjady A, Wainwright MJ (2019) Value function estimation in Markov reward processes: Instance-dependent ℓ∞-bounds for policy evaluation. Preprint, submitted September 19, https://arxiv.org/abs/1909.08749v1.Google Scholar
Qu G, Wierman A (2020) Finite-time analysis of asynchronous stochastic approximation and Q-learning. Conf. Learn. Theory, 3185–3205.Google Scholar
Shah D, Xie Q (2018) Q-learning with nearest neighbors. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 3111–3121.Google Scholar
Shi L, Li G, Wei Y, Chen Y, Chi Y (2022) Pessimistic Q-learning for offline reinforcement learning: Toward optimal sample complexity. Internat. Conf. Machine Learn. (PMLR, New York).Google Scholar
Sidford A, Wang M, Wu X, Ye Y (2018a) Variance reduced value iteration and faster algorithms for solving Markov decision processes. Proc. Twenty-Ninth Annual ACM-SIAM Sympos. Discrete Algorithms, 770–787.Google Scholar
Sidford A, Wang M, Wu X, Yang L, Ye Y (2018b) Near-optimal time and sample complexities for solving Markov decision processes with a generative model. Adv. Neural Inform. Processing Systems, 5186–5196.Google Scholar
Srikant R, Ying L (2019) Finite-time error bounds for linear stochastic approximation and TD learning. Conf. Learn. Theory (PMLR, New York), 2803–2830.Google Scholar
Strehl AL, Li L, Wiewiora E, Langford J, Littman ML (2006) PAC model-free reinforcement learning. Internat. Conf. Machine Learn. (PMLR, New York), 881–888.Google Scholar
Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
Szepesvári C (1998) The asymptotic convergence-rate of Q-learning. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 1064–1070.Google Scholar
Szepesvári C (2010) Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning (Springer, Cham, Switzerland).Crossref, Google Scholar
Tsitsiklis JN (1994) Asynchronous stochastic approximation and Q-learning. Machine Learn. 16(3):185–202.Crossref, Google Scholar
Tsitsiklis J, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control 42(5):674–690.Crossref, Google Scholar
Tu S, Recht B (2019) The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. Conf. Learn. Theory (PMLR, New York), 3036–3083.Google Scholar
Vershynin R (2018) High-Dimensional Probability: An Introduction with Applications in Data Science, vol. 47 (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
Wainwright MJ (2019a) Stochastic approximation with cone-contractive operators: Sharp ℓ∞-bounds for Q-learning. Preprint, submitted June 24, https://arxiv.org/abs/1905.06265v2.Google Scholar
Wainwright MJ (2019b) Variance-reduced Q-learning is minimax optimal. Preprint, submitted August 8, https://arxiv.org/abs/1906.04697.Google Scholar
Wang M (2020) Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time. Math. Oper. Res. 45(2):517–546.Link, Google Scholar
Wang B, Yan Y, Fan J (2021) Sample-efficient reinforcement learning for linearly-parameterized MDPs with a generative model. Neural Inform. Processing Systems (NeurIPS, San Diego), 23009–23022.Google Scholar
Xu P, Gu Q (2020) A finite-time analysis of Q-learning with neural network function approximation. Internat. Conf. Machine Learn. (PMLR, New York), 10555–10565.Google Scholar
Xu T, Zou S, Liang Y (2019) Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 10633–10643.Google Scholar
Yan Y, Chen Y, Fan J (2021) Inference for heteroskedastic PCA with missing data. Preprint, submitted July 26, https://arxiv.org/abs/2107.12365.Google Scholar
Yan Y, Li G, Chen Y, Fan J (2022a) Model-based reinforcement learning is minimax-optimal for offline zero-sum Markov games. Preprint, submitted June 8, https://arxiv.org/abs/2206.04044.Google Scholar
Yan Y, Li G, Chen Y, Fan J (2022b) The efficacy of pessimism in asynchronous Q-learning. Preprint, submitted March 14, https://arxiv.org/abs/2203.07368.Google Scholar
Yang L, Wang M (2019) Sample-optimal parametric Q-learning using linearly additive features. Internat. Conf. Machine Learn. (PMLR, New York), 6995–7004.Google Scholar
Yin M, Bai Y, Wang YX (2021) Near-optimal provable uniform convergence in offline policy evaluation for reinforcement learning. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 1567–1575.Google Scholar

Volume 72, Issue 1

January-February 2024

Pages iii-vi, 1-424, C2-C3

Article Information

Supplemental Material

Metrics

Information

Received:December 25, 2021
Accepted:March 16, 2023
Published Online:April 21, 2023

Cite as

Gen Li, Yuting Wei, Yuejie Chi, Yuxin Chen (2023) Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model. Operations Research 72(1):203-221.

https://doi.org/10.1287/opre.2023.2451

Keywords

Acknowledgments

The authors thank Qiwen Cui for pointing out an issue in an early version of this paper and Shicong Cen, Chen Cheng, and Cong Ma for numerous discussions. Partial results of this paper have been presented in Advances in Neural Information Processing Systems 33 (Li et al. 2020).

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model

References

Volume 72, Issue 1

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News