Open Access

Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis

Gen Li
Gen Li
[email protected]
Department of Statistics and Data Science, Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104;
Search for more papers by this author
,
Changxiao Cai
Changxiao Cai
[email protected]
Department of Biostatistics, University of Pennsylvania, Philadelphia, Pennsylvania 19104;
Search for more papers by this author
,
Yuxin Chen
Yuxin Chen
[email protected]
https://orcid.org/0000-0001-9256-5815
Department of Statistics and Data Science, Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104;
Search for more papers by this author
,
Yuting Wei
Corresponding Author
Yuting Wei
[email protected]
https://orcid.org/0000-0003-1488-4647
Department of Statistics and Data Science, Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104;
Search for more papers by this author
,
Yuejie Chi
Yuejie Chi
[email protected]
https://orcid.org/0000-0002-6766-5459
Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
Search for more papers by this author

Department of Statistics and Data Science, Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104;

Search for more papers by this author

Changxiao Cai

[email protected]

Department of Biostatistics, University of Pennsylvania, Philadelphia, Pennsylvania 19104;

Search for more papers by this author

Yuxin Chen

[email protected]

https://orcid.org/0000-0001-9256-5815

Department of Statistics and Data Science, Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104;

Search for more papers by this author

Yuting Wei

Corresponding Author

Yuting Wei

[email protected]

https://orcid.org/0000-0003-1488-4647

Department of Statistics and Data Science, Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104;

Search for more papers by this author

Yuejie Chi

[email protected]

https://orcid.org/0000-0002-6766-5459

Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213

Search for more papers by this author

Published Online:21 Apr 2023https://doi.org/10.1287/opre.2023.2450

References

Agarwal A, Kakade S, Yang LF (2020) Model-based reinforcement learning with a generative model is minimax optimal. Proc. Conf. Learn. Theory (PMLR, New York), 67–83.Google Scholar
Azar MG, Munos R, Kappen HJ (2013) Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learn. 91(3):325–349.Crossref, Google Scholar
Azar MG, Munos R, Ghavamzadeh M, Kappen H (2011) Reinforcement learning with a near optimal rate of convergence. Technical report, INRIA.Google Scholar
Bai Y, Xie T, Jiang N, Wang YX (2019) Provably efficient Q-learning with low switching cost. Adv. Neural Inform. Processing Systems 33:8002–8011.Google Scholar
Beck CL, Srikant R (2012) Error bounds for constant step-size Q-learning. Systems Control Lett. 61(12):1203–1208.Crossref, Google Scholar
Bellman R (1952) On the theory of dynamic programming. Proc. Natl. Acad. Sci. USA 38(8):716–719.Crossref, Google Scholar
Bertsekas DP (2017) Dynamic Programming and Optimal Control, 4th ed. (Athena Scientific, Nashua, NH).Google Scholar
Bhandari J, Russo D, Singal R (2021) A finite time analysis of temporal difference learning with linear function approximation. Oper. Res. 69(3):950–973.Link, Google Scholar
Borkar VS, Meyn SP (2000) The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim. 38(2):447–469.Crossref, Google Scholar
Cai Q, Yang Z, Lee JD, Wang Z (2019) Neural temporal-difference and Q-learning converges to global optima. Adv. Neural Inform. Processing Systems 33:11312–11322.Google Scholar
Chen Z, Maguluri ST, Shakkottai S, Shanmugam K (2020) Finite-sample analysis of stochastic approximation using smooth convex envelopes. Preprint, submitted February 3, https://arxiv.org/abs/2002.00874.Google Scholar
Chen Z, Maguluri ST, Shakkottai S, Shanmugam K (2021) A Lyapunov theory for finite-sample guarantees of asynchronous Q-learning and TD-learning variants. Preprint, submitted February 2, https://arxiv.org/abs/2102.01567.Google Scholar
Chen Z, Zhang S, Doan TT, Maguluri ST, Clarke JP (2019) Performance of Q-learning with linear function approximation: Stability and finite-time analysis. Preprint, submitted May 27, https://arxiv.org/abs/1905.11425.Google Scholar
Devraj AM, Meyn SP (2020) Q-learning with uniformly bounded variance: Large discounting is not a barrier to fast learning. Preprint, submitted February 24, https://arxiv.org/abs/2002.10301.Google Scholar
Doan T, Maguluri S, Romberg J (2019) Finite-time analysis of distributed TD(0) with linear function approximation on multi-agent reinforcement learning. Internat. Conf. Machine Learn. (PMLR, New York), 1626–1635.Google Scholar
Even-Dar E, Mansour Y (2003) Learning rates for Q-learning. J. Machine Learn. Res. 5:1–25.Google Scholar
Fan J, Wang Z, Xie Y, Yang Z (2019) A theoretical analysis of deep Q-learning. Preprint, submitted January 1, https://arxiv.org/abs/1901.00137.Google Scholar
Freedman DA (1975) On tail probabilities for martingales. Ann. Probab. 3(1):100–118.Crossref, Google Scholar
Gupta H, Srikant R, Ying L (2019) Finite-time performance bounds and adaptive learning rate selection for two time-scale reinforcement learning. Adv. Neural Inform. Processing Systems 33:4706–4715.Google Scholar
Hasselt H (2010) Double Q-learning. Adv. Neural Inform. Processing Systems 23:2613–2621.Google Scholar
Hu J, Wellman MP (2003) Nash Q-learning for general-sum stochastic games. J. Machine Learn. Res. 4:1039–1069.Google Scholar
Jaakkola T, Jordan MI, Singh SP (1994) Convergence of stochastic iterative dynamic programming algorithms. Adv. Neural Inform. Processing Systems 6:703–710.Google Scholar
Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Adv. Neural Inform. Processing Systems 31:4863–4873.Google Scholar
Johnson R, Zhang T (2013) Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inform. Processing Systems 26:315–323.Google Scholar
Kakade S (2003) On the sample complexity of reinforcement learning. Unpublished PhD thesis, University of London, London.Google Scholar
Kearns MJ, Singh SP (1999) Finite-sample convergence rates for Q-learning and indirect algorithms. Adv. Neural Inform. Processing Systems 11:996–1002.Google Scholar
Kearns M, Mansour Y, Ng AY (2002) A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learn. 49(2–3):193–208.Crossref, Google Scholar
Khamaru K, Xia E, Wainwright MJ, Jordan MI (2021a) Instance-optimality in optimal value estimation: Adaptivity via variance-reduced Q-learning. Preprint, submitted June 28, https://arxiv.org/abs/2106.14352.Google Scholar
Khamaru K, Pananjady A, Ruan F, Wainwright MJ, Jordan MI (2021b) Is temporal difference learning optimal? An instance-dependent analysis. SIAM J. Math. Data Sci. 3(4):1013–1040.Crossref, Google Scholar
Lakshminarayanan C, Szepesvari C (2018) Linear stochastic approximation: How far does constant step-size and iterate averaging go? Internat. Conf. Artificial Intelligence Statist., 1347–1355.Google Scholar
Lee D, He N (2018) Stochastic primal-dual Q-learning. Preprint, submitted October 18, https://arxiv.org/abs/1810.08298.Google Scholar
Li G, Chi Y, Wei Y, Chen Y (2022a) Minimax-optimal multi-agent RL in Markov games with a generative model. Adv. Neural Inform. Processing Systems Forthcoming.Google Scholar
Li G, Shi L, Chen Y, Chi Y (2023a) Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning. Inform. Inference 12(2):969–1043.Crossref, Google Scholar
Li G, Wei Y, Chi Y, Chen Y (2023b) Breaking the sample size barrier in model-based reinforcement learning with a generative model. Oper. Res. Forthcoming.Google Scholar
Li G, Wei Y, Chi Y, Gu Y, Chen Y (2022b) Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. IEEE Trans. Inform. Theory 68(1):448–473.Crossref, Google Scholar
Mou W, Li CJ, Wainwright MJ, Bartlett PL, Jordan MI (2020) On linear stochastic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration. Preprint, submitted April 9, https://arxiv.org/abs/2004.04719.Google Scholar
Murphy S (2005) A generalization error for Q-learning. J. Machine Learn. Res. 6:1073–1097.Google Scholar
Pananjady A, Wainwright MJ (2020) Instance-dependent ℓ∞-bounds for policy evaluation in tabular reinforcement learning. IEEE Trans. Inform. Theory 67(1):566–585.Crossref, Google Scholar
Paulin D (2015) Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic J. Probab. 20:1–32.Crossref, Google Scholar
Polyak BT, Juditsky AB (1992) Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30(4):838–855.Crossref, Google Scholar
Qu G, Wierman A (2020) Finite-time analysis of asynchronous stochastic approximation and Q-learning. Conf. Learn. Theory 3185–3205.Google Scholar
Robbins H, Monro S (1951) A stochastic approximation method. Ann. Math. Statist. 22(3):400–407.Crossref, Google Scholar
Shah D, Xie Q (2018) Q-learning with nearest neighbors. Adv. Neural Inform. Processing Systems 32:3111–3121.Google Scholar
Shi L, Li G, Wei Y, Chen Y, Chi Y (2022) Pessimistic Q-learning for offline reinforcement learning: Toward optimal sample complexity. Internat. Conf. Machine Learn. (PMLR, New York).Google Scholar
Sidford A, Wang M, Wu X, Yang L, Ye Y (2018) Near-optimal time and sample complexities for solving Markov decision processes with a generative model. Adv. Neural Inform. Processing Systems 31:5186–5196.Google Scholar
Srikant R, Ying L (2019) Finite-time error bounds for linear stochastic approximation and TD learning. Conf. Learn. Theory, 2803–2830.Google Scholar
Sutton RS (1988) Learning to predict by the methods of temporal differences. Machine Learn. 3(1):9–44.Crossref, Google Scholar
Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
Szepesvári C (1998) The asymptotic convergence-rate of Q-learning. Adv. Neural Inform. Processing Systems 10:1064–1070.Google Scholar
Tsitsiklis JN (1994) Asynchronous stochastic approximation and Q-learning. Machine Learn. 16(3):185–202.Crossref, Google Scholar
Tsitsiklis J, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control 42(5):674–690.Crossref, Google Scholar
Tsybakov AB, Zaiats V (2009) Introduction to Nonparametric Estimation, vol. 11 (Springer, New York).Crossref, Google Scholar
Wai HT, Hong M, Yang Z, Wang Z, Tang K (2019) Variance reduced policy evaluation with smooth function approximation. Adv. Neural Inform. Processing Systems 32:5784–5795.Google Scholar
Wainwright M (2019a) High-Dimensional Statistics: A Non-Asymptotic Viewpoint, Cambridge Series in Statistical and Probabilistic Mathematics (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
Wainwright MJ (2019b) Stochastic approximation with cone-contractive operators: Sharp ℓ∞-bounds for Q-learning. Preprint, submitted May 15, https://arxiv.org/abs/1905.06265.Google Scholar
Wainwright MJ (2019c) Variance-reduced Q-learning is minimax optimal. Preprint, submitted June 11, https://arxiv.org/abs/1906.04697.Google Scholar
Watkins CJCH (1989) Learning from delayed rewards. PhD thesis, University of Cambridge, Cambridge, UK.Google Scholar
Watkins CJ, Dayan P (1992) Q-learning. Machine Learn. 8(3–4):279–292.Crossref, Google Scholar
Weng B, Xiong H, Zhao L, Liang Y, Zhang W (2020a) Momentum Q-learning with finite-sample convergence guarantee. Preprint, submitted April 9, https://arxiv.org/abs/2007.15418.Google Scholar
Weng W, Gupta H, He N, Ying L, Srikant R (2020b) The mean-squared error of double Q-learning. Adv. Neural Inform. Processing Systems 33:6815–6826.Google Scholar
Wu Y, Zhang W, Xu P, Gu Q (2020) A finite time analysis of two time-scale actor critic methods. Preprint, submitted May 4, https://arxiv.org/abs/2005.01350.Google Scholar
Xiong H, Zhao L, Liang Y, Zhang W (2020) Finite-time analysis for double Q-learning. Adv. Neural Inform. Processing Systems 33.Google Scholar
Xu P, Gu Q (2020) A finite-time analysis of Q-learning with neural network function approximation. Internat. Conf. Machine Learn. (PMLR, New York), 10555–10565.Google Scholar
Xu T, Zou S, Liang Y (2019a) Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. Adv. Neural Inform. Processing Systems 33:10633–10643.Google Scholar
Xu T, Wang Z, Zhou Y, Liang Y (2019b) Reanalysis of variance reduced temporal difference learning. Internat. Conf. Learn. Representations.Google Scholar
Yan Y, Li G, Chen Y, Fan J (2022) The efficacy of pessimism in asynchronous Q-learning. Preprint, submitted March 14, https://arxiv.org/abs/2203.07368.Google Scholar
Zhang Z, Zhou Y, Ji X (2020) Almost optimal model-free reinforcement learning via reference-advantage decomposition. Adv. Neural Inform. Processing Systems 33.Google Scholar

Volume 72, Issue 1

January-February 2024

Pages iii-vi, 1-424, C2-C3

Article Information

Supplemental Material

Metrics

Information

Received:November 27, 2021
Accepted:March 16, 2023
Published Online:April 21, 2023

Cite as

Gen Li, Changxiao Cai, Yuxin Chen, Yuting Wei, Yuejie Chi (2023) Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis. Operations Research 72(1):222-236.

https://doi.org/10.1287/opre.2023.2450

Keywords

Acknowledgments

The authors are grateful to Laixi Shi for helpful discussions about the lower bound and thank Shaocong Ma for pointing out some errors in an early version of this work. Part of this work was done when G. Li, Y. Chen, and Y. Wei were visiting the Simons Institute for the Theory of Computing.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis

References

Volume 72, Issue 1

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News