Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis

Published Online:https://doi.org/10.1287/opre.2023.2450

References

  • Agarwal A, Kakade S, Yang LF (2020) Model-based reinforcement learning with a generative model is minimax optimal. Proc. Conf. Learn. Theory (PMLR, New York), 67–83.Google Scholar
  • Azar MG, Munos R, Kappen HJ (2013) Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learn. 91(3):325–349.CrossrefGoogle Scholar
  • Azar MG, Munos R, Ghavamzadeh M, Kappen H (2011) Reinforcement learning with a near optimal rate of convergence. Technical report, INRIA.Google Scholar
  • Bai Y, Xie T, Jiang N, Wang YX (2019) Provably efficient Q-learning with low switching cost. Adv. Neural Inform. Processing Systems 33:8002–8011.Google Scholar
  • Beck CL, Srikant R (2012) Error bounds for constant step-size Q-learning. Systems Control Lett. 61(12):1203–1208.CrossrefGoogle Scholar
  • Bellman R (1952) On the theory of dynamic programming. Proc. Natl. Acad. Sci. USA 38(8):716–719.CrossrefGoogle Scholar
  • Bertsekas DP (2017) Dynamic Programming and Optimal Control, 4th ed. (Athena Scientific, Nashua, NH).Google Scholar
  • Bhandari J, Russo D, Singal R (2021) A finite time analysis of temporal difference learning with linear function approximation. Oper. Res. 69(3):950–973.LinkGoogle Scholar
  • Borkar VS, Meyn SP (2000) The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim. 38(2):447–469.CrossrefGoogle Scholar
  • Cai Q, Yang Z, Lee JD, Wang Z (2019) Neural temporal-difference and Q-learning converges to global optima. Adv. Neural Inform. Processing Systems 33:11312–11322.Google Scholar
  • Chen Z, Maguluri ST, Shakkottai S, Shanmugam K (2020) Finite-sample analysis of stochastic approximation using smooth convex envelopes. Preprint, submitted February 3, https://arxiv.org/abs/2002.00874.Google Scholar
  • Chen Z, Maguluri ST, Shakkottai S, Shanmugam K (2021) A Lyapunov theory for finite-sample guarantees of asynchronous Q-learning and TD-learning variants. Preprint, submitted February 2, https://arxiv.org/abs/2102.01567.Google Scholar
  • Chen Z, Zhang S, Doan TT, Maguluri ST, Clarke JP (2019) Performance of Q-learning with linear function approximation: Stability and finite-time analysis. Preprint, submitted May 27, https://arxiv.org/abs/1905.11425.Google Scholar
  • Devraj AM, Meyn SP (2020) Q-learning with uniformly bounded variance: Large discounting is not a barrier to fast learning. Preprint, submitted February 24, https://arxiv.org/abs/2002.10301.Google Scholar
  • Doan T, Maguluri S, Romberg J (2019) Finite-time analysis of distributed TD(0) with linear function approximation on multi-agent reinforcement learning. Internat. Conf. Machine Learn. (PMLR, New York), 1626–1635.Google Scholar
  • Even-Dar E, Mansour Y (2003) Learning rates for Q-learning. J. Machine Learn. Res. 5:1–25.Google Scholar
  • Fan J, Wang Z, Xie Y, Yang Z (2019) A theoretical analysis of deep Q-learning. Preprint, submitted January 1, https://arxiv.org/abs/1901.00137.Google Scholar
  • Freedman DA (1975) On tail probabilities for martingales. Ann. Probab. 3(1):100–118.CrossrefGoogle Scholar
  • Gupta H, Srikant R, Ying L (2019) Finite-time performance bounds and adaptive learning rate selection for two time-scale reinforcement learning. Adv. Neural Inform. Processing Systems 33:4706–4715.Google Scholar
  • Hasselt H (2010) Double Q-learning. Adv. Neural Inform. Processing Systems 23:2613–2621.Google Scholar
  • Hu J, Wellman MP (2003) Nash Q-learning for general-sum stochastic games. J. Machine Learn. Res. 4:1039–1069.Google Scholar
  • Jaakkola T, Jordan MI, Singh SP (1994) Convergence of stochastic iterative dynamic programming algorithms. Adv. Neural Inform. Processing Systems 6:703–710.Google Scholar
  • Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Adv. Neural Inform. Processing Systems 31:4863–4873.Google Scholar
  • Johnson R, Zhang T (2013) Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inform. Processing Systems 26:315–323.Google Scholar
  • Kakade S (2003) On the sample complexity of reinforcement learning. Unpublished PhD thesis, University of London, London.Google Scholar
  • Kearns MJ, Singh SP (1999) Finite-sample convergence rates for Q-learning and indirect algorithms. Adv. Neural Inform. Processing Systems 11:996–1002.Google Scholar
  • Kearns M, Mansour Y, Ng AY (2002) A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learn. 49(2–3):193–208.CrossrefGoogle Scholar
  • Khamaru K, Xia E, Wainwright MJ, Jordan MI (2021a) Instance-optimality in optimal value estimation: Adaptivity via variance-reduced Q-learning. Preprint, submitted June 28, https://arxiv.org/abs/2106.14352.Google Scholar
  • Khamaru K, Pananjady A, Ruan F, Wainwright MJ, Jordan MI (2021b) Is temporal difference learning optimal? An instance-dependent analysis. SIAM J. Math. Data Sci. 3(4):1013–1040.CrossrefGoogle Scholar
  • Lakshminarayanan C, Szepesvari C (2018) Linear stochastic approximation: How far does constant step-size and iterate averaging go? Internat. Conf. Artificial Intelligence Statist., 1347–1355.Google Scholar
  • Lee D, He N (2018) Stochastic primal-dual Q-learning. Preprint, submitted October 18, https://arxiv.org/abs/1810.08298.Google Scholar
  • Li G, Chi Y, Wei Y, Chen Y (2022a) Minimax-optimal multi-agent RL in Markov games with a generative model. Adv. Neural Inform. Processing Systems Forthcoming.Google Scholar
  • Li G, Shi L, Chen Y, Chi Y (2023a) Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning. Inform. Inference 12(2):969–1043.CrossrefGoogle Scholar
  • Li G, Wei Y, Chi Y, Chen Y (2023b) Breaking the sample size barrier in model-based reinforcement learning with a generative model. Oper. Res. Forthcoming.Google Scholar
  • Li G, Wei Y, Chi Y, Gu Y, Chen Y (2022b) Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. IEEE Trans. Inform. Theory 68(1):448–473.CrossrefGoogle Scholar
  • Mou W, Li CJ, Wainwright MJ, Bartlett PL, Jordan MI (2020) On linear stochastic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration. Preprint, submitted April 9, https://arxiv.org/abs/2004.04719.Google Scholar
  • Murphy S (2005) A generalization error for Q-learning. J. Machine Learn. Res. 6:1073–1097.Google Scholar
  • Pananjady A, Wainwright MJ (2020) Instance-dependent ℓ∞-bounds for policy evaluation in tabular reinforcement learning. IEEE Trans. Inform. Theory 67(1):566–585.CrossrefGoogle Scholar
  • Paulin D (2015) Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic J. Probab. 20:1–32.CrossrefGoogle Scholar
  • Polyak BT, Juditsky AB (1992) Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30(4):838–855.CrossrefGoogle Scholar
  • Qu G, Wierman A (2020) Finite-time analysis of asynchronous stochastic approximation and Q-learning. Conf. Learn. Theory 3185–3205.Google Scholar
  • Robbins H, Monro S (1951) A stochastic approximation method. Ann. Math. Statist. 22(3):400–407.CrossrefGoogle Scholar
  • Shah D, Xie Q (2018) Q-learning with nearest neighbors. Adv. Neural Inform. Processing Systems 32:3111–3121.Google Scholar
  • Shi L, Li G, Wei Y, Chen Y, Chi Y (2022) Pessimistic Q-learning for offline reinforcement learning: Toward optimal sample complexity. Internat. Conf. Machine Learn. (PMLR, New York).Google Scholar
  • Sidford A, Wang M, Wu X, Yang L, Ye Y (2018) Near-optimal time and sample complexities for solving Markov decision processes with a generative model. Adv. Neural Inform. Processing Systems 31:5186–5196.Google Scholar
  • Srikant R, Ying L (2019) Finite-time error bounds for linear stochastic approximation and TD learning. Conf. Learn. Theory, 2803–2830.Google Scholar
  • Sutton RS (1988) Learning to predict by the methods of temporal differences. Machine Learn. 3(1):9–44.CrossrefGoogle Scholar
  • Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
  • Szepesvári C (1998) The asymptotic convergence-rate of Q-learning. Adv. Neural Inform. Processing Systems 10:1064–1070.Google Scholar
  • Tsitsiklis JN (1994) Asynchronous stochastic approximation and Q-learning. Machine Learn. 16(3):185–202.CrossrefGoogle Scholar
  • Tsitsiklis J, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control 42(5):674–690.CrossrefGoogle Scholar
  • Tsybakov AB, Zaiats V (2009) Introduction to Nonparametric Estimation, vol. 11 (Springer, New York).CrossrefGoogle Scholar
  • Wai HT, Hong M, Yang Z, Wang Z, Tang K (2019) Variance reduced policy evaluation with smooth function approximation. Adv. Neural Inform. Processing Systems 32:5784–5795.Google Scholar
  • Wainwright M (2019a) High-Dimensional Statistics: A Non-Asymptotic Viewpoint, Cambridge Series in Statistical and Probabilistic Mathematics (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • Wainwright MJ (2019b) Stochastic approximation with cone-contractive operators: Sharp ℓ∞-bounds for Q-learning. Preprint, submitted May 15, https://arxiv.org/abs/1905.06265.Google Scholar
  • Wainwright MJ (2019c) Variance-reduced Q-learning is minimax optimal. Preprint, submitted June 11, https://arxiv.org/abs/1906.04697.Google Scholar
  • Watkins CJCH (1989) Learning from delayed rewards. PhD thesis, University of Cambridge, Cambridge, UK.Google Scholar
  • Watkins CJ, Dayan P (1992) Q-learning. Machine Learn. 8(3–4):279–292.CrossrefGoogle Scholar
  • Weng B, Xiong H, Zhao L, Liang Y, Zhang W (2020a) Momentum Q-learning with finite-sample convergence guarantee. Preprint, submitted April 9, https://arxiv.org/abs/2007.15418.Google Scholar
  • Weng W, Gupta H, He N, Ying L, Srikant R (2020b) The mean-squared error of double Q-learning. Adv. Neural Inform. Processing Systems 33:6815–6826.Google Scholar
  • Wu Y, Zhang W, Xu P, Gu Q (2020) A finite time analysis of two time-scale actor critic methods. Preprint, submitted May 4, https://arxiv.org/abs/2005.01350.Google Scholar
  • Xiong H, Zhao L, Liang Y, Zhang W (2020) Finite-time analysis for double Q-learning. Adv. Neural Inform. Processing Systems 33.Google Scholar
  • Xu P, Gu Q (2020) A finite-time analysis of Q-learning with neural network function approximation. Internat. Conf. Machine Learn. (PMLR, New York), 10555–10565.Google Scholar
  • Xu T, Zou S, Liang Y (2019a) Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. Adv. Neural Inform. Processing Systems 33:10633–10643.Google Scholar
  • Xu T, Wang Z, Zhou Y, Liang Y (2019b) Reanalysis of variance reduced temporal difference learning. Internat. Conf. Learn. Representations.Google Scholar
  • Yan Y, Li G, Chen Y, Fan J (2022) The efficacy of pessimism in asynchronous Q-learning. Preprint, submitted March 14, https://arxiv.org/abs/2203.07368.Google Scholar
  • Zhang Z, Zhou Y, Ji X (2020) Almost optimal model-free reinforcement learning via reference-advantage decomposition. Adv. Neural Inform. Processing Systems 33.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.