Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis
References
- (2020) Model-based reinforcement learning with a generative model is minimax optimal. Proc. Conf. Learn. Theory (PMLR, New York), 67–83.Google Scholar
- (2013) Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learn. 91(3):325–349.Crossref, Google Scholar
- (2011) Reinforcement learning with a near optimal rate of convergence. Technical report, INRIA.Google Scholar
- (2019) Provably efficient Q-learning with low switching cost. Adv. Neural Inform. Processing Systems 33:8002–8011.Google Scholar
- (2012) Error bounds for constant step-size Q-learning. Systems Control Lett. 61(12):1203–1208.Crossref, Google Scholar
- (1952) On the theory of dynamic programming. Proc. Natl. Acad. Sci. USA 38(8):716–719.Crossref, Google Scholar
- (2017) Dynamic Programming and Optimal Control, 4th ed. (Athena Scientific, Nashua, NH).Google Scholar
- (2021) A finite time analysis of temporal difference learning with linear function approximation. Oper. Res. 69(3):950–973.Link, Google Scholar
- (2000) The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim. 38(2):447–469.Crossref, Google Scholar
- (2019) Neural temporal-difference and Q-learning converges to global optima. Adv. Neural Inform. Processing Systems 33:11312–11322.Google Scholar
- (2020) Finite-sample analysis of stochastic approximation using smooth convex envelopes. Preprint, submitted February 3, https://arxiv.org/abs/2002.00874.Google Scholar
- (2021) A Lyapunov theory for finite-sample guarantees of asynchronous Q-learning and TD-learning variants. Preprint, submitted February 2, https://arxiv.org/abs/2102.01567.Google Scholar
- (2019) Performance of Q-learning with linear function approximation: Stability and finite-time analysis. Preprint, submitted May 27, https://arxiv.org/abs/1905.11425.Google Scholar
- (2020) Q-learning with uniformly bounded variance: Large discounting is not a barrier to fast learning. Preprint, submitted February 24, https://arxiv.org/abs/2002.10301.Google Scholar
- (2019) Finite-time analysis of distributed TD(0) with linear function approximation on multi-agent reinforcement learning. Internat. Conf. Machine Learn. (PMLR, New York), 1626–1635.Google Scholar
- (2003) Learning rates for Q-learning. J. Machine Learn. Res. 5:1–25.Google Scholar
- (2019) A theoretical analysis of deep Q-learning. Preprint, submitted January 1, https://arxiv.org/abs/1901.00137.Google Scholar
- (1975) On tail probabilities for martingales. Ann. Probab. 3(1):100–118.Crossref, Google Scholar
- (2019) Finite-time performance bounds and adaptive learning rate selection for two time-scale reinforcement learning. Adv. Neural Inform. Processing Systems 33:4706–4715.Google Scholar
- (2010) Double Q-learning. Adv. Neural Inform. Processing Systems 23:2613–2621.Google Scholar
- (2003) Nash Q-learning for general-sum stochastic games. J. Machine Learn. Res. 4:1039–1069.Google Scholar
- (1994) Convergence of stochastic iterative dynamic programming algorithms. Adv. Neural Inform. Processing Systems 6:703–710.Google Scholar
- (2018) Is Q-learning provably efficient? Adv. Neural Inform. Processing Systems 31:4863–4873.Google Scholar
- (2013) Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inform. Processing Systems 26:315–323.Google Scholar
- (2003) On the sample complexity of reinforcement learning. Unpublished PhD thesis, University of London, London.Google Scholar
- (1999) Finite-sample convergence rates for Q-learning and indirect algorithms. Adv. Neural Inform. Processing Systems 11:996–1002.Google Scholar
- (2002) A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learn. 49(2–3):193–208.Crossref, Google Scholar
- (2021a) Instance-optimality in optimal value estimation: Adaptivity via variance-reduced Q-learning. Preprint, submitted June 28, https://arxiv.org/abs/2106.14352.Google Scholar
- (2021b) Is temporal difference learning optimal? An instance-dependent analysis. SIAM J. Math. Data Sci. 3(4):1013–1040.Crossref, Google Scholar
- (2018) Linear stochastic approximation: How far does constant step-size and iterate averaging go? Internat. Conf. Artificial Intelligence Statist., 1347–1355.Google Scholar
- (2018) Stochastic primal-dual Q-learning. Preprint, submitted October 18, https://arxiv.org/abs/1810.08298.Google Scholar
- (2022a) Minimax-optimal multi-agent RL in Markov games with a generative model. Adv. Neural Inform. Processing Systems Forthcoming.Google Scholar
- (2023a) Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning. Inform. Inference 12(2):969–1043.Crossref, Google Scholar
- (2023b) Breaking the sample size barrier in model-based reinforcement learning with a generative model. Oper. Res. Forthcoming.Google Scholar
- (2022b) Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. IEEE Trans. Inform. Theory 68(1):448–473.Crossref, Google Scholar
- (2020) On linear stochastic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration. Preprint, submitted April 9, https://arxiv.org/abs/2004.04719.Google Scholar
- (2005) A generalization error for Q-learning. J. Machine Learn. Res. 6:1073–1097.Google Scholar
- (2020) Instance-dependent ℓ∞-bounds for policy evaluation in tabular reinforcement learning. IEEE Trans. Inform. Theory 67(1):566–585.Crossref, Google Scholar
- (2015) Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic J. Probab. 20:1–32.Crossref, Google Scholar
- (1992) Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30(4):838–855.Crossref, Google Scholar
- (2020) Finite-time analysis of asynchronous stochastic approximation and Q-learning. Conf. Learn. Theory 3185–3205.Google Scholar
- (1951) A stochastic approximation method. Ann. Math. Statist. 22(3):400–407.Crossref, Google Scholar
- (2018) Q-learning with nearest neighbors. Adv. Neural Inform. Processing Systems 32:3111–3121.Google Scholar
- (2022) Pessimistic Q-learning for offline reinforcement learning: Toward optimal sample complexity. Internat. Conf. Machine Learn. (PMLR, New York).Google Scholar
- (2018) Near-optimal time and sample complexities for solving Markov decision processes with a generative model. Adv. Neural Inform. Processing Systems 31:5186–5196.Google Scholar
- (2019) Finite-time error bounds for linear stochastic approximation and TD learning. Conf. Learn. Theory, 2803–2830.Google Scholar
- (1988) Learning to predict by the methods of temporal differences. Machine Learn. 3(1):9–44.Crossref, Google Scholar
- (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
- (1998) The asymptotic convergence-rate of Q-learning. Adv. Neural Inform. Processing Systems 10:1064–1070.Google Scholar
- (1994) Asynchronous stochastic approximation and Q-learning. Machine Learn. 16(3):185–202.Crossref, Google Scholar
- (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control 42(5):674–690.Crossref, Google Scholar
- (2009) Introduction to Nonparametric Estimation, vol. 11 (Springer, New York).Crossref, Google Scholar
- (2019) Variance reduced policy evaluation with smooth function approximation. Adv. Neural Inform. Processing Systems 32:5784–5795.Google Scholar
- (2019a) High-Dimensional Statistics: A Non-Asymptotic Viewpoint, Cambridge Series in Statistical and Probabilistic Mathematics (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
- (2019b) Stochastic approximation with cone-contractive operators: Sharp ℓ∞-bounds for Q-learning. Preprint, submitted May 15, https://arxiv.org/abs/1905.06265.Google Scholar
- (2019c) Variance-reduced Q-learning is minimax optimal. Preprint, submitted June 11, https://arxiv.org/abs/1906.04697.Google Scholar
- (1989) Learning from delayed rewards. PhD thesis, University of Cambridge, Cambridge, UK.Google Scholar
- (1992) Q-learning. Machine Learn. 8(3–4):279–292.Crossref, Google Scholar
- (2020a) Momentum Q-learning with finite-sample convergence guarantee. Preprint, submitted April 9, https://arxiv.org/abs/2007.15418.Google Scholar
- (2020b) The mean-squared error of double Q-learning. Adv. Neural Inform. Processing Systems 33:6815–6826.Google Scholar
- (2020) A finite time analysis of two time-scale actor critic methods. Preprint, submitted May 4, https://arxiv.org/abs/2005.01350.Google Scholar
- (2020) Finite-time analysis for double Q-learning. Adv. Neural Inform. Processing Systems 33.Google Scholar
- (2020) A finite-time analysis of Q-learning with neural network function approximation. Internat. Conf. Machine Learn. (PMLR, New York), 10555–10565.Google Scholar
- (2019a) Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. Adv. Neural Inform. Processing Systems 33:10633–10643.Google Scholar
- (2019b) Reanalysis of variance reduced temporal difference learning. Internat. Conf. Learn. Representations.Google Scholar
- (2022) The efficacy of pessimism in asynchronous Q-learning. Preprint, submitted March 14, https://arxiv.org/abs/2203.07368.Google Scholar
- (2020) Almost optimal model-free reinforcement learning via reference-advantage decomposition. Adv. Neural Inform. Processing Systems 33.Google Scholar

