Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model
References
- (2020) Model-based reinforcement learning with a generative model is minimax optimal. Conf. Learn. Theory (COLT) (PMLR, New York), 67–83.Google Scholar
- (2012) On the sample complexity of reinforcement learning with a generative model. Preprint, submitted June 27, https://arxiv.org/abs/1206.6461.Google Scholar
- (2013) Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learn. 91(3):325–349.Crossref, Google Scholar
- (2017) Minimax regret bounds for reinforcement learning. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (PMLR, New York), 263–272.Google Scholar
- (2019) Provably efficient Q-learning with low switching cost. Adv. Neural Inform. Processing Systems 32 (NeurIPS, San Diego).Google Scholar
- (2012) Error bounds for constant step-size Q-learning. Systems Control Lett. 61(12):1203–1208.Crossref, Google Scholar
- (1952) On the theory of dynamic programming. Proc. Natl. Acad. Sci. USA 38(8):716–719.Crossref, Google Scholar
- (2017) Dynamic Programming and Optimal Control, 4th ed. (Athena Scientific, Nashua, NH).Google Scholar
- (2018) A finite time analysis of temporal difference learning with linear function approximation. Conf. Learn. Theory (COLT) (PMLR, New York), 1691–1692.Google Scholar
- (1996) Linear least-squares algorithms for temporal difference learning. Machine Learn. 22(1–3):33–57.Crossref, Google Scholar
- (2019) Neural temporal-difference learning converges to global optima. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 11312–11322.Google Scholar
- (2021) Spectral methods for data science: A statistical perspective. Foundations Trends Machine Learn. 14(5):566–806.Crossref, Google Scholar
- (2019a) Spectral method and regularized MLE are both optimal for top-K ranking. Ann. Statist. 47(4):2204–2235.Crossref, Google Scholar
- (2019b) Inference and uncertainty quantification for noisy matrix completion. Proc. Natl. Acad. Sci. USA 116(46):22931–22937.Crossref, Google Scholar
- (2020) Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes. Preprint, submitted October 21, https://arxiv.org/abs/2002.00874v4.Google Scholar
- (2021) Minimax sample complexity for turn-based stochastic game. Proc. Thirty-Seventh Conf. Uncertainty Artificial Intelligence, vol. 161 (PMLR, New York), 1496–1504.Google Scholar
- (2018) Finite sample analyses for TD(0) with function approximation. Thirty-Second AAAI Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA).Google Scholar
- (2021) Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited. Algorithmic Learn. Theory (PMLR, New York), 578–598.Google Scholar
- (2018) On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probab. Theory Related Fields 170:95–175.Crossref, Google Scholar
- (2003) Learning rates for Q-learning. J. Mach. Learn. Res. 5(December):1–25.Google Scholar
- (2019) A theoretical analysis of deep Q-learning. Preprint, submitted May 29, https://arxiv.org/abs/1901.00137v2.Google Scholar
- (2019) Finite-time performance bounds and adaptive learning rate selection for two time-scale reinforcement learning. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 4706–4715.Google Scholar
- (1994) Convergence of stochastic iterative dynamic programming algorithms. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 703–710.Google Scholar
- (2018) Is Q-learning provably efficient? Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 4863–4873.Google Scholar
- (2020) Provably efficient reinforcement learning with linear function approximation. Conf. Learn. Theory (PMLR, New York), 2137–2143.Google Scholar
- (2003) On the sample complexity of reinforcement learning. PhD thesis, University of London, London.Google Scholar
- (2020) Finite time analysis of linear two-timescale stochastic approximation with Markovian noise. Preprint, submitted February 4, https://arxiv.org/abs/2002.01268.Google Scholar
- (1999) Finite-sample convergence rates for Q-learning and indirect algorithms. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 996–1002.Google Scholar
- (2002) A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learn. 49(2–3):193–208.Crossref, Google Scholar
- (2020) Is temporal difference learning optimal? An instance-dependent analysis. Preprint, submitted March 16, https://arxiv.org/abs/2003.07337.Google Scholar
- (2018) Linear stochastic approximation: How far does constant step-size and iterate averaging go? Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 1347–1355.Google Scholar
- (2012) PAC bounds for discounted MDPs. Internat. Conf. Algorithmic Learn. Theory (Springer, Berlin), 320–334.Google Scholar
- (2022a) Minimax-optimal multi-agent RL in zero-sum Markov games with a generative model. Preprint, submitted October 12, https://arxiv.org/abs/2208.10458.Google Scholar
- (2023) Is Q-learning minimax optimal? A tight sample complexity analysis. Oper. Res. Forthcoming.Google Scholar
- (2021a) Sample-efficient reinforcement learning is feasible for linearly realizable MDPs with limited revisiting. Adv. Neural Inform. Processing Systems 34 (NeurIPS, San Diego), 16671–16685.Google Scholar
- (2022b) Settling the sample complexity of model-based offline reinforcement learning. Preprint, submitted April 11, https://arxiv.org/abs/2204.05275v1.Google Scholar
- (2021b) Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning. Adv. Neural Inform. Processing Systems 34 (NeurIPS, San Diego), 17762–17776.Google Scholar
- (2020) Breaking the sample size barrier in model-based reinforcement learning with a generative model. Adv. Neural Inform. Processing Systems 33 (NeurIPS, San Diego), 12861–12872.Google Scholar
- (2022c) Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. IEEE Trans. Inform. Theory 68(1):448–473.Crossref, Google Scholar
- (2021c) Tightening the dependence on horizon in the sample complexity of Q-learning. Internat. Conf. Machine Learn. (PMLR, New York), 6296–6306.Google Scholar
- (2020) Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution. Foundations Comput. Math. 20(3):451–632.Crossref, Google Scholar
- (2020) On linear stochastic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration. Preprint, submitted April 9, https://arxiv.org/abs/2004.04719.Google Scholar
- (2019) Value function estimation in Markov reward processes: Instance-dependent ℓ∞-bounds for policy evaluation. Preprint, submitted September 19, https://arxiv.org/abs/1909.08749v1.Google Scholar
- (2020) Finite-time analysis of asynchronous stochastic approximation and Q-learning. Conf. Learn. Theory, 3185–3205.Google Scholar
- (2018) Q-learning with nearest neighbors. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 3111–3121.Google Scholar
- (2022) Pessimistic Q-learning for offline reinforcement learning: Toward optimal sample complexity. Internat. Conf. Machine Learn. (PMLR, New York).Google Scholar
- (2018a) Variance reduced value iteration and faster algorithms for solving Markov decision processes. Proc. Twenty-Ninth Annual ACM-SIAM Sympos. Discrete Algorithms, 770–787.Google Scholar
- (2018b) Near-optimal time and sample complexities for solving Markov decision processes with a generative model. Adv. Neural Inform. Processing Systems, 5186–5196.Google Scholar
- (2019) Finite-time error bounds for linear stochastic approximation and TD learning. Conf. Learn. Theory (PMLR, New York), 2803–2830.Google Scholar
- (2006) PAC model-free reinforcement learning. Internat. Conf. Machine Learn. (PMLR, New York), 881–888.Google Scholar
- (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
- (1998) The asymptotic convergence-rate of Q-learning. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 1064–1070.Google Scholar
- (2010) Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning (Springer, Cham, Switzerland).Crossref, Google Scholar
- (1994) Asynchronous stochastic approximation and Q-learning. Machine Learn. 16(3):185–202.Crossref, Google Scholar
- (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control 42(5):674–690.Crossref, Google Scholar
- (2019) The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. Conf. Learn. Theory (PMLR, New York), 3036–3083.Google Scholar
- (2018) High-Dimensional Probability: An Introduction with Applications in Data Science, vol. 47 (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
- (2019a) Stochastic approximation with cone-contractive operators: Sharp ℓ∞-bounds for Q-learning. Preprint, submitted June 24, https://arxiv.org/abs/1905.06265v2.Google Scholar
- (2019b) Variance-reduced Q-learning is minimax optimal. Preprint, submitted August 8, https://arxiv.org/abs/1906.04697.Google Scholar
- (2020) Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time. Math. Oper. Res. 45(2):517–546.Link, Google Scholar
- (2021) Sample-efficient reinforcement learning for linearly-parameterized MDPs with a generative model. Neural Inform. Processing Systems (NeurIPS, San Diego), 23009–23022.Google Scholar
- (2020) A finite-time analysis of Q-learning with neural network function approximation. Internat. Conf. Machine Learn. (PMLR, New York), 10555–10565.Google Scholar
- (2019) Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. Adv. Neural Inform. Processing Systems (NeurIPS, San Diego), 10633–10643.Google Scholar
- (2021) Inference for heteroskedastic PCA with missing data. Preprint, submitted July 26, https://arxiv.org/abs/2107.12365.Google Scholar
- (2022a) Model-based reinforcement learning is minimax-optimal for offline zero-sum Markov games. Preprint, submitted June 8, https://arxiv.org/abs/2206.04044.Google Scholar
- (2022b) The efficacy of pessimism in asynchronous Q-learning. Preprint, submitted March 14, https://arxiv.org/abs/2203.07368.Google Scholar
- (2019) Sample-optimal parametric Q-learning using linearly additive features. Internat. Conf. Machine Learn. (PMLR, New York), 6995–7004.Google Scholar
- (2021) Near-optimal provable uniform convergence in offline policy evaluation for reinforcement learning. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 1567–1575.Google Scholar

