A Lyapunov Theory for Finite-Sample Guarantees of Markovian Stochastic Approximation
References
- (1922) Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamentals Math. 3(1):133–181.Crossref, Google Scholar
- (2019) Potential-function proofs for gradient methods. Theory Comput. 15(1):1–32.Google Scholar
- (2017) First-Order Methods in Optimization (SIAM, Philadelphia).Crossref, Google Scholar
- (2012) Smoothing and first order methods: A unified framework. SIAM J. Optim. 22(2):557–580.Crossref, Google Scholar
- (2013) Improved upper bounds on the expected error in constant step-size Q-learning. Proc. Amer. Control Conf. (IEEE, Piscataway, NJ), 1926–1931.Google Scholar
- (1996) A dynamical system approach to stochastic approximations. SIAM J. Control Optim. 34(2):437–472.Crossref, Google Scholar
- (2012) Adaptive Algorithms and Stochastic Approximations, vol. 22 (Springer Science & Business Media, Boston).Google Scholar
- (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
- (2018) A finite time analysis of temporal difference learning with linear function approximation. Proc. Conf. on Learning Theory, 1691–1692.Google Scholar
- (1997) Multiscale stochastic approximation for parametric optimization of hidden Markov models. Probability Engrg. Inform. Sci. 11(4):509–522.Crossref, Google Scholar
- (1998) A two timescale stochastic approximation scheme for simulation-based parametric optimization. Probability Engrg. Inform. Sci. 12(4):519–531.Crossref, Google Scholar
- (2009) Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48 (Springer, Berlin).Google Scholar
- (2000) The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim. 38(2):447–469.Crossref, Google Scholar
- (2018) Optimization methods for large-scale machine learning. SIAM Rev. 60(2):223–311.Crossref, Google Scholar
- (2020) Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems (Curran Associates, Inc., Red Hook, NY), 8223–8234.Google Scholar
- (2022) Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning. Automatica J. IFAC 146:110623.Crossref, Google Scholar
- (2018) Finite sample analysis for TD(0) with function approximation. Proc. 32nd AAAI Conf. on Artificial Intelligence (AAAI, Washington, DC).Google Scholar
- (2017) Zap Q-learning. Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Adv. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 2235–2244.Google Scholar
- (2021) Finite-time analysis and restarting scheme for linear two-time-scale stochastic approximation. SIAM J. Control Optim. 59(4):2798–2819.Crossref, Google Scholar
- (2023) Finite-time analysis of Markov gradient descent. IEEE Trans. Automated Controls. 68(4):2140–2153.Crossref, Google Scholar
- (2012) Ergodic mirror descent. SIAM J. Optim. 22(4):1549–1578.Crossref, Google Scholar
- (2018) IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. Proc. Internat. Conf. on Machine Learn. (PMLR, New York), 1407–1416.Google Scholar
- (2003) Learning rates for Q-learning. J. Machine Learn. Res. 5(Dec):1–25.Google Scholar
- (1989) Importance sampling for stochastic simulations. Management Sci. 35(11):1367–1392.Link, Google Scholar
- (2006) Boundedness of iterates in Q-learning. Systems Control Lett. 55(4):347–349.Crossref, Google Scholar
- (2015) On lower complexity bounds for large-scale smooth convex optimization. J. Complexity 31(1):1–14.Crossref, Google Scholar
- (2016) Q(λ) with off-policy corrections. Proc. Internat. Conf. on Algorithmic Learn. Theory (Springer, Berlin), 305–320.Google Scholar
- (1993) Convergence of stochastic iterative dynamic programming algorithms. Cowan J, Tesauro G, Alspector J, eds. Adv. Neural Inform. Processing Systems (Morgan-Kaufmann, Burlington, MA), 703–710.Google Scholar
- (2018) Is Q-learning provably efficient? Proc. 32nd Internat. Conf. on Neural Inform. Processing Systems (ACM, New York), 4868–4878.Google Scholar
- (2020) Finite time analysis of linear two-timescale stochastic approximation with Markovian noise. Proc. Conf. on Learn. Theory (PMLR, New York), 2144–2203.Google Scholar
- (2021) Stochastic approximation with iterate-dependent Markov noise under verifiable conditions in compact state space with the stability of iterates not ensured. IEEE Trans. Automated Control. 66(12):5941–5954.Crossref, Google Scholar
- (2021) Finite-sample analysis of off-policy natural actor-critic algorithm. Proc. Internat. Conf. on Machine Learn. (PMLR, New York), 5420–5431.Google Scholar
- (1999) Actor-critic algorithms. Solla S, Leen T, Müller K, eds. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1008–1014.Google Scholar
- (2010) Stochastic approximation: A survey. Wiley Interdisciplinary Rev. Comput. Statist. 2(1):87–96.Crossref, Google Scholar
- (2012) Stochastic Approximation Methods for Constrained and Unconstrained Systems, vol. 26 (Springer Science & Business Media, Boston).Google Scholar
- (2019) Torchbeast: A PyTorch platform for distributed RL. Preprint, submitted October 8, https://arxiv.org/abs/1910.03552.Google Scholar
- (2020) First-Order and Stochastic Optimization Methods for Machine Learning (Springer, Berlin).Crossref, Google Scholar
- (1997) Linear Algebra. Pure and Applied Mathematics: A Wiley Series of Texts, Monographs and Tracts (Wiley, New York).Google Scholar
- (2017) Markov Chains and Mixing Times, vol. 107 (American Mathematical Society, Providence, RI).Crossref, Google Scholar
- (2020) Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. Adv. Neural Inform. Processing Systems 33:7031–7043.Google Scholar
- (1977) Analysis of recursive stochastic algorithms. IEEE Trans. Automated Control 22(4):551–575.Crossref, Google Scholar
- (2018) Learning to navigate in cities without a map. Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 2424–2435.Google Scholar
- (2011) Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Adv. Neural Inform. Processing Systems 24:451–459.Google Scholar
- (2016) Safe and efficient off-policy reinforcement learning. Proc. 30th Internat. Conf. on Neural Inform. Processing Systems (ACM, New York), 1054–1062.Google Scholar
- (2009) Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4):1574–1609.Crossref, Google Scholar
- (2000) Eligibility traces for off-policy policy evaluation. Proc. 17th Internat. Conf. on Machine Learn. (ACM, New York), 759–766.Google Scholar
- (2020) Finite-time analysis of asynchronous stochastic approximation and Q-learning. Proc. Conf. on Learn. Theory (PMLR, New York), 3185–3205.Google Scholar
- (1951) A stochastic approximation method. Ann. Math. Statist. 22(3):400–407.Crossref, Google Scholar
- (2016) Primer on monotone operator methods. Appl. Comput. Math. 15(1):3–43.Google Scholar
- (1996) Reinforcement learning with replacing eligibility traces. Machine Learn. 22(1):123–158.Crossref, Google Scholar
- (2019) Finite-time error bounds for linear stochastic approximation and TD learning. Proc. Conf. on Learn. Theory (PMLR, New York), 2803–2830.Google Scholar
- (1988) Learning to predict by the methods of temporal differences. Machine Learn. 3(1):9–44.Crossref, Google Scholar
- (1999) Open theoretical questions in reinforcement learning. Proc. Eur. Conf. on Comput. Learn. Theory (Springer, Berlin), 11–17.Google Scholar
- (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
- (1997) The asymptotic convergence-rate of Q-learning. Jordan M, Kearns M, Solla S, eds. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1064–1070.Google Scholar
- (2019) A concentration bound for stochastic approximation via Alekseev’s formula. Stochastic Systems 9(1):1–26.Link, Google Scholar
- (1994) Asynchronous stochastic approximation and Q-learning. Machine Learn. 16(3):185–202.Crossref, Google Scholar
- (1997) Analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control 42(5):674–690.Crossref, Google Scholar
- (1999) Average cost temporal-difference learning. Automatica J. IFAC 35(11):1799–1808.Crossref, Google Scholar
- (2018) High-Dimensional Probability: An Introduction with Applications in Data Science, vol. 47 (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
- (2019) Stochastic approximation with cone-contractive operators: Sharp ℓ∞-bounds for Q-learning. Preprint, submitted May 15, https://arxiv.org/abs/1905.06265.Google Scholar
- (1989) Learning from delayed rewards. PhD thesis, King’s College, University of Cambridge, Cambridge, UK.Google Scholar
- (1992) Q-learning. Machine Learn. 8(3–4):279–292.Crossref, Google Scholar
- (2019) Analysis of stochastic approximation schemes with set-valued maps in the absence of a stability guarantee and their stabilization. IEEE Trans. Automated Control 65(3):1100–1115.Crossref, Google Scholar
- (2021) Finite-time analysis of decentralized stochastic approximation with applications in multi-agent and multi-task learning. Proc. 60th IEEE Conf. Decision and Control (IEEE, Piscataway, NJ).Google Scholar

