A Lyapunov Theory for Finite-Sample Guarantees of Markovian Stochastic Approximation

Published Online:https://doi.org/10.1287/opre.2022.0249

References

  • Banach S (1922) Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamentals Math. 3(1):133–181.CrossrefGoogle Scholar
  • Bansal N, Gupta A (2019) Potential-function proofs for gradient methods. Theory Comput. 15(1):1–32.Google Scholar
  • Beck A (2017) First-Order Methods in Optimization (SIAM, Philadelphia).CrossrefGoogle Scholar
  • Beck A, Teboulle M (2012) Smoothing and first order methods: A unified framework. SIAM J. Optim. 22(2):557–580.CrossrefGoogle Scholar
  • Beck CL, Srikant R (2013) Improved upper bounds on the expected error in constant step-size Q-learning. Proc. Amer. Control Conf. (IEEE, Piscataway, NJ), 1926–1931.Google Scholar
  • Benaim M (1996) A dynamical system approach to stochastic approximations. SIAM J. Control Optim. 34(2):437–472.CrossrefGoogle Scholar
  • Benveniste A, Métivier M, Priouret P (2012) Adaptive Algorithms and Stochastic Approximations, vol. 22 (Springer Science & Business Media, Boston).Google Scholar
  • Bertsekas DP, Tsitsiklis JN (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
  • Bhandari J, Russo D, Singal R (2018) A finite time analysis of temporal difference learning with linear function approximation. Proc. Conf. on Learning Theory, 1691–1692.Google Scholar
  • Bhatnagar S, Borkar VS (1997) Multiscale stochastic approximation for parametric optimization of hidden Markov models. Probability Engrg. Inform. Sci. 11(4):509–522.CrossrefGoogle Scholar
  • Bhatnagar S, Borkar VS (1998) A two timescale stochastic approximation scheme for simulation-based parametric optimization. Probability Engrg. Inform. Sci. 12(4):519–531.CrossrefGoogle Scholar
  • Borkar VS (2009) Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48 (Springer, Berlin).Google Scholar
  • Borkar VS, Meyn SP (2000) The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim. 38(2):447–469.CrossrefGoogle Scholar
  • Bottou L, Curtis FE, Nocedal J (2018) Optimization methods for large-scale machine learning. SIAM Rev. 60(2):223–311.CrossrefGoogle Scholar
  • Chen Z, Maguluri ST, Shakkottai S, Shanmugam K (2020) Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems (Curran Associates, Inc., Red Hook, NY), 8223–8234.Google Scholar
  • Chen Z, Zhang S, Doan TT, Clarke JP, Maguluri ST (2022) Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning. Automatica J. IFAC 146:110623.CrossrefGoogle Scholar
  • Dalal G, Szörényi B, Thoppe G, Mannor S (2018) Finite sample analysis for TD(0) with function approximation. Proc. 32nd AAAI Conf. on Artificial Intelligence (AAAI, Washington, DC).Google Scholar
  • Devraj AM, Meyn S (2017) Zap Q-learning. Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Adv. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 2235–2244.Google Scholar
  • Doan TT (2021) Finite-time analysis and restarting scheme for linear two-time-scale stochastic approximation. SIAM J. Control Optim. 59(4):2798–2819.CrossrefGoogle Scholar
  • Doan TT (2023) Finite-time analysis of Markov gradient descent. IEEE Trans. Automated Controls. 68(4):2140–2153.CrossrefGoogle Scholar
  • Duchi JC, Agarwal A, Johansson M, Jordan MI (2012) Ergodic mirror descent. SIAM J. Optim. 22(4):1549–1578.CrossrefGoogle Scholar
  • Espeholt L, Soyer H, Munos R, Simonyan K, Mnih V, Ward T, Doron Y, et al. (2018) IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. Proc. Internat. Conf. on Machine Learn. (PMLR, New York), 1407–1416.Google Scholar
  • Even-Dar E, Mansour Y (2003) Learning rates for Q-learning. J. Machine Learn. Res. 5(Dec):1–25.Google Scholar
  • Glynn PW, Iglehart DL (1989) Importance sampling for stochastic simulations. Management Sci. 35(11):1367–1392.LinkGoogle Scholar
  • Gosavi A (2006) Boundedness of iterates in Q-learning. Systems Control Lett. 55(4):347–349.CrossrefGoogle Scholar
  • Guzmán C, Nemirovski A (2015) On lower complexity bounds for large-scale smooth convex optimization. J. Complexity 31(1):1–14.CrossrefGoogle Scholar
  • Harutyunyan A, Bellemare MG, Stepleton T, Munos R (2016) Q(λ) with off-policy corrections. Proc. Internat. Conf. on Algorithmic Learn. Theory (Springer, Berlin), 305–320.Google Scholar
  • Jaakkola T, Jordan MI, Singh SP (1993) Convergence of stochastic iterative dynamic programming algorithms. Cowan J, Tesauro G, Alspector J, eds. Adv. Neural Inform. Processing Systems (Morgan-Kaufmann, Burlington, MA), 703–710.Google Scholar
  • Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Proc. 32nd Internat. Conf. on Neural Inform. Processing Systems (ACM, New York), 4868–4878.Google Scholar
  • Kaledin M, Moulines E, Naumov A, Tadic V, Wai HT (2020) Finite time analysis of linear two-timescale stochastic approximation with Markovian noise. Proc. Conf. on Learn. Theory (PMLR, New York), 2144–2203.Google Scholar
  • Karmakar P, Bhatnagar S (2021) Stochastic approximation with iterate-dependent Markov noise under verifiable conditions in compact state space with the stability of iterates not ensured. IEEE Trans. Automated Control. 66(12):5941–5954.CrossrefGoogle Scholar
  • Khodadadian S, Chen Z, Maguluri ST (2021) Finite-sample analysis of off-policy natural actor-critic algorithm. Proc. Internat. Conf. on Machine Learn. (PMLR, New York), 5420–5431.Google Scholar
  • Konda V, Tsitsiklis J (1999) Actor-critic algorithms. Solla S, Leen T, Müller K, eds. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1008–1014.Google Scholar
  • Kushner H (2010) Stochastic approximation: A survey. Wiley Interdisciplinary Rev. Comput. Statist. 2(1):87–96.CrossrefGoogle Scholar
  • Kushner HJ, Clark DS (2012) Stochastic Approximation Methods for Constrained and Unconstrained Systems, vol. 26 (Springer Science & Business Media, Boston).Google Scholar
  • Küttler H, Nardelli N, Lavril T, Selvatici M, Sivakumar V, Rocktäschel T, Grefenstette E (2019) Torchbeast: A PyTorch platform for distributed RL. Preprint, submitted October 8, https://arxiv.org/abs/1910.03552.Google Scholar
  • Lan G (2020) First-Order and Stochastic Optimization Methods for Machine Learning (Springer, Berlin).CrossrefGoogle Scholar
  • Lax P (1997) Linear Algebra. Pure and Applied Mathematics: A Wiley Series of Texts, Monographs and Tracts (Wiley, New York).Google Scholar
  • Levin DA, Peres Y (2017) Markov Chains and Mixing Times, vol. 107 (American Mathematical Society, Providence, RI).CrossrefGoogle Scholar
  • Li G, Wei Y, Chi Y, Gu Y, Chen Y (2020) Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. Adv. Neural Inform. Processing Systems 33:7031–7043.Google Scholar
  • Ljung L (1977) Analysis of recursive stochastic algorithms. IEEE Trans. Automated Control 22(4):551–575.CrossrefGoogle Scholar
  • Mirowski P, Grimes M, Malinowski M, Hermann KM, Anderson K, Teplyashin D, Simonyan K, et al. (2018) Learning to navigate in cities without a map. Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 2424–2435.Google Scholar
  • Moulines E, Bach F (2011) Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Adv. Neural Inform. Processing Systems 24:451–459.Google Scholar
  • Munos R, Stepleton T, Harutyunyan A, Bellemare MG (2016) Safe and efficient off-policy reinforcement learning. Proc. 30th Internat. Conf. on Neural Inform. Processing Systems (ACM, New York), 1054–1062.Google Scholar
  • Nemirovski A, Juditsky A, Lan G, Shapiro A (2009) Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4):1574–1609.CrossrefGoogle Scholar
  • Precup D, Sutton RS, Singh SP (2000) Eligibility traces for off-policy policy evaluation. Proc. 17th Internat. Conf. on Machine Learn. (ACM, New York), 759–766.Google Scholar
  • Qu G, Wierman A (2020) Finite-time analysis of asynchronous stochastic approximation and Q-learning. Proc. Conf. on Learn. Theory (PMLR, New York), 3185–3205.Google Scholar
  • Robbins H, Monro S (1951) A stochastic approximation method. Ann. Math. Statist. 22(3):400–407.CrossrefGoogle Scholar
  • Ryu EK, Boyd S (2016) Primer on monotone operator methods. Appl. Comput. Math. 15(1):3–43.Google Scholar
  • Singh SP, Sutton RS (1996) Reinforcement learning with replacing eligibility traces. Machine Learn. 22(1):123–158.CrossrefGoogle Scholar
  • Srikant R, Ying L (2019) Finite-time error bounds for linear stochastic approximation and TD learning. Proc. Conf. on Learn. Theory (PMLR, New York), 2803–2830.Google Scholar
  • Sutton RS (1988) Learning to predict by the methods of temporal differences. Machine Learn. 3(1):9–44.CrossrefGoogle Scholar
  • Sutton RS (1999) Open theoretical questions in reinforcement learning. Proc. Eur. Conf. on Comput. Learn. Theory (Springer, Berlin), 11–17.Google Scholar
  • Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
  • Szepesvári C, et al. (1997) The asymptotic convergence-rate of Q-learning. Jordan M, Kearns M, Solla S, eds. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1064–1070.Google Scholar
  • Thoppe G, Borkar V (2019) A concentration bound for stochastic approximation via Alekseev’s formula. Stochastic Systems 9(1):1–26.LinkGoogle Scholar
  • Tsitsiklis JN (1994) Asynchronous stochastic approximation and Q-learning. Machine Learn. 16(3):185–202.CrossrefGoogle Scholar
  • Tsitsiklis JN, Van Roy B (1997) Analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control 42(5):674–690.CrossrefGoogle Scholar
  • Tsitsiklis JN, Van Roy B (1999) Average cost temporal-difference learning. Automatica J. IFAC 35(11):1799–1808.CrossrefGoogle Scholar
  • Vershynin R (2018) High-Dimensional Probability: An Introduction with Applications in Data Science, vol. 47 (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • Wainwright MJ (2019) Stochastic approximation with cone-contractive operators: Sharp ℓ∞-bounds for Q-learning. Preprint, submitted May 15, https://arxiv.org/abs/1905.06265.Google Scholar
  • Watkins C (1989) Learning from delayed rewards. PhD thesis, King’s College, University of Cambridge, Cambridge, UK.Google Scholar
  • Watkins CJ, Dayan P (1992) Q-learning. Machine Learn. 8(3–4):279–292.CrossrefGoogle Scholar
  • Yaji VG, Bhatnagar S (2019) Analysis of stochastic approximation schemes with set-valued maps in the absence of a stability guarantee and their stabilization. IEEE Trans. Automated Control 65(3):1100–1115.CrossrefGoogle Scholar
  • Zeng S, Doan TT, Romberg J (2021) Finite-time analysis of decentralized stochastic approximation with applications in multi-agent and multi-task learning. Proc. 60th IEEE Conf. Decision and Control (IEEE, Piscataway, NJ).Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.