A Lyapunov Theory for Finite-Sample Guarantees of Markovian Stochastic Approximation

Zaiwei Chen
Corresponding Author
Zaiwei Chen
[email protected]
https://orcid.org/0000-0001-9915-5595
The School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332;
Search for more papers by this author
,
Siva T. Maguluri
Siva T. Maguluri
[email protected]
https://orcid.org/0000-0002-5797-1639
The School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332;
Search for more papers by this author
,
Sanjay Shakkottai
Sanjay Shakkottai
[email protected]
https://orcid.org/0000-0002-4325-9050
Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas 78712;
Search for more papers by this author
,
Karthikeyan Shanmugam
Karthikeyan Shanmugam
[email protected]
https://orcid.org/0009-0008-2879-5868
IBM Research AI Group, Yorktown Heights, New York 10598
Search for more papers by this author

Zaiwei Chen

Corresponding Author

Zaiwei Chen

[email protected]

https://orcid.org/0000-0001-9915-5595

The School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332;

Search for more papers by this author

Siva T. Maguluri

[email protected]

https://orcid.org/0000-0002-5797-1639

The School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332;

Search for more papers by this author

Sanjay Shakkottai

[email protected]

https://orcid.org/0000-0002-4325-9050

Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas 78712;

Search for more papers by this author

Karthikeyan Shanmugam

[email protected]

https://orcid.org/0009-0008-2879-5868

IBM Research AI Group, Yorktown Heights, New York 10598

Search for more papers by this author

Published Online:6 Oct 2023https://doi.org/10.1287/opre.2022.0249

References

Banach S (1922) Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamentals Math. 3(1):133–181.Crossref, Google Scholar
Bansal N, Gupta A (2019) Potential-function proofs for gradient methods. Theory Comput. 15(1):1–32.Google Scholar
Beck A (2017) First-Order Methods in Optimization (SIAM, Philadelphia).Crossref, Google Scholar
Beck A, Teboulle M (2012) Smoothing and first order methods: A unified framework. SIAM J. Optim. 22(2):557–580.Crossref, Google Scholar
Beck CL, Srikant R (2013) Improved upper bounds on the expected error in constant step-size Q-learning. Proc. Amer. Control Conf. (IEEE, Piscataway, NJ), 1926–1931.Google Scholar
Benaim M (1996) A dynamical system approach to stochastic approximations. SIAM J. Control Optim. 34(2):437–472.Crossref, Google Scholar
Benveniste A, Métivier M, Priouret P (2012) Adaptive Algorithms and Stochastic Approximations, vol. 22 (Springer Science & Business Media, Boston).Google Scholar
Bertsekas DP, Tsitsiklis JN (1996) Neuro-Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
Bhandari J, Russo D, Singal R (2018) A finite time analysis of temporal difference learning with linear function approximation. Proc. Conf. on Learning Theory, 1691–1692.Google Scholar
Bhatnagar S, Borkar VS (1997) Multiscale stochastic approximation for parametric optimization of hidden Markov models. Probability Engrg. Inform. Sci. 11(4):509–522.Crossref, Google Scholar
Bhatnagar S, Borkar VS (1998) A two timescale stochastic approximation scheme for simulation-based parametric optimization. Probability Engrg. Inform. Sci. 12(4):519–531.Crossref, Google Scholar
Borkar VS (2009) Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48 (Springer, Berlin).Google Scholar
Borkar VS, Meyn SP (2000) The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim. 38(2):447–469.Crossref, Google Scholar
Bottou L, Curtis FE, Nocedal J (2018) Optimization methods for large-scale machine learning. SIAM Rev. 60(2):223–311.Crossref, Google Scholar
Chen Z, Maguluri ST, Shakkottai S, Shanmugam K (2020) Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems (Curran Associates, Inc., Red Hook, NY), 8223–8234.Google Scholar
Chen Z, Zhang S, Doan TT, Clarke JP, Maguluri ST (2022) Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning. Automatica J. IFAC 146:110623.Crossref, Google Scholar
Dalal G, Szörényi B, Thoppe G, Mannor S (2018) Finite sample analysis for TD(0) with function approximation. Proc. 32nd AAAI Conf. on Artificial Intelligence (AAAI, Washington, DC).Google Scholar
Devraj AM, Meyn S (2017) Zap Q-learning. Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Adv. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 2235–2244.Google Scholar
Doan TT (2021) Finite-time analysis and restarting scheme for linear two-time-scale stochastic approximation. SIAM J. Control Optim. 59(4):2798–2819.Crossref, Google Scholar
Doan TT (2023) Finite-time analysis of Markov gradient descent. IEEE Trans. Automated Controls. 68(4):2140–2153.Crossref, Google Scholar
Duchi JC, Agarwal A, Johansson M, Jordan MI (2012) Ergodic mirror descent. SIAM J. Optim. 22(4):1549–1578.Crossref, Google Scholar
Espeholt L, Soyer H, Munos R, Simonyan K, Mnih V, Ward T, Doron Y, et al. (2018) IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. Proc. Internat. Conf. on Machine Learn. (PMLR, New York), 1407–1416.Google Scholar
Even-Dar E, Mansour Y (2003) Learning rates for Q-learning. J. Machine Learn. Res. 5(Dec):1–25.Google Scholar
Glynn PW, Iglehart DL (1989) Importance sampling for stochastic simulations. Management Sci. 35(11):1367–1392.Link, Google Scholar
Gosavi A (2006) Boundedness of iterates in Q-learning. Systems Control Lett. 55(4):347–349.Crossref, Google Scholar
Guzmán C, Nemirovski A (2015) On lower complexity bounds for large-scale smooth convex optimization. J. Complexity 31(1):1–14.Crossref, Google Scholar
Harutyunyan A, Bellemare MG, Stepleton T, Munos R (2016) Q(λ) with off-policy corrections. Proc. Internat. Conf. on Algorithmic Learn. Theory (Springer, Berlin), 305–320.Google Scholar
Jaakkola T, Jordan MI, Singh SP (1993) Convergence of stochastic iterative dynamic programming algorithms. Cowan J, Tesauro G, Alspector J, eds. Adv. Neural Inform. Processing Systems (Morgan-Kaufmann, Burlington, MA), 703–710.Google Scholar
Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Proc. 32nd Internat. Conf. on Neural Inform. Processing Systems (ACM, New York), 4868–4878.Google Scholar
Kaledin M, Moulines E, Naumov A, Tadic V, Wai HT (2020) Finite time analysis of linear two-timescale stochastic approximation with Markovian noise. Proc. Conf. on Learn. Theory (PMLR, New York), 2144–2203.Google Scholar
Karmakar P, Bhatnagar S (2021) Stochastic approximation with iterate-dependent Markov noise under verifiable conditions in compact state space with the stability of iterates not ensured. IEEE Trans. Automated Control. 66(12):5941–5954.Crossref, Google Scholar
Khodadadian S, Chen Z, Maguluri ST (2021) Finite-sample analysis of off-policy natural actor-critic algorithm. Proc. Internat. Conf. on Machine Learn. (PMLR, New York), 5420–5431.Google Scholar
Konda V, Tsitsiklis J (1999) Actor-critic algorithms. Solla S, Leen T, Müller K, eds. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1008–1014.Google Scholar
Kushner H (2010) Stochastic approximation: A survey. Wiley Interdisciplinary Rev. Comput. Statist. 2(1):87–96.Crossref, Google Scholar
Kushner HJ, Clark DS (2012) Stochastic Approximation Methods for Constrained and Unconstrained Systems, vol. 26 (Springer Science & Business Media, Boston).Google Scholar
Küttler H, Nardelli N, Lavril T, Selvatici M, Sivakumar V, Rocktäschel T, Grefenstette E (2019) Torchbeast: A PyTorch platform for distributed RL. Preprint, submitted October 8, https://arxiv.org/abs/1910.03552.Google Scholar
Lan G (2020) First-Order and Stochastic Optimization Methods for Machine Learning (Springer, Berlin).Crossref, Google Scholar
Lax P (1997) Linear Algebra. Pure and Applied Mathematics: A Wiley Series of Texts, Monographs and Tracts (Wiley, New York).Google Scholar
Levin DA, Peres Y (2017) Markov Chains and Mixing Times, vol. 107 (American Mathematical Society, Providence, RI).Crossref, Google Scholar
Li G, Wei Y, Chi Y, Gu Y, Chen Y (2020) Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. Adv. Neural Inform. Processing Systems 33:7031–7043.Google Scholar
Ljung L (1977) Analysis of recursive stochastic algorithms. IEEE Trans. Automated Control 22(4):551–575.Crossref, Google Scholar
Mirowski P, Grimes M, Malinowski M, Hermann KM, Anderson K, Teplyashin D, Simonyan K, et al. (2018) Learning to navigate in cities without a map. Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 2424–2435.Google Scholar
Moulines E, Bach F (2011) Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Adv. Neural Inform. Processing Systems 24:451–459.Google Scholar
Munos R, Stepleton T, Harutyunyan A, Bellemare MG (2016) Safe and efficient off-policy reinforcement learning. Proc. 30th Internat. Conf. on Neural Inform. Processing Systems (ACM, New York), 1054–1062.Google Scholar
Nemirovski A, Juditsky A, Lan G, Shapiro A (2009) Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4):1574–1609.Crossref, Google Scholar
Precup D, Sutton RS, Singh SP (2000) Eligibility traces for off-policy policy evaluation. Proc. 17th Internat. Conf. on Machine Learn. (ACM, New York), 759–766.Google Scholar
Qu G, Wierman A (2020) Finite-time analysis of asynchronous stochastic approximation and Q-learning. Proc. Conf. on Learn. Theory (PMLR, New York), 3185–3205.Google Scholar
Robbins H, Monro S (1951) A stochastic approximation method. Ann. Math. Statist. 22(3):400–407.Crossref, Google Scholar
Ryu EK, Boyd S (2016) Primer on monotone operator methods. Appl. Comput. Math. 15(1):3–43.Google Scholar
Singh SP, Sutton RS (1996) Reinforcement learning with replacing eligibility traces. Machine Learn. 22(1):123–158.Crossref, Google Scholar
Srikant R, Ying L (2019) Finite-time error bounds for linear stochastic approximation and TD learning. Proc. Conf. on Learn. Theory (PMLR, New York), 2803–2830.Google Scholar
Sutton RS (1988) Learning to predict by the methods of temporal differences. Machine Learn. 3(1):9–44.Crossref, Google Scholar
Sutton RS (1999) Open theoretical questions in reinforcement learning. Proc. Eur. Conf. on Comput. Learn. Theory (Springer, Berlin), 11–17.Google Scholar
Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
Szepesvári C, et al. (1997) The asymptotic convergence-rate of Q-learning. Jordan M, Kearns M, Solla S, eds. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1064–1070.Google Scholar
Thoppe G, Borkar V (2019) A concentration bound for stochastic approximation via Alekseev’s formula. Stochastic Systems 9(1):1–26.Link, Google Scholar
Tsitsiklis JN (1994) Asynchronous stochastic approximation and Q-learning. Machine Learn. 16(3):185–202.Crossref, Google Scholar
Tsitsiklis JN, Van Roy B (1997) Analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control 42(5):674–690.Crossref, Google Scholar
Tsitsiklis JN, Van Roy B (1999) Average cost temporal-difference learning. Automatica J. IFAC 35(11):1799–1808.Crossref, Google Scholar
Vershynin R (2018) High-Dimensional Probability: An Introduction with Applications in Data Science, vol. 47 (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
Wainwright MJ (2019) Stochastic approximation with cone-contractive operators: Sharp ℓ∞-bounds for Q-learning. Preprint, submitted May 15, https://arxiv.org/abs/1905.06265.Google Scholar
Watkins C (1989) Learning from delayed rewards. PhD thesis, King’s College, University of Cambridge, Cambridge, UK.Google Scholar
Watkins CJ, Dayan P (1992) Q-learning. Machine Learn. 8(3–4):279–292.Crossref, Google Scholar
Yaji VG, Bhatnagar S (2019) Analysis of stochastic approximation schemes with set-valued maps in the absence of a stability guarantee and their stabilization. IEEE Trans. Automated Control 65(3):1100–1115.Crossref, Google Scholar
Zeng S, Doan TT, Romberg J (2021) Finite-time analysis of decentralized stochastic approximation with applications in multi-agent and multi-task learning. Proc. 60th IEEE Conf. Decision and Control (IEEE, Piscataway, NJ).Google Scholar

Volume 72, Issue 4

July-August 2024

Pages iii-vi, 1317-1750, C2-C3

Article Information

Supplemental Material

Metrics

Information

Received:May 16, 2022
Accepted:August 28, 2023
Published Online:October 06, 2023

Cite as

Zaiwei Chen, Siva T. Maguluri, Sanjay Shakkottai, Karthikeyan Shanmugam (2023) A Lyapunov Theory for Finite-Sample Guarantees of Markovian Stochastic Approximation. Operations Research 72(4):1352-1367.

https://doi.org/10.1287/opre.2022.0249

Keywords

Acknowledgments

Z. Chen recently moved to Caltech as a postdoctoral fellow in August 2022. This work was done when Z. Chen was affiliated with Georgia Tech. K. Shanmugam recently moved to Google Research India (Bengaluru) in April 2022. This work was done when K. Shanmugam was affiliated with IBM.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

A Lyapunov Theory for Finite-Sample Guarantees of Markovian Stochastic Approximation

References

Volume 72, Issue 4

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News