Bias and Extrapolation in Markovian Linear Stochastic Approximation with Constant Step Sizes

Dongyan (Lucy) Huo
Corresponding Author
Dongyan (Lucy) Huo
[email protected]
https://orcid.org/0000-0003-4086-2388
Department of Industrial Engineering and Decision Analytics, The Hong Kong University of Science and Technology, Kowloon, Hong Kong
Search for more papers by this author
,
Yudong Chen
Yudong Chen
[email protected]
https://orcid.org/0000-0002-6416-5635
Department of Computer Sciences, University of Wisconsin–Madison, Madison, Wisconsin 53706
Search for more papers by this author
,
Qiaomin Xie
Qiaomin Xie
[email protected]
https://orcid.org/0000-0003-2834-6866
Department of Industrial and Systems Engineering, University of Wisconsin–Madison, Madison, Wisconsin 53706
Search for more papers by this author

Corresponding Author

Dongyan (Lucy) Huo

Department of Industrial Engineering and Decision Analytics, The Hong Kong University of Science and Technology, Kowloon, Hong Kong

Search for more papers by this author

Yudong Chen

[email protected]

https://orcid.org/0000-0002-6416-5635

Department of Computer Sciences, University of Wisconsin–Madison, Madison, Wisconsin 53706

Search for more papers by this author

Qiaomin Xie

[email protected]

https://orcid.org/0000-0003-2834-6866

Department of Industrial and Systems Engineering, University of Wisconsin–Madison, Madison, Wisconsin 53706

Search for more papers by this author

Published Online:1 Jun 2026https://doi.org/10.1287/moor.2024.0471

References

[1] Allmeier S, Gast N (2023) Bias and refinement of multiscale mean field models. Proc. ACM Measurement Anal. Comput. Systems 7(1):23.Google Scholar
[2] Bach F, Moulines E (2013) Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Burges C, Bottou L, Welling M, Ghahramani Z, Weinberger K, eds. Adv. Neural Inform. Processing Systems, vol. 26 (Curran Associates, Inc., Red Hook, NY).Google Scholar
[3] Benveniste A, Ruget G (1982) A measure of the tracking capability of recursive stochastic algorithms with constant gains. IEEE Trans. Automatic Control 27(3):639–649.Crossref, Google Scholar
[4] Benveniste A, Metivier M, Priouret P (2012) Adaptive Algorithms and Stochastic Approximations, 1st ed. (Springer, Berlin).Google Scholar
[5] Bertsekas DP (2019) Reinforcement Learning and Optimal Control (Athena Scientific, Belmont, MA).Google Scholar
[6] Bhandari J, Russo D, Singal R (2021) A finite time analysis of temporal difference learning with linear function approximation. Oper. Res. 69(3):950–973.Link, Google Scholar
[7] Blum JR (1954) Approximation methods which converge with probability one. Ann. Math. Statist. 25(2):382–386.Crossref, Google Scholar
[8] Borkar VS (2008) Stochastic Approximation: A Dynamical Systems Viewpoint (Hindustan Book Agency, Gurgaon, India).Crossref, Google Scholar
[9] Borkar VS, Meyn SP (2000) The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim. 38(2):447–469.Crossref, Google Scholar
[10] Borkar V, Chen S, Devraj A, Kontoyiannis I, Meyn S (2025) The ODE method for asymptotic statistics in stochastic approximation and reinforcement learning. Ann. Appl. Probab. 35(2):936–982.Crossref, Google Scholar
[11] Braverman A, Dai JG, Miyazawa M (2025) The BAR approach for multiclass queueing networks with SBP service policies. Stochastic Systems 15(1):1–49.Link, Google Scholar
[12] Bucklew JA, Kurtz TG, Sethares WA (1993) Weak convergence and local stability properties of fixed step size recursive algorithms. IEEE Trans. Inform. Theory 39(3):966–978.Crossref, Google Scholar
[13] Chandak S, Borkar VS, Dodhia P (2022) Concentration of contractive stochastic approximation and reinforcement learning. Stochastic Systems 12(4):411–430.Link, Google Scholar
[14] Chen Z, Mou S, Maguluri ST (2022) Stationary behavior of constant stepsize SGD type algorithms: An asymptotic characterization. Proc. ACM Measurement Anal. Comput. Systems 6(1):19.Google Scholar
[15] Chen Z, Maguluri ST, Zubeldia M (2025) Concentration of contractive stochastic approximation: Additive and multiplicative noise. Ann. Appl. Probab. 35(2):1298–1352.Crossref, Google Scholar
[16] Chen Z, Maguluri ST, Shakkottai S, Shanmugam K (2020) Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes. Larochelle H, Ranzato M, Hadsell R, Balcan M, Lin H, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 8223–8234.Google Scholar
[17] Chen Z, Maguluri ST, Shakkottai S, Shanmugam K (2021) Finite-sample analysis of off-policy TD-learning via generalized Bellman operators. Ranzato M, Beygelzimer A, Dauphin Y, Liang P, Vaughan JW, eds. Adv. Neural Inform. Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, NY), 21440–21452.Google Scholar
[18] Chen Z, Maguluri ST, Shakkottai S, Shanmugam K (2024) A Lyapunov theory for finite-sample guarantees of Markovian stochastic approximation. Oper. Res. 72(4):1352–1367.Link, Google Scholar
[19] Cowell RG, Dawid AP, Lauritzen SL, Spiegelhalter DJ (1999) Probabilistic Networks and Expert Systems: Exact Computational Methods for Bayesian Networks, Information Science and Statistics, 1st ed. (Springer, New York).Google Scholar
[20] Dai JG, Dieker AB (2011) Nonnegativity of solutions to the basic adjoint relationship for some diffusion processes. Queueing Systems 68(3):295–303.Crossref, Google Scholar
[21] Dalal G, Szörényi B, Thoppe G, Mannor S (2018) Finite sample analyses for TD(0) with function approximation. Proc. Thirty-Second AAAI Conf. AAAI’18/IAAI’18/EAAI’18 (AAAI Press, New Orleans), 6144–6160.Google Scholar
[22] Dayan P (1992) The convergence of TD(λ) for general λ. Machine Learn. 8(3):341–362.Crossref, Google Scholar
[23] Dayan P, Sejnowski TJ (1994) TD(λ) converges with probability 1. Machine Learn. 14(3):295–301.Crossref, Google Scholar
[24] Dieuleveut A, Durmus A, Bach F (2020) Bridging the gap between constant step size stochastic gradient descent and Markov chains. Ann. Statist. 48(3):1348–1382.Crossref, Google Scholar
[25] Dong J, Tong XT (2022) Stochastic gradient descent with dependent data for offline reinforcement learning. Preprint, submitted February 6, https://arxiv.org/abs/2202.02850.Google Scholar
[26] Douc R, Moulines E, Priouret P, Soulier P (2018) Markov Chains, 1st ed. (Springer, Cham, Switzerland).Crossref, Google Scholar
[27] Durmus A, Moulines E, Naumov A, Samsonov S (2025) Finite-time high-probability bounds for Polyak–Ruppert averaged iterates of linear stochastic approximation. Math. Oper. Res. 50(2):935–964.Link, Google Scholar
[28] Durmus A, Moulines E, Naumov A, Samsonov S, Scaman K, Wai HT (2021) Tight high probability bounds for linear stochastic approximation with fixed stepsize. Ranzato M, Beygelzimer A, Dauphin Y, Liang P, Vaughan JW, eds. Adv. Neural Inform. Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, NY), 30063–30074.Google Scholar
[29] Eryilmaz A, Srikant R (2012) Asymptotically tight steady-state queue length bounds implied by drift conditions. Queueing Systems 72(3):311–359.Crossref, Google Scholar
[30] Even-Dar E, Mansour Y (2003) Learning rates for Q-learning. J. Machine Learn. Res. 5:1–25.Google Scholar
[31] Folland GB (1999) Real Analysis: Modern Techniques and Their Applications, 2nd ed. (Wiley, New York).Google Scholar
[32] Halmos PR (2014) Naive Set Theory. Undergraduate Texts in Mathematics (Springer, New York).Google Scholar
[33] Harrison JM (1985) Brownian Motion and Stochastic Flow Systems (Wiley, New York).Google Scholar
[34] Huo DL, Chen Y, Xie Q (2022) Bias and extrapolation in Markovian linear stochastic approximation with constant stepsizes. Preprint, submitted October 3, https://arxiv.org/abs/2210.00953.Google Scholar
[35] Huo DL, Chen Y, Xie Q (2023) Bias and extrapolation in Markovian linear stochastic approximation with constant stepsizes. Abstract Proc. 2023 ACM SIGMETRICS Internat. Conf. Measurement Model. Comput. Systems (Association for Computing Machinery, New York), 81–82.Google Scholar
[36] Kakade SM (2003) On the sample complexity of reinforcement learning. PhD thesis, University College London, London.Google Scholar
[37] Kearns M, Singh S (1998) Finite-sample convergence rates for Q-learning and indirect algorithms. Kearns M, Solla S, Cohn D, eds. Adv. Neural Inform. Processing Systems, vol. 11 (MIT Press, Cambridge, MA), 996–1002.Google Scholar
[38] Khalil HK (2001) Nonlinear Systems, 3rd ed. (Pearson, Upper Saddle River, NJ).Google Scholar
[39] Khamaru K, Pananjady A, Ruan F, Wainwright MJ, Jordan MI (2021) Is temporal difference learning optimal? An instance-dependent analysis. SIAM J. Math. Data Sci. 3(4):1013–1040.Crossref, Google Scholar
[40] Koller D, Parr R (2000) Policy iteration for factored MDPs. Proc. Sixteenth Conf. Uncertainty Artificial Intelligence (Morgan Kaufmann Publishers, San Francisco), 326–334.Google Scholar
[41] Kuan CM, Hornik K (1991) Convergence of learning algorithms with constant learning rates. IEEE Trans. Neural Networks 2(5):484–489.Crossref, Google Scholar
[42] Kushner HJ, Huang H (1981) Asymptotic properties of stochastic approximations with constant coefficients. SIAM J. Control Optim. 19(1):87–105.Crossref, Google Scholar
[43] Kushner HJ, Huang H (1981) Averaging methods for the asymptotic analysis of learning and adaptive systems, with small adjustment rate. SIAM J. Control Optim. 19(5):635–650.Crossref, Google Scholar
[44] Kushner HJ, Yin GG (2003) Stochastic Approximation and Recursive Algorithms and Applications, 2nd ed. (Springer, New York).Google Scholar
[45] Lagoudakis MG, Parr R (2003) Least-squares policy iteration. J. Machine Learn. Res. 4:1107–1149.Google Scholar
[46] Lakshminarayanan C, Szepesvári C (2018) Linear stochastic approximation: How far does constant step-size and iterate averaging go? Storkey A, Perez-Cruz F, eds. Proc. Twenty-First Internat. Conf. Artificial Intelligence Statist., vol. 84 (PMLR, New York), 1347–1355.Google Scholar
[47] Lauand CK, Meyn S (2023) The curse of memory in stochastic approximation. 2023 62nd IEEE Conf. Decision Control (CDC) (IEEE, Piscataway, NJ), 7803–7809.Google Scholar
[48] Lauand CK, Meyn SP (2024) Markovian foundations for quasi-stochastic approximation with applications to extremum seeking control. Preprint, submitted April 1, https://arxiv.org/abs/2207.06371.Google Scholar
[49] Levin DA, Peres Y (2017) Markov Chains and Mixing Times, 2nd ed. (American Mathematical Society, Providence, RI).Crossref, Google Scholar
[50] Meyn SP, Tweedie RL (2009) Markov Chains and Stochastic Stability, 2nd ed. (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
[51] Mou W, Pananjady A, Wainwright MJ, Bartlett PL (2024) Optimal and instance-dependent guarantees for Markovian linear stochastic approximation. Math. Statist. Learn. 7(1):41–153.Crossref, Google Scholar
[52] Mou W, Li CJ, Wainwright MJ, Bartlett PL, Jordan MI (2020) On linear stochastic approximation: Fine-grained Polyak–Ruppert and non-asymptotic concentration. Abernethy J, Agarwal S, eds. Proc. Thirty Third Conf. Learn. Theory, vol. 125 (PMLR, New York), 2947–2997.Google Scholar
[53] Nagaraj D, Wu X, Bresler G, Jain P, Netrapalli P (2020) Least squares regression with markovian data: Fundamental limits and algorithms. Larochelle H, Ranzato M, Hadsell R, Balcan M, Lin H, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 16666–16676.Google Scholar
[54] Paulin D (2015) Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic J. Probab. 20:1–32.Crossref, Google Scholar
[55] Pflug GC (1986) Stochastic minimization with constant step-size: Asymptotic laws. SIAM J. Control Optim. 24(4):655–666.Crossref, Google Scholar
[56] Pflug GC (1990) Non-asymptotic confidence bounds for stochastic approximation algorithms with constant step size. Monatshefte Mathematik 110(3):297–314.Crossref, Google Scholar
[57] Polyak BT (1990) New stochastic approximation type procedures. Automation Remote Control 51(7):98–107.Google Scholar
[58] Polyak BT, Juditsky AB (1992) Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30(4):838–855.Crossref, Google Scholar
[59] Robbins H, Monro S (1951) A stochastic approximation method. Ann. Math. Statist. 22(3):400–407.Crossref, Google Scholar
[60] Ruppert D (1988) Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University, Ithaca, NY.Google Scholar
[61] Shapiro E (1974) On the Lyapunov matrix equation. IEEE Trans. Automatic Control 19(5):594–596.Crossref, Google Scholar
[62] Srikant R, Ying L (2019) Finite-time error bounds for linear stochastic approximation and TD learning. Beygelzimer A, Hsu D, eds. Proc. Thirty-Second Conf. Learn. Theory, vol. 99 (PMLR, New York), 2803–2830.Google Scholar
[63] Stoer J, Bulirsch R (2002) Introduction to Numerical Analysis, 3rd ed. (Springer, New York).Crossref, Google Scholar
[64] Sutton RS (1988) Learning to predict by the methods of temporal differences. Machine Learn. 3(1):9–44.Crossref, Google Scholar
[65] Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (Bradford Book, Cambridge, MA).Google Scholar
[66] Szepesvári C (1997) The asymptotic convergence-rate of Q-Learning. Jordan M, Kearns M, Solla S, eds. Adv. Neural Inform. Processing Systems, vol. 10 (MIT Press, Cambridge, MA), 1064–1070.Google Scholar
[67] Tsitsiklis JN (1994) Asynchronous stochastic approximation and Q-learning. Machine Learn. 16(3):185–202.Crossref, Google Scholar
[68] Tsitsiklis JN, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control 42(5):674–690.Crossref, Google Scholar
[69] Villani C (2009) Optimal Transport: Old and New (Springer, Berlin).Crossref, Google Scholar
[70] Watkins CJCH, Dayan P (1992) Q-learning. Machine Learn. 8(3):279–292.Crossref, Google Scholar
[71] Wright SJ, Recht B (2022) Optimization for Data Analysis (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
[72] Yu L, Balasubramanian K, Volgushev S, Erdogdu MA (2021) An analysis of constant step size SGD in the non-convex regime: Asymptotic normality and bias. Ranzato M, Beygelzimer A, Dauphin Y, Liang P, Vaughan JW, eds. Adv. Neural Inform. Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, NY), 4234–4248.Google Scholar
[73] Zhang Y, Xie Q (2024) Constant stepsize Q-learning: Distributional convergence, bias and extrapolation. Reinforcement Learn. J. 3:1168–1210.Google Scholar

cover image Mathematics of Operations Research

Articles In Advance

Article Information

Metrics

Information

Received:April 09, 2024
Accepted:December 23, 2025
Published Online:June 01, 2026

Cite as

Dongyan (Lucy) Huo, Yudong Chen, Qiaomin Xie (2026) Bias and Extrapolation in Markovian Linear Stochastic Approximation with Constant Step Sizes. Mathematics of Operations Research 0(0).

https://doi.org/10.1287/moor.2024.0471

Keywords

Acknowledgments

The authors thank Jim Dai for inspiring discussion.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Bias and Extrapolation in Markovian Linear Stochastic Approximation with Constant Step Sizes

References

Articles In Advance

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News