Bias and Extrapolation in Markovian Linear Stochastic Approximation with Constant Step Sizes
Published Online:1 Jun 2026https://doi.org/10.1287/moor.2024.0471
References
- [1] (2023) Bias and refinement of multiscale mean field models. Proc. ACM Measurement Anal. Comput. Systems 7(1):23.Google Scholar
- [2] (2013) Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Burges C, Bottou L, Welling M, Ghahramani Z, Weinberger K, eds. Adv. Neural Inform. Processing Systems, vol. 26 (Curran Associates, Inc., Red Hook, NY).Google Scholar
- [3] (1982) A measure of the tracking capability of recursive stochastic algorithms with constant gains. IEEE Trans. Automatic Control 27(3):639–649.Crossref, Google Scholar
- [4] (2012) Adaptive Algorithms and Stochastic Approximations, 1st ed. (Springer, Berlin).Google Scholar
- [5] (2019) Reinforcement Learning and Optimal Control (Athena Scientific, Belmont, MA).Google Scholar
- [6] (2021) A finite time analysis of temporal difference learning with linear function approximation. Oper. Res. 69(3):950–973.Link, Google Scholar
- [7] (1954) Approximation methods which converge with probability one. Ann. Math. Statist. 25(2):382–386.Crossref, Google Scholar
- [8] (2008) Stochastic Approximation: A Dynamical Systems Viewpoint (Hindustan Book Agency, Gurgaon, India).Crossref, Google Scholar
- [9] (2000) The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim. 38(2):447–469.Crossref, Google Scholar
- [10] (2025) The ODE method for asymptotic statistics in stochastic approximation and reinforcement learning. Ann. Appl. Probab. 35(2):936–982.Crossref, Google Scholar
- [11] (2025) The BAR approach for multiclass queueing networks with SBP service policies. Stochastic Systems 15(1):1–49.Link, Google Scholar
- [12] (1993) Weak convergence and local stability properties of fixed step size recursive algorithms. IEEE Trans. Inform. Theory 39(3):966–978.Crossref, Google Scholar
- [13] (2022) Concentration of contractive stochastic approximation and reinforcement learning. Stochastic Systems 12(4):411–430.Link, Google Scholar
- [14] (2022) Stationary behavior of constant stepsize SGD type algorithms: An asymptotic characterization. Proc. ACM Measurement Anal. Comput. Systems 6(1):19.Google Scholar
- [15] (2025) Concentration of contractive stochastic approximation: Additive and multiplicative noise. Ann. Appl. Probab. 35(2):1298–1352.Crossref, Google Scholar
- [16] (2020) Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes. Larochelle H, Ranzato M, Hadsell R, Balcan M, Lin H, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 8223–8234.Google Scholar
- [17] (2021) Finite-sample analysis of off-policy TD-learning via generalized Bellman operators. Ranzato M, Beygelzimer A, Dauphin Y, Liang P, Vaughan JW, eds. Adv. Neural Inform. Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, NY), 21440–21452.Google Scholar
- [18] (2024) A Lyapunov theory for finite-sample guarantees of Markovian stochastic approximation. Oper. Res. 72(4):1352–1367.Link, Google Scholar
- [19] (1999) Probabilistic Networks and Expert Systems: Exact Computational Methods for Bayesian Networks, Information Science and Statistics, 1st ed. (Springer, New York).Google Scholar
- [20] (2011) Nonnegativity of solutions to the basic adjoint relationship for some diffusion processes. Queueing Systems 68(3):295–303.Crossref, Google Scholar
- [21] (2018) Finite sample analyses for TD(0) with function approximation. Proc. Thirty-Second AAAI Conf. AAAI’18/IAAI’18/EAAI’18 (AAAI Press, New Orleans), 6144–6160.Google Scholar
- [22] (1992) The convergence of TD(λ) for general λ. Machine Learn. 8(3):341–362.Crossref, Google Scholar
- [23] (1994) TD(λ) converges with probability 1. Machine Learn. 14(3):295–301.Crossref, Google Scholar
- [24] (2020) Bridging the gap between constant step size stochastic gradient descent and Markov chains. Ann. Statist. 48(3):1348–1382.Crossref, Google Scholar
- [25] (2022) Stochastic gradient descent with dependent data for offline reinforcement learning. Preprint, submitted February 6, https://arxiv.org/abs/2202.02850.Google Scholar
- [26] (2018) Markov Chains, 1st ed. (Springer, Cham, Switzerland).Crossref, Google Scholar
- [27] (2025) Finite-time high-probability bounds for Polyak–Ruppert averaged iterates of linear stochastic approximation. Math. Oper. Res. 50(2):935–964.Link, Google Scholar
- [28] (2021) Tight high probability bounds for linear stochastic approximation with fixed stepsize. Ranzato M, Beygelzimer A, Dauphin Y, Liang P, Vaughan JW, eds. Adv. Neural Inform. Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, NY), 30063–30074.Google Scholar
- [29] (2012) Asymptotically tight steady-state queue length bounds implied by drift conditions. Queueing Systems 72(3):311–359.Crossref, Google Scholar
- [30] (2003) Learning rates for Q-learning. J. Machine Learn. Res. 5:1–25.Google Scholar
- [31] (1999) Real Analysis: Modern Techniques and Their Applications, 2nd ed. (Wiley, New York).Google Scholar
- [32] (2014) Naive Set Theory. Undergraduate Texts in Mathematics (Springer, New York).Google Scholar
- [33] (1985) Brownian Motion and Stochastic Flow Systems (Wiley, New York).Google Scholar
- [34] (2022) Bias and extrapolation in Markovian linear stochastic approximation with constant stepsizes. Preprint, submitted October 3, https://arxiv.org/abs/2210.00953.Google Scholar
- [35] (2023) Bias and extrapolation in Markovian linear stochastic approximation with constant stepsizes. Abstract Proc. 2023 ACM SIGMETRICS Internat. Conf. Measurement Model. Comput. Systems (Association for Computing Machinery, New York), 81–82.Google Scholar
- [36] (2003) On the sample complexity of reinforcement learning. PhD thesis, University College London, London.Google Scholar
- [37] (1998) Finite-sample convergence rates for Q-learning and indirect algorithms. Kearns M, Solla S, Cohn D, eds. Adv. Neural Inform. Processing Systems, vol. 11 (MIT Press, Cambridge, MA), 996–1002.Google Scholar
- [38] (2001) Nonlinear Systems, 3rd ed. (Pearson, Upper Saddle River, NJ).Google Scholar
- [39] (2021) Is temporal difference learning optimal? An instance-dependent analysis. SIAM J. Math. Data Sci. 3(4):1013–1040.Crossref, Google Scholar
- [40] (2000) Policy iteration for factored MDPs. Proc. Sixteenth Conf. Uncertainty Artificial Intelligence (Morgan Kaufmann Publishers, San Francisco), 326–334.Google Scholar
- [41] (1991) Convergence of learning algorithms with constant learning rates. IEEE Trans. Neural Networks 2(5):484–489.Crossref, Google Scholar
- [42] (1981) Asymptotic properties of stochastic approximations with constant coefficients. SIAM J. Control Optim. 19(1):87–105.Crossref, Google Scholar
- [43] (1981) Averaging methods for the asymptotic analysis of learning and adaptive systems, with small adjustment rate. SIAM J. Control Optim. 19(5):635–650.Crossref, Google Scholar
- [44] (2003) Stochastic Approximation and Recursive Algorithms and Applications, 2nd ed. (Springer, New York).Google Scholar
- [45] (2003) Least-squares policy iteration. J. Machine Learn. Res. 4:1107–1149.Google Scholar
- [46] (2018) Linear stochastic approximation: How far does constant step-size and iterate averaging go? Storkey A, Perez-Cruz F, eds. Proc. Twenty-First Internat. Conf. Artificial Intelligence Statist., vol. 84 (PMLR, New York), 1347–1355.Google Scholar
- [47] (2023) The curse of memory in stochastic approximation. 2023 62nd IEEE Conf. Decision Control (CDC) (IEEE, Piscataway, NJ), 7803–7809.Google Scholar
- [48] (2024) Markovian foundations for quasi-stochastic approximation with applications to extremum seeking control. Preprint, submitted April 1, https://arxiv.org/abs/2207.06371.Google Scholar
- [49] (2017) Markov Chains and Mixing Times, 2nd ed. (American Mathematical Society, Providence, RI).Crossref, Google Scholar
- [50] (2009) Markov Chains and Stochastic Stability, 2nd ed. (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
- [51] (2024) Optimal and instance-dependent guarantees for Markovian linear stochastic approximation. Math. Statist. Learn. 7(1):41–153.Crossref, Google Scholar
- [52] (2020) On linear stochastic approximation: Fine-grained Polyak–Ruppert and non-asymptotic concentration. Abernethy J, Agarwal S, eds. Proc. Thirty Third Conf. Learn. Theory, vol. 125 (PMLR, New York), 2947–2997.Google Scholar
- [53] (2020) Least squares regression with markovian data: Fundamental limits and algorithms. Larochelle H, Ranzato M, Hadsell R, Balcan M, Lin H, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 16666–16676.Google Scholar
- [54] (2015) Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic J. Probab. 20:1–32.Crossref, Google Scholar
- [55] (1986) Stochastic minimization with constant step-size: Asymptotic laws. SIAM J. Control Optim. 24(4):655–666.Crossref, Google Scholar
- [56] (1990) Non-asymptotic confidence bounds for stochastic approximation algorithms with constant step size. Monatshefte Mathematik 110(3):297–314.Crossref, Google Scholar
- [57] (1990) New stochastic approximation type procedures. Automation Remote Control 51(7):98–107.Google Scholar
- [58] (1992) Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30(4):838–855.Crossref, Google Scholar
- [59] (1951) A stochastic approximation method. Ann. Math. Statist. 22(3):400–407.Crossref, Google Scholar
- [60] (1988) Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University, Ithaca, NY.Google Scholar
- [61] (1974) On the Lyapunov matrix equation. IEEE Trans. Automatic Control 19(5):594–596.Crossref, Google Scholar
- [62] (2019) Finite-time error bounds for linear stochastic approximation and TD learning. Beygelzimer A, Hsu D, eds. Proc. Thirty-Second Conf. Learn. Theory, vol. 99 (PMLR, New York), 2803–2830.Google Scholar
- [63] (2002) Introduction to Numerical Analysis, 3rd ed. (Springer, New York).Crossref, Google Scholar
- [64] (1988) Learning to predict by the methods of temporal differences. Machine Learn. 3(1):9–44.Crossref, Google Scholar
- [65] (2018) Reinforcement Learning: An Introduction (Bradford Book, Cambridge, MA).Google Scholar
- [66] (1997) The asymptotic convergence-rate of Q-Learning. Jordan M, Kearns M, Solla S, eds. Adv. Neural Inform. Processing Systems, vol. 10 (MIT Press, Cambridge, MA), 1064–1070.Google Scholar
- [67] (1994) Asynchronous stochastic approximation and Q-learning. Machine Learn. 16(3):185–202.Crossref, Google Scholar
- [68] (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control 42(5):674–690.Crossref, Google Scholar
- [69] (2009) Optimal Transport: Old and New (Springer, Berlin).Crossref, Google Scholar
- [70] (1992) Q-learning. Machine Learn. 8(3):279–292.Crossref, Google Scholar
- [71] (2022) Optimization for Data Analysis (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
- [72] (2021) An analysis of constant step size SGD in the non-convex regime: Asymptotic normality and bias. Ranzato M, Beygelzimer A, Dauphin Y, Liang P, Vaughan JW, eds. Adv. Neural Inform. Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, NY), 4234–4248.Google Scholar
- [73] (2024) Constant stepsize Q-learning: Distributional convergence, bias and extrapolation. Reinforcement Learn. J. 3:1168–1210.Google Scholar

