Bias and Extrapolation in Markovian Linear Stochastic Approximation with Constant Step Sizes

Published Online:https://doi.org/10.1287/moor.2024.0471

References

  • [1] Allmeier S, Gast N (2023) Bias and refinement of multiscale mean field models. Proc. ACM Measurement Anal. Comput. Systems 7(1):23.Google Scholar
  • [2] Bach F, Moulines E (2013) Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Burges C, Bottou L, Welling M, Ghahramani Z, Weinberger K, eds. Adv. Neural Inform. Processing Systems, vol. 26 (Curran Associates, Inc., Red Hook, NY).Google Scholar
  • [3] Benveniste A, Ruget G (1982) A measure of the tracking capability of recursive stochastic algorithms with constant gains. IEEE Trans. Automatic Control 27(3):639–649.CrossrefGoogle Scholar
  • [4] Benveniste A, Metivier M, Priouret P (2012) Adaptive Algorithms and Stochastic Approximations, 1st ed. (Springer, Berlin).Google Scholar
  • [5] Bertsekas DP (2019) Reinforcement Learning and Optimal Control (Athena Scientific, Belmont, MA).Google Scholar
  • [6] Bhandari J, Russo D, Singal R (2021) A finite time analysis of temporal difference learning with linear function approximation. Oper. Res. 69(3):950–973.LinkGoogle Scholar
  • [7] Blum JR (1954) Approximation methods which converge with probability one. Ann. Math. Statist. 25(2):382–386.CrossrefGoogle Scholar
  • [8] Borkar VS (2008) Stochastic Approximation: A Dynamical Systems Viewpoint (Hindustan Book Agency, Gurgaon, India).CrossrefGoogle Scholar
  • [9] Borkar VS, Meyn SP (2000) The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim. 38(2):447–469.CrossrefGoogle Scholar
  • [10] Borkar V, Chen S, Devraj A, Kontoyiannis I, Meyn S (2025) The ODE method for asymptotic statistics in stochastic approximation and reinforcement learning. Ann. Appl. Probab. 35(2):936–982.CrossrefGoogle Scholar
  • [11] Braverman A, Dai JG, Miyazawa M (2025) The BAR approach for multiclass queueing networks with SBP service policies. Stochastic Systems 15(1):1–49.LinkGoogle Scholar
  • [12] Bucklew JA, Kurtz TG, Sethares WA (1993) Weak convergence and local stability properties of fixed step size recursive algorithms. IEEE Trans. Inform. Theory 39(3):966–978.CrossrefGoogle Scholar
  • [13] Chandak S, Borkar VS, Dodhia P (2022) Concentration of contractive stochastic approximation and reinforcement learning. Stochastic Systems 12(4):411–430.LinkGoogle Scholar
  • [14] Chen Z, Mou S, Maguluri ST (2022) Stationary behavior of constant stepsize SGD type algorithms: An asymptotic characterization. Proc. ACM Measurement Anal. Comput. Systems 6(1):19.Google Scholar
  • [15] Chen Z, Maguluri ST, Zubeldia M (2025) Concentration of contractive stochastic approximation: Additive and multiplicative noise. Ann. Appl. Probab. 35(2):1298–1352.CrossrefGoogle Scholar
  • [16] Chen Z, Maguluri ST, Shakkottai S, Shanmugam K (2020) Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes. Larochelle H, Ranzato M, Hadsell R, Balcan M, Lin H, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 8223–8234.Google Scholar
  • [17] Chen Z, Maguluri ST, Shakkottai S, Shanmugam K (2021) Finite-sample analysis of off-policy TD-learning via generalized Bellman operators. Ranzato M, Beygelzimer A, Dauphin Y, Liang P, Vaughan JW, eds. Adv. Neural Inform. Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, NY), 21440–21452.Google Scholar
  • [18] Chen Z, Maguluri ST, Shakkottai S, Shanmugam K (2024) A Lyapunov theory for finite-sample guarantees of Markovian stochastic approximation. Oper. Res. 72(4):1352–1367.LinkGoogle Scholar
  • [19] Cowell RG, Dawid AP, Lauritzen SL, Spiegelhalter DJ (1999) Probabilistic Networks and Expert Systems: Exact Computational Methods for Bayesian Networks, Information Science and Statistics, 1st ed. (Springer, New York).Google Scholar
  • [20] Dai JG, Dieker AB (2011) Nonnegativity of solutions to the basic adjoint relationship for some diffusion processes. Queueing Systems 68(3):295–303.CrossrefGoogle Scholar
  • [21] Dalal G, Szörényi B, Thoppe G, Mannor S (2018) Finite sample analyses for TD(0) with function approximation. Proc. Thirty-Second AAAI Conf. AAAI’18/IAAI’18/EAAI’18 (AAAI Press, New Orleans), 6144–6160.Google Scholar
  • [22] Dayan P (1992) The convergence of TD(λ) for general λ. Machine Learn. 8(3):341–362.CrossrefGoogle Scholar
  • [23] Dayan P, Sejnowski TJ (1994) TD(λ) converges with probability 1. Machine Learn. 14(3):295–301.CrossrefGoogle Scholar
  • [24] Dieuleveut A, Durmus A, Bach F (2020) Bridging the gap between constant step size stochastic gradient descent and Markov chains. Ann. Statist. 48(3):1348–1382.CrossrefGoogle Scholar
  • [25] Dong J, Tong XT (2022) Stochastic gradient descent with dependent data for offline reinforcement learning. Preprint, submitted February 6, https://arxiv.org/abs/2202.02850.Google Scholar
  • [26] Douc R, Moulines E, Priouret P, Soulier P (2018) Markov Chains, 1st ed. (Springer, Cham, Switzerland).CrossrefGoogle Scholar
  • [27] Durmus A, Moulines E, Naumov A, Samsonov S (2025) Finite-time high-probability bounds for Polyak–Ruppert averaged iterates of linear stochastic approximation. Math. Oper. Res. 50(2):935–964.LinkGoogle Scholar
  • [28] Durmus A, Moulines E, Naumov A, Samsonov S, Scaman K, Wai HT (2021) Tight high probability bounds for linear stochastic approximation with fixed stepsize. Ranzato M, Beygelzimer A, Dauphin Y, Liang P, Vaughan JW, eds. Adv. Neural Inform. Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, NY), 30063–30074.Google Scholar
  • [29] Eryilmaz A, Srikant R (2012) Asymptotically tight steady-state queue length bounds implied by drift conditions. Queueing Systems 72(3):311–359.CrossrefGoogle Scholar
  • [30] Even-Dar E, Mansour Y (2003) Learning rates for Q-learning. J. Machine Learn. Res. 5:1–25.Google Scholar
  • [31] Folland GB (1999) Real Analysis: Modern Techniques and Their Applications, 2nd ed. (Wiley, New York).Google Scholar
  • [32] Halmos PR (2014) Naive Set Theory. Undergraduate Texts in Mathematics (Springer, New York).Google Scholar
  • [33] Harrison JM (1985) Brownian Motion and Stochastic Flow Systems (Wiley, New York).Google Scholar
  • [34] Huo DL, Chen Y, Xie Q (2022) Bias and extrapolation in Markovian linear stochastic approximation with constant stepsizes. Preprint, submitted October 3, https://arxiv.org/abs/2210.00953.Google Scholar
  • [35] Huo DL, Chen Y, Xie Q (2023) Bias and extrapolation in Markovian linear stochastic approximation with constant stepsizes. Abstract Proc. 2023 ACM SIGMETRICS Internat. Conf. Measurement Model. Comput. Systems (Association for Computing Machinery, New York), 81–82.Google Scholar
  • [36] Kakade SM (2003) On the sample complexity of reinforcement learning. PhD thesis, University College London, London.Google Scholar
  • [37] Kearns M, Singh S (1998) Finite-sample convergence rates for Q-learning and indirect algorithms. Kearns M, Solla S, Cohn D, eds. Adv. Neural Inform. Processing Systems, vol. 11 (MIT Press, Cambridge, MA), 996–1002.Google Scholar
  • [38] Khalil HK (2001) Nonlinear Systems, 3rd ed. (Pearson, Upper Saddle River, NJ).Google Scholar
  • [39] Khamaru K, Pananjady A, Ruan F, Wainwright MJ, Jordan MI (2021) Is temporal difference learning optimal? An instance-dependent analysis. SIAM J. Math. Data Sci. 3(4):1013–1040.CrossrefGoogle Scholar
  • [40] Koller D, Parr R (2000) Policy iteration for factored MDPs. Proc. Sixteenth Conf. Uncertainty Artificial Intelligence (Morgan Kaufmann Publishers, San Francisco), 326–334.Google Scholar
  • [41] Kuan CM, Hornik K (1991) Convergence of learning algorithms with constant learning rates. IEEE Trans. Neural Networks 2(5):484–489.CrossrefGoogle Scholar
  • [42] Kushner HJ, Huang H (1981) Asymptotic properties of stochastic approximations with constant coefficients. SIAM J. Control Optim. 19(1):87–105.CrossrefGoogle Scholar
  • [43] Kushner HJ, Huang H (1981) Averaging methods for the asymptotic analysis of learning and adaptive systems, with small adjustment rate. SIAM J. Control Optim. 19(5):635–650.CrossrefGoogle Scholar
  • [44] Kushner HJ, Yin GG (2003) Stochastic Approximation and Recursive Algorithms and Applications, 2nd ed. (Springer, New York).Google Scholar
  • [45] Lagoudakis MG, Parr R (2003) Least-squares policy iteration. J. Machine Learn. Res. 4:1107–1149.Google Scholar
  • [46] Lakshminarayanan C, Szepesvári C (2018) Linear stochastic approximation: How far does constant step-size and iterate averaging go? Storkey A, Perez-Cruz F, eds. Proc. Twenty-First Internat. Conf. Artificial Intelligence Statist., vol. 84 (PMLR, New York), 1347–1355.Google Scholar
  • [47] Lauand CK, Meyn S (2023) The curse of memory in stochastic approximation. 2023 62nd IEEE Conf. Decision Control (CDC) (IEEE, Piscataway, NJ), 7803–7809.Google Scholar
  • [48] Lauand CK, Meyn SP (2024) Markovian foundations for quasi-stochastic approximation with applications to extremum seeking control. Preprint, submitted April 1, https://arxiv.org/abs/2207.06371.Google Scholar
  • [49] Levin DA, Peres Y (2017) Markov Chains and Mixing Times, 2nd ed. (American Mathematical Society, Providence, RI).CrossrefGoogle Scholar
  • [50] Meyn SP, Tweedie RL (2009) Markov Chains and Stochastic Stability, 2nd ed. (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • [51] Mou W, Pananjady A, Wainwright MJ, Bartlett PL (2024) Optimal and instance-dependent guarantees for Markovian linear stochastic approximation. Math. Statist. Learn. 7(1):41–153.CrossrefGoogle Scholar
  • [52] Mou W, Li CJ, Wainwright MJ, Bartlett PL, Jordan MI (2020) On linear stochastic approximation: Fine-grained Polyak–Ruppert and non-asymptotic concentration. Abernethy J, Agarwal S, eds. Proc. Thirty Third Conf. Learn. Theory, vol. 125 (PMLR, New York), 2947–2997.Google Scholar
  • [53] Nagaraj D, Wu X, Bresler G, Jain P, Netrapalli P (2020) Least squares regression with markovian data: Fundamental limits and algorithms. Larochelle H, Ranzato M, Hadsell R, Balcan M, Lin H, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates, Inc., Red Hook, NY), 16666–16676.Google Scholar
  • [54] Paulin D (2015) Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic J. Probab. 20:1–32.CrossrefGoogle Scholar
  • [55] Pflug GC (1986) Stochastic minimization with constant step-size: Asymptotic laws. SIAM J. Control Optim. 24(4):655–666.CrossrefGoogle Scholar
  • [56] Pflug GC (1990) Non-asymptotic confidence bounds for stochastic approximation algorithms with constant step size. Monatshefte Mathematik 110(3):297–314.CrossrefGoogle Scholar
  • [57] Polyak BT (1990) New stochastic approximation type procedures. Automation Remote Control 51(7):98–107.Google Scholar
  • [58] Polyak BT, Juditsky AB (1992) Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30(4):838–855.CrossrefGoogle Scholar
  • [59] Robbins H, Monro S (1951) A stochastic approximation method. Ann. Math. Statist. 22(3):400–407.CrossrefGoogle Scholar
  • [60] Ruppert D (1988) Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University, Ithaca, NY.Google Scholar
  • [61] Shapiro E (1974) On the Lyapunov matrix equation. IEEE Trans. Automatic Control 19(5):594–596.CrossrefGoogle Scholar
  • [62] Srikant R, Ying L (2019) Finite-time error bounds for linear stochastic approximation and TD learning. Beygelzimer A, Hsu D, eds. Proc. Thirty-Second Conf. Learn. Theory, vol. 99 (PMLR, New York), 2803–2830.Google Scholar
  • [63] Stoer J, Bulirsch R (2002) Introduction to Numerical Analysis, 3rd ed. (Springer, New York).CrossrefGoogle Scholar
  • [64] Sutton RS (1988) Learning to predict by the methods of temporal differences. Machine Learn. 3(1):9–44.CrossrefGoogle Scholar
  • [65] Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (Bradford Book, Cambridge, MA).Google Scholar
  • [66] Szepesvári C (1997) The asymptotic convergence-rate of Q-Learning. Jordan M, Kearns M, Solla S, eds. Adv. Neural Inform. Processing Systems, vol. 10 (MIT Press, Cambridge, MA), 1064–1070.Google Scholar
  • [67] Tsitsiklis JN (1994) Asynchronous stochastic approximation and Q-learning. Machine Learn. 16(3):185–202.CrossrefGoogle Scholar
  • [68] Tsitsiklis JN, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control 42(5):674–690.CrossrefGoogle Scholar
  • [69] Villani C (2009) Optimal Transport: Old and New (Springer, Berlin).CrossrefGoogle Scholar
  • [70] Watkins CJCH, Dayan P (1992) Q-learning. Machine Learn. 8(3):279–292.CrossrefGoogle Scholar
  • [71] Wright SJ, Recht B (2022) Optimization for Data Analysis (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • [72] Yu L, Balasubramanian K, Volgushev S, Erdogdu MA (2021) An analysis of constant step size SGD in the non-convex regime: Asymptotic normality and bias. Ranzato M, Beygelzimer A, Dauphin Y, Liang P, Vaughan JW, eds. Adv. Neural Inform. Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, NY), 4234–4248.Google Scholar
  • [73] Zhang Y, Xie Q (2024) Constant stepsize Q-learning: Distributional convergence, bias and extrapolation. Reinforcement Learn. J. 3:1168–1210.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.