A Concentration Bound for TD(0) with Function Approximation

Published Online:https://doi.org/10.1287/stsy.2023.0055

References

  • Azar MG, Osband I, Munos R (2017) Minimax regret bounds for reinforcement learning. Precup D, Teh YW, eds. Proc. 34th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 70 (PMLR, New York), 263–272.Google Scholar
  • Bhandari J, Russo D, Singal R (2018) A finite time analysis of temporal difference learning with linear function approximation. Proc. 31st Conf. Learn. Theory (PMLR, New York), 1691–1692.Google Scholar
  • Borkar VS (1991) Topics in Controlled Markov Chains, Pitman Research Notes in Mathematics Series, vol. 240 (Longman Scientific & Technical, Harlow, Essex, UK).Google Scholar
  • Borkar VS (2002) On the lock-in probability of stochastic approximation. Combin. Probab. Comput. 11(1):11–20.Google Scholar
  • Borkar VS (2022) Corrigendum to “A concentration bound for contractive stochastic approximation” [Syst. Control Lett. 153 (2021) 104947]. Systems Control Lett. 153:104947.Google Scholar
  • Chandak S, Borkar VS, Dodhia P (2022) Concentration of contractive stochastic approximation and reinforcement learning. Stochastic Systems 12(4):411–430.LinkGoogle Scholar
  • Chandak S, Borkar VS, Dolhare H (2023) A concentration bound for LSPE(λ). Systems Control Lett. 171:105418.Google Scholar
  • Chen Z, Maguluri ST, Zubeldia M (2025) Concentration of contractive stochastic approximation: Additive and multiplicative noise. Ann. Appl. Probab. 35(2):1298–1352.Google Scholar
  • Chen Z, Maguluri ST, Shakkottai S, Shanmugam K (2020) Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. NIPS’20: Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 8223–8234.Google Scholar
  • Chen Z, Maguluri ST, Shakkottai S, Shanmugam K (2021) A Lyapunov theory for finite-sample guarantees of asynchronous Q-learning and TD-learning variants. Preprint, submitted February 2, https://arxiv.org/abs/2102.01567.Google Scholar
  • Chung F, Lu L (2006) Concentration inequalities and martingale inequalities: A survey. Internet Math. 3(1):79–127.Google Scholar
  • Dalal G, Szörényi B, Thoppe G, Mannor S (2018) Finite sample analyses for TD(0) with function approximation. Proc. AAAI Conf. Artificial Intelligence, vol. 32 (AAAI Press, Palo Alto, CA).Google Scholar
  • Even-Dar E, Mansour Y (2003) Learning rates for Q-learning. J. Machine Learn. Res. 5:1–25.Google Scholar
  • Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? Bengio S, Wallach HM, Larochelle H, Grauman K, Cesa-Bianchi N, eds. NIPS’18: Proc. 32nd Internat. Conf. Neural Inform. Processing Systems, vol. 31 (Curran Associates, Inc., Red Hook, NY), 4868–4878.Google Scholar
  • Kamal S (2010) On the convergence, lock-in probability, and sample complexity of stochastic approximation. SIAM J. Control Optim. 48(8):5178–5192.Google Scholar
  • Li G, Cai C, Chen Y, Wei Y, Chi Y (2023) Is Q-learning minimax optimal? A tight sample complexity analysis. Oper. Res. 72(1):222–236.Google Scholar
  • Liu Q, Watbled F (2009) Exponential inequalities for martingales and asymptotic properties of the free energy of directed polymers in a random environment. Stochastic Processes Their Appl. 119(10):3101–3132.Google Scholar
  • Meerkov SM (1972) Simplified description of slow Markov walks. Part II. Automation Remote Control 33(5):761.Google Scholar
  • Patil G, Prashanth LA, Nagaraj D, Precup D (2023) Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation. Proc. 26th Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research, vol. 206 (PMLR, New York), 5438–5448.Google Scholar
  • Prashanth LA, Korda N, Munos R (2021) Concentration bounds for temporal difference learning with linear function approximation: The case of batch data and uniform sampling. Machine Learn. 110(3):559–618.Google Scholar
  • Qu G, Wierman A (2020) Finite-time analysis of asynchronous stochastic approximation and Q-learning. Proc. 33rd Conf. Learn. Theory (PMLR, New York), 3185–3205.Google Scholar
  • Srikant R, Ying L (2019) Finite-time error bounds for linear stochastic approximation and TD learning. Proc. 32nd Conf. Learn. Theory (PMLR, New York), 2803–2830.Google Scholar
  • Tao T, Vu V (2015) Random matrices: Universality of local spectral statistics of non-Hermitian matrices. Ann. Probab. 43(2):782 –874.Google Scholar
  • Thoppe G, Borkar V (2019) A concentration bound for stochastic approximation via Alekseev’s formula. Stochastic Systems 9(1):1–26.LinkGoogle Scholar
  • Tsitsiklis J, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control 42(5):674–690.Google Scholar
  • Yang L, Wang M (2019) Sample-optimal parametric Q-learning using linearly additive features. Proc. 36th Internat. Conf. Machine Learn. (PMLR, New York), 6995–7004.Google Scholar
  • Yang Z, Jin C, Wang Z, Wang M, Jordan M (2020) On function approximation in reinforcement learning: Optimism in the face of large state spaces. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. NIPS’20: Proc. 34th Internat. Conf. Neural Inform. Processing Systems (Curran Associates, Inc., Red Hook, NY), 13903–13916.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.