[1] Achiam J, Knight E, Abbeel P (2019) Toward characterizing divergence in deep Q-learning. Preprint, submitted March 21, https://arxiv.org/abs/1903.08894.Google Scholar
[2] Agazzi A, Lu J (2021) Temporal-difference learning for nonlinear value function approximation in the lazy training regime. Preprint, Submitted August 11, https://arxiv.org/abs/1905.10917.Google Scholar
[3] Allen-Zhu Z, Li Y, Liang Y (2019) Learning and generalization in overparameterized neural networks, going beyond two layers. Proc. 33rd Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 6158–6169.Google Scholar
[4] Allen-Zhu Z, Li Y, Song Z (2019) A convergence theory for deep learning via over-parameterization. Preprint, submitted June 17, https://arxiv.org/abs/1811.03962.Google Scholar
[5] Amiranashvili A, Dosovitskiy A, Koltun V, Brox T (2018) TD or not TD: Analyzing the role of temporal differencing in deep reinforcement learning. Preprint, submitted June 4, https://arxiv.org/abs/1806.01175.Google Scholar
[6] Arora S, Du SS, Hu W, Li Z, Wang R (2019) Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. Preprint, submitted May 27, https://arxiv.org/abs/1901.08584.Google Scholar
[7] Baird LC (1995) Residual algorithms: Reinforcement learning with function approximation. Proc. 20th Internat. Conf. Machine Learn. (ICML, San Diego, CA), 30–37.Google Scholar
[8] Bertsekas DP (2019) Feature-based aggregation and deep reinforcement learning: A survey and some new implementations. IEEE/CAA J. Automatica Sinica 6(1):1–31.Crossref, Google Scholar
[9] Bhandari J, Russo D, Singal R (2021) A finite time analysis of temporal difference learning with linear function approximation. Oper. Res. 69(3):950–973.Google Scholar
[10] Borkar VS (2009) Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48 (Cambridge University Press, Cambridge, UK).Google Scholar
[11] Borkar VS, Meyn SP (2000) The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim. 38:447–469.Crossref, Google Scholar
[12] Boyan JA (1999) Least-squares temporal difference learning. Proc. 16th Internat. Conf. Machine Learn. (ICML, San Diego, CA), 49–56.Google Scholar
[13] Boyan JA, Moore AW (1994) Generalization in reinforcement learning: Safely approximating the value function. Proc. 7th Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 369–376.Google Scholar
[14] Bradtke SJ, Barto AG (1996) Linear least-squares algorithms for temporal difference learning. Machine Learn. 22:33–57.Crossref, Google Scholar
[15] Brandfonbrener D, Bruna J (2020) Geometric insights into the convergence of nonlinear TD learning. Proc. Internat. Conf. Learn. Representations.Google Scholar
[16] Cai Q, Yang Z, Lee JD, Wang Z (2019) Neural temporal-difference learning converges to global optima. Proc. 33th Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 11315–11326.Google Scholar
[17] Cao Y, Gu Q (2019) Generalization bounds of stochastic gradient descent for wide and deep neural networks. Proc. 33th Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 10836–10846.Google Scholar
[18] Cao Y, Gu Q (2020) Generalization error bounds of gradient descent for learning over-parameterized deep ReLU networks. Proc. AAAI Conf. Artificial Intelligence 34(4):3349–3356.Google Scholar
[19] Chizat L, Oyallon E, Bach F (2019) On lazy training in differentiable programming. Proc. 33th Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 2937–2947.Google Scholar
[20] Chung W, Nath S, Joseph A, White M (2019) Two-timescale networks for nonlinear value function approximation. Proc. Internat. Conf. Learn. Representations.Google Scholar
[21] Dalal G, Szörényi B, Thoppe G, Mannor S (2018) Finite sample analyses for TD(0) with function approximation. Proc. 32nd AAAI Conf. Artificial Intelligence 30th Innovative Appl. Artificial Intelligence Conf. and 8th AAAI Sympos. Ed. Adv. Artificial Intelligence (AAAI Press, Palo Alto, CA), 6144–6160.Google Scholar
[22] Daniely A (2017) SGD learns the conjugate kernel class of the network. Adv. Neural Inform. Processing Systems.Google Scholar
[23] Dann C, Neumann G, Peters J (2014) Policy evaluation with temporal differences: A survey and comparison. J. Machine Learn. Res. 15:809–883.Google Scholar
[24] Du SS, Zhai X, Poczos B, Singh A (2019) Gradient descent provably optimizes over-parameterized neural networks. Proc. Internat. Conf. Learn. Representations.Google Scholar
[25] Du SS, Chen J, Li L, Xiao L, Zhou D (2017) Stochastic variance reduction methods for policy evaluation. Proc. Internat. Conf. Machine Learn.Google Scholar
[26] Duan Y, Chen X, Houthooft R, Schulman J, Abbeel P (2016) Benchmarking deep reinforcement learning for continuous control. Proc. Internat. Conf. Machine Learn. (PMLR, New York), 1329–1338.Google Scholar
[27] Espeholt L, Soyer H, Munos R, Simonyan K, Mnih V, Ward T, Doron Y, et al. (2018) IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. Proc. Internat. Conf. Machine Learn. (PMLR, New York), 1407–1416.Google Scholar
[28] Facchinei F, Pang J-S (2007) Finite-Dimensional Variational Inequalities and Complementarity Problems (Springer Science & Business Media, New York).Google Scholar
[29] Fan J, Ma C, Zhong Y (2021) A selective overview of deep learning. Statist. Sci.: A Rev. J. Inst. Math. Statist. 36(2):264.Google Scholar
[30] Gao R, Cai T, Li H, Wang L, Hsieh C-J, Lee JD (2019) Convergence of adversarial training in overparametrized networks. Proc. 33rd Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 13029–13040.Google Scholar
[31] Geist M, Pietquin O (2013) Algorithmic survey of parametric value function approximation. IEEE Trans. Neural Networks Learn. Systems 24:845–867.Crossref, Google Scholar
[32] Ghavamzadeh M, Lazaric A, Maillard O, Munos R (2010) LSTD with random projections. Proc. 23rd Internat. Conf. Neural Inform. Processing Systems, vol. 1 (NeurIPS, San Diego, CA), 721–729.Google Scholar
[33] Guan Z, Xu T, Liang Y (2022) PER-ETD: A polynomially efficient emphatic temporal difference learning method. Internat. Conf. Learn. Representations.Google Scholar
[34] Haarnoja T, Tang H, Abbeel P, Levine S (2017) Reinforcement learning with deep energy-based policies. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (ICML, San Diego, CA), 1352–1361.Google Scholar
[35] Harker PT, Pang J-S (1990) Finite-dimensional variational inequality and nonlinear complementarity problems: A survey of theory, algorithms and applications. Math. Programming 48:161–220.Crossref, Google Scholar
[36] Henderson P, Islam R, Bachman P, Pineau J, Precup D, Meger D (2018) Deep reinforcement learning that matters. Proc. 32nd AAAI Conf. Artificial Intelligence 30th Innovative Appl. Artificial Intelligence Conf. 8th AAAI Sympos. Ed. Adv. Artificial Intelligence (AAAI Press, Palo Alto, CA), 3207–3214.Google Scholar
[37] Hofmann T, Schölkopf B, Smola AJ (2008) Kernel methods in machine learning. Ann. Statist. 36:1171–1220.Crossref, Google Scholar
[38] Jaakkola T, Jordan MI, Singh SP (1994) Convergence of stochastic iterative dynamic programming algorithms. Neural Comput. 6(6):1185–1201.Google Scholar
[39] Jacot A, Gabriel F, Hongler C (2018) Neural tangent kernel: Convergence and generalization in neural networks. Proc. 32nd Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 8580–8589.Google Scholar
[40] Jain P, Kar P (2017) Non-convex optimization for machine learning. Foundations Trends Machine Learn. 10:142–336.Crossref, Google Scholar
[41] Kushner H, Yin GG (2003) Stochastic Approximation and Recursive Algorithms and Applications (Springer Science & Business Media, New York).Google Scholar
[42] Lakshminarayanan C, Szepesvari C (2018) Linear stochastic approximation: How far does constant step-size and iterate averaging go? Proc. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 1347–1355.Google Scholar
[43] Lazaric A, Ghavamzadeh M, Munos R (2010) Finite-sample analysis of LSTD. Proc. Internat. Conf. Machine Learn. (ICML, San Diego, CA), 615–622.Google Scholar
[44] Lee J, Xiao L, Schoenholz SS, Bahri Y, Novak R, Sohl-Dickstein J, Pennington J (2019) Wide neural networks of any depth evolve as linear models under gradient descent. Proc. 33rd Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 8572–8583.Google Scholar
[45] Li Y, Liang Y (2018) Learning overparameterized neural networks via stochastic gradient descent on structured data. Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 8168–8177.Google Scholar
[46] Liu B, Liu J, Ghavamzadeh M, Mahadevan S, Petrik M (2015) Finite-sample analysis of proximal gradient TD algorithms. Proc. 31st Conf. Uncertainty Artificial Intelligence (AAAI Press, San Diego, CA), 504–513.Google Scholar
[47] Maei HR, Szepesv´ri C, Bhatnagar S, Precup D, Silver D, Sutton RS (2009) Convergent temporal-difference learning with arbitrary smooth function approximation. Proc. 22nd Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 1204–1212.Google Scholar
[48] Melo FS, Meyn SP, Ribeiro MI (2008) An analysis of reinforcement learning with function approximation. Proc. 25th Internat. Conf. Machine Learn. (ICML, San Diego, CA), 664–671.Google Scholar
[49] Neu G, Jonsson A, Gómez V (2017) A unified view of entropy-regularized Markov decision processes. Preprint, submitted May 22, https://arxiv.org/abs/1705.07798.Google Scholar
[50] Neyshabur B, Li Z, Bhojanapalli S, LeCun Y, Srebro N (2019) The role of over-parametrization in generalization of neural networks. Proc. Internat. Conf. Learn. Representations.Google Scholar
[51] Pfau D, Vinyals O (2016) Connecting generative adversarial networks and actor-critic methods. Preprint, submitted January 18, https://arxiv.org/abs/1610.01945.Google Scholar
[52] Rahimi A, Recht B (2007) Random features for large-scale kernel machines. Proc. 20th Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 1177–1184.Google Scholar
[53] Rahimi A, Recht B (2008) Uniform approximation of functions with random bases. 46th Annual Allerton Conf. Comm. Control Comput. (IEEE, Piscataway, NJ), 555–561.Google Scholar
[54] Schulman J, Moritz P, Levine S, Jordan M, Abbeel P (2015) High-dimensional continuous control using generalized advantage estimation. Preprint, submitted October 20, https://arxiv.org/abs/1506.02438.Google Scholar
[55] Singh SP, Sutton RS (1996) Reinforcement learning with replacing eligibility traces. Machine Learn. 22:123–158.Crossref, Google Scholar
[56] Srikant R, Ying L (2019) Finite-time error bounds for linear stochastic approximation and TD learning. Proc. Conf. Learn. Theory. (PMLR, New York), 2803–2830.Google Scholar
[57] Sutton RS (1988) Learning to predict by the methods of temporal differences. Machine Learn. 3:9–44.Crossref, Google Scholar
[58] Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
[59] Sutton RS, Maei HR, Szepesvári C (2009) A convergent o(n) temporal-difference algorithm for off-policy learning with linear function approximation. Adv. Neural Inform. Processing Systems 21(2009):1609–1616.Google Scholar
[60] Sutton RS, Maei HR, Precup D, Bhatnagar S, Silver D, Szepesvári C, Wiewiora E (2009) Fast gradient-descent methods for temporal-difference learning with linear function approximation. Proc. 26th Annual Internat. Conf. Machine Learn. (PMLR, New York), 993–1000.Google Scholar
[61] Szepesvári C (2010) Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning (Springer, Cham, Switzerland).Crossref, Google Scholar
[62] Touati A, Bacon P-L, Precup D, Vincent P (2018) Convergent tree-backup and retrace with function approximation. Proc. Internat. Conf. Machine Learn. (PMLR, New York), 4955–4694.Google Scholar
[63] Tsitsiklis JN, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control 42(5):674–690.Google Scholar
[64] Tu S, Recht B (2018) Least-squares temporal difference learning for the linear quadratic regulator. Proc. Internat. Conf. Machine Learn. (PMLR, New York), 5005–5014.Google Scholar
[65] Wang G, Giannakis GB (2020) Finite-time error bounds for biased stochastic approximation with applications to Q-learning. Proc. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 3015–3024.Google Scholar
[66] Wang Y, Chen W, Liu Y, Ma Z-M, Liu T-Y (2017) Finite sample analysis of the GTD policy evaluation algorithms in Markov setting. Proc. 31st Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 5510–5519.Google Scholar
[67] Xu P, Gu Q (2020) A finite-time analysis of Q-learning with neural network function approximation. Proc. Internat. Conf. Machine Learn. (PMLR, New York), 10555–10565.Google Scholar
[68] Xu T, Wang Z, Zhou Y, Liang Y (2019) Reanalysis of variance reduced temporal difference learning. Proc. Internat. Conf. Learn. Representations.Google Scholar
[69] Zhang C, Bengio S, Hardt M, Recht B, Vinyals O (2017) Understanding deep learning requires rethinking generalization. Preprint, submitted February 26, https://arxiv.org/abs/1611.03530.Google Scholar
[70] Zou S, Xu T, Liang Y (2019) Finite-sample analysis for SARSA and Q-learning with linear function approximation. Proc. 33rd Internat. Conf. Neural Inform. Processing Systems (NeurPIS, San Diego, CA), 8668–8678.Google Scholar
[71] Zou D, Cao Y, Zhou D, Gu Q (2020) Gradient descent optimizes over-parameterized deep ReLU networks. Machine Learn. 109:467–492.Crossref, Google Scholar

cover image Mathematics of Operations Research

Volume 49, Issue 1

February 2024

Pages 1-651, C2

Article Information

Metrics

Information

Received:September 12, 2020
Accepted:December 14, 2022
Published Online:April 28, 2023

Cite as

Qi Cai, Zhuoran Yang, Jason D. Lee, Zhaoran Wang (2023) Neural Temporal Difference and Q Learning Provably Converge to Global Optima. Mathematics of Operations Research 49(1):619-651.

https://doi.org/10.1287/moor.2023.1370

Keywords

Acknowledgments

This work is a generalization of Cai et al. [16].

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Neural Temporal Difference and Q Learning Provably Converge to Global Optima

References

Volume 49, Issue 1

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News