Neural Temporal Difference and Q Learning Provably Converge to Global Optima

Published Online:https://doi.org/10.1287/moor.2023.1370

References

  • [1] Achiam J, Knight E, Abbeel P (2019) Toward characterizing divergence in deep Q-learning. Preprint, submitted March 21, https://arxiv.org/abs/1903.08894.Google Scholar
  • [2] Agazzi A, Lu J (2021) Temporal-difference learning for nonlinear value function approximation in the lazy training regime. Preprint, Submitted August 11, https://arxiv.org/abs/1905.10917.Google Scholar
  • [3] Allen-Zhu Z, Li Y, Liang Y (2019) Learning and generalization in overparameterized neural networks, going beyond two layers. Proc. 33rd Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 6158–6169.Google Scholar
  • [4] Allen-Zhu Z, Li Y, Song Z (2019) A convergence theory for deep learning via over-parameterization. Preprint, submitted June 17, https://arxiv.org/abs/1811.03962.Google Scholar
  • [5] Amiranashvili A, Dosovitskiy A, Koltun V, Brox T (2018) TD or not TD: Analyzing the role of temporal differencing in deep reinforcement learning. Preprint, submitted June 4, https://arxiv.org/abs/1806.01175.Google Scholar
  • [6] Arora S, Du SS, Hu W, Li Z, Wang R (2019) Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. Preprint, submitted May 27, https://arxiv.org/abs/1901.08584.Google Scholar
  • [7] Baird LC (1995) Residual algorithms: Reinforcement learning with function approximation. Proc. 20th Internat. Conf. Machine Learn. (ICML, San Diego, CA), 30–37.Google Scholar
  • [8] Bertsekas DP (2019) Feature-based aggregation and deep reinforcement learning: A survey and some new implementations. IEEE/CAA J. Automatica Sinica 6(1):1–31.CrossrefGoogle Scholar
  • [9] Bhandari J, Russo D, Singal R (2021) A finite time analysis of temporal difference learning with linear function approximation. Oper. Res. 69(3):950–973.Google Scholar
  • [10] Borkar VS (2009) Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48 (Cambridge University Press, Cambridge, UK).Google Scholar
  • [11] Borkar VS, Meyn SP (2000) The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim. 38:447–469.CrossrefGoogle Scholar
  • [12] Boyan JA (1999) Least-squares temporal difference learning. Proc. 16th Internat. Conf. Machine Learn. (ICML, San Diego, CA), 49–56.Google Scholar
  • [13] Boyan JA, Moore AW (1994) Generalization in reinforcement learning: Safely approximating the value function. Proc. 7th Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 369–376.Google Scholar
  • [14] Bradtke SJ, Barto AG (1996) Linear least-squares algorithms for temporal difference learning. Machine Learn. 22:33–57.CrossrefGoogle Scholar
  • [15] Brandfonbrener D, Bruna J (2020) Geometric insights into the convergence of nonlinear TD learning. Proc. Internat. Conf. Learn. Representations.Google Scholar
  • [16] Cai Q, Yang Z, Lee JD, Wang Z (2019) Neural temporal-difference learning converges to global optima. Proc. 33th Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 11315–11326.Google Scholar
  • [17] Cao Y, Gu Q (2019) Generalization bounds of stochastic gradient descent for wide and deep neural networks. Proc. 33th Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 10836–10846.Google Scholar
  • [18] Cao Y, Gu Q (2020) Generalization error bounds of gradient descent for learning over-parameterized deep ReLU networks. Proc. AAAI Conf. Artificial Intelligence 34(4):3349–3356.Google Scholar
  • [19] Chizat L, Oyallon E, Bach F (2019) On lazy training in differentiable programming. Proc. 33th Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 2937–2947.Google Scholar
  • [20] Chung W, Nath S, Joseph A, White M (2019) Two-timescale networks for nonlinear value function approximation. Proc. Internat. Conf. Learn. Representations.Google Scholar
  • [21] Dalal G, Szörényi B, Thoppe G, Mannor S (2018) Finite sample analyses for TD(0) with function approximation. Proc. 32nd AAAI Conf. Artificial Intelligence 30th Innovative Appl. Artificial Intelligence Conf. and 8th AAAI Sympos. Ed. Adv. Artificial Intelligence (AAAI Press, Palo Alto, CA), 6144–6160.Google Scholar
  • [22] Daniely A (2017) SGD learns the conjugate kernel class of the network. Adv. Neural Inform. Processing Systems.Google Scholar
  • [23] Dann C, Neumann G, Peters J (2014) Policy evaluation with temporal differences: A survey and comparison. J. Machine Learn. Res. 15:809–883.Google Scholar
  • [24] Du SS, Zhai X, Poczos B, Singh A (2019) Gradient descent provably optimizes over-parameterized neural networks. Proc. Internat. Conf. Learn. Representations.Google Scholar
  • [25] Du SS, Chen J, Li L, Xiao L, Zhou D (2017) Stochastic variance reduction methods for policy evaluation. Proc. Internat. Conf. Machine Learn.Google Scholar
  • [26] Duan Y, Chen X, Houthooft R, Schulman J, Abbeel P (2016) Benchmarking deep reinforcement learning for continuous control. Proc. Internat. Conf. Machine Learn. (PMLR, New York), 1329–1338.Google Scholar
  • [27] Espeholt L, Soyer H, Munos R, Simonyan K, Mnih V, Ward T, Doron Y, et al. (2018) IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. Proc. Internat. Conf. Machine Learn. (PMLR, New York), 1407–1416.Google Scholar
  • [28] Facchinei F, Pang J-S (2007) Finite-Dimensional Variational Inequalities and Complementarity Problems (Springer Science & Business Media, New York).Google Scholar
  • [29] Fan J, Ma C, Zhong Y (2021) A selective overview of deep learning. Statist. Sci.: A Rev. J. Inst. Math. Statist. 36(2):264.Google Scholar
  • [30] Gao R, Cai T, Li H, Wang L, Hsieh C-J, Lee JD (2019) Convergence of adversarial training in overparametrized networks. Proc. 33rd Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 13029–13040.Google Scholar
  • [31] Geist M, Pietquin O (2013) Algorithmic survey of parametric value function approximation. IEEE Trans. Neural Networks Learn. Systems 24:845–867.CrossrefGoogle Scholar
  • [32] Ghavamzadeh M, Lazaric A, Maillard O, Munos R (2010) LSTD with random projections. Proc. 23rd Internat. Conf. Neural Inform. Processing Systems, vol. 1 (NeurIPS, San Diego, CA), 721–729.Google Scholar
  • [33] Guan Z, Xu T, Liang Y (2022) PER-ETD: A polynomially efficient emphatic temporal difference learning method. Internat. Conf. Learn. Representations.Google Scholar
  • [34] Haarnoja T, Tang H, Abbeel P, Levine S (2017) Reinforcement learning with deep energy-based policies. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (ICML, San Diego, CA), 1352–1361.Google Scholar
  • [35] Harker PT, Pang J-S (1990) Finite-dimensional variational inequality and nonlinear complementarity problems: A survey of theory, algorithms and applications. Math. Programming 48:161–220.CrossrefGoogle Scholar
  • [36] Henderson P, Islam R, Bachman P, Pineau J, Precup D, Meger D (2018) Deep reinforcement learning that matters. Proc. 32nd AAAI Conf. Artificial Intelligence 30th Innovative Appl. Artificial Intelligence Conf. 8th AAAI Sympos. Ed. Adv. Artificial Intelligence (AAAI Press, Palo Alto, CA), 3207–3214.Google Scholar
  • [37] Hofmann T, Schölkopf B, Smola AJ (2008) Kernel methods in machine learning. Ann. Statist. 36:1171–1220.CrossrefGoogle Scholar
  • [38] Jaakkola T, Jordan MI, Singh SP (1994) Convergence of stochastic iterative dynamic programming algorithms. Neural Comput. 6(6):1185–1201.Google Scholar
  • [39] Jacot A, Gabriel F, Hongler C (2018) Neural tangent kernel: Convergence and generalization in neural networks. Proc. 32nd Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 8580–8589.Google Scholar
  • [40] Jain P, Kar P (2017) Non-convex optimization for machine learning. Foundations Trends Machine Learn. 10:142–336.CrossrefGoogle Scholar
  • [41] Kushner H, Yin GG (2003) Stochastic Approximation and Recursive Algorithms and Applications (Springer Science & Business Media, New York).Google Scholar
  • [42] Lakshminarayanan C, Szepesvari C (2018) Linear stochastic approximation: How far does constant step-size and iterate averaging go? Proc. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 1347–1355.Google Scholar
  • [43] Lazaric A, Ghavamzadeh M, Munos R (2010) Finite-sample analysis of LSTD. Proc. Internat. Conf. Machine Learn. (ICML, San Diego, CA), 615–622.Google Scholar
  • [44] Lee J, Xiao L, Schoenholz SS, Bahri Y, Novak R, Sohl-Dickstein J, Pennington J (2019) Wide neural networks of any depth evolve as linear models under gradient descent. Proc. 33rd Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 8572–8583.Google Scholar
  • [45] Li Y, Liang Y (2018) Learning overparameterized neural networks via stochastic gradient descent on structured data. Proc. 32nd Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 8168–8177.Google Scholar
  • [46] Liu B, Liu J, Ghavamzadeh M, Mahadevan S, Petrik M (2015) Finite-sample analysis of proximal gradient TD algorithms. Proc. 31st Conf. Uncertainty Artificial Intelligence (AAAI Press, San Diego, CA), 504–513.Google Scholar
  • [47] Maei HR, Szepesv´ri C, Bhatnagar S, Precup D, Silver D, Sutton RS (2009) Convergent temporal-difference learning with arbitrary smooth function approximation. Proc. 22nd Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 1204–1212.Google Scholar
  • [48] Melo FS, Meyn SP, Ribeiro MI (2008) An analysis of reinforcement learning with function approximation. Proc. 25th Internat. Conf. Machine Learn. (ICML, San Diego, CA), 664–671.Google Scholar
  • [49] Neu G, Jonsson A, Gómez V (2017) A unified view of entropy-regularized Markov decision processes. Preprint, submitted May 22, https://arxiv.org/abs/1705.07798.Google Scholar
  • [50] Neyshabur B, Li Z, Bhojanapalli S, LeCun Y, Srebro N (2019) The role of over-parametrization in generalization of neural networks. Proc. Internat. Conf. Learn. Representations.Google Scholar
  • [51] Pfau D, Vinyals O (2016) Connecting generative adversarial networks and actor-critic methods. Preprint, submitted January 18, https://arxiv.org/abs/1610.01945.Google Scholar
  • [52] Rahimi A, Recht B (2007) Random features for large-scale kernel machines. Proc. 20th Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 1177–1184.Google Scholar
  • [53] Rahimi A, Recht B (2008) Uniform approximation of functions with random bases. 46th Annual Allerton Conf. Comm. Control Comput. (IEEE, Piscataway, NJ), 555–561.Google Scholar
  • [54] Schulman J, Moritz P, Levine S, Jordan M, Abbeel P (2015) High-dimensional continuous control using generalized advantage estimation. Preprint, submitted October 20, https://arxiv.org/abs/1506.02438.Google Scholar
  • [55] Singh SP, Sutton RS (1996) Reinforcement learning with replacing eligibility traces. Machine Learn. 22:123–158.CrossrefGoogle Scholar
  • [56] Srikant R, Ying L (2019) Finite-time error bounds for linear stochastic approximation and TD learning. Proc. Conf. Learn. Theory. (PMLR, New York), 2803–2830.Google Scholar
  • [57] Sutton RS (1988) Learning to predict by the methods of temporal differences. Machine Learn. 3:9–44.CrossrefGoogle Scholar
  • [58] Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
  • [59] Sutton RS, Maei HR, Szepesvári C (2009) A convergent o(n) temporal-difference algorithm for off-policy learning with linear function approximation. Adv. Neural Inform. Processing Systems 21(2009):1609–1616.Google Scholar
  • [60] Sutton RS, Maei HR, Precup D, Bhatnagar S, Silver D, Szepesvári C, Wiewiora E (2009) Fast gradient-descent methods for temporal-difference learning with linear function approximation. Proc. 26th Annual Internat. Conf. Machine Learn. (PMLR, New York), 993–1000.Google Scholar
  • [61] Szepesvári C (2010) Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning (Springer, Cham, Switzerland).CrossrefGoogle Scholar
  • [62] Touati A, Bacon P-L, Precup D, Vincent P (2018) Convergent tree-backup and retrace with function approximation. Proc. Internat. Conf. Machine Learn. (PMLR, New York), 4955–4694.Google Scholar
  • [63] Tsitsiklis JN, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control 42(5):674–690.Google Scholar
  • [64] Tu S, Recht B (2018) Least-squares temporal difference learning for the linear quadratic regulator. Proc. Internat. Conf. Machine Learn. (PMLR, New York), 5005–5014.Google Scholar
  • [65] Wang G, Giannakis GB (2020) Finite-time error bounds for biased stochastic approximation with applications to Q-learning. Proc. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 3015–3024.Google Scholar
  • [66] Wang Y, Chen W, Liu Y, Ma Z-M, Liu T-Y (2017) Finite sample analysis of the GTD policy evaluation algorithms in Markov setting. Proc. 31st Internat. Conf. Neural Inform. Processing Systems (NeurIPS, San Diego, CA), 5510–5519.Google Scholar
  • [67] Xu P, Gu Q (2020) A finite-time analysis of Q-learning with neural network function approximation. Proc. Internat. Conf. Machine Learn. (PMLR, New York), 10555–10565.Google Scholar
  • [68] Xu T, Wang Z, Zhou Y, Liang Y (2019) Reanalysis of variance reduced temporal difference learning. Proc. Internat. Conf. Learn. Representations.Google Scholar
  • [69] Zhang C, Bengio S, Hardt M, Recht B, Vinyals O (2017) Understanding deep learning requires rethinking generalization. Preprint, submitted February 26, https://arxiv.org/abs/1611.03530.Google Scholar
  • [70] Zou S, Xu T, Liang Y (2019) Finite-sample analysis for SARSA and Q-learning with linear function approximation. Proc. 33rd Internat. Conf. Neural Inform. Processing Systems (NeurPIS, San Diego, CA), 8668–8678.Google Scholar
  • [71] Zou D, Cao Y, Zhou D, Gu Q (2020) Gradient descent optimizes over-parameterized deep ReLU networks. Machine Learn. 109:467–492.CrossrefGoogle Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.