Temporal difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, because of the nonlinearity in value function approximation, such a coupling leads to nonconvexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparameterization of neural networks, which also plays a vital role in the empirical success of neural TD. We establish the theory for two-layer neural networks in the main paper and extend it to multilayer neural networks in the appendix. Beyond policy evaluation, we establish the global convergence of neural (soft) Q learning.
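
For readers unfamiliar with the setting, the following is a minimal sketch of semi-gradient TD(0) for policy evaluation with a wide two-layer ReLU network. It is an illustration only, not the paper's algorithm: the function names, the 1/sqrt(m) scaling, the fixed random output signs, and the Gaussian initialization are assumptions made for this example, and any projection or iterate averaging that the paper's analysis may rely on is omitted.

```python
import numpy as np

def init_network(d, m, rng):
    # Two-layer ReLU network: f(x) = (1/sqrt(m)) * sum_r b_r * relu(w_r . x).
    # A large width m plays the role of overparameterization in this sketch.
    W = rng.normal(size=(m, d))          # hidden-layer weights (trained)
    b = rng.choice([-1.0, 1.0], size=m)  # output signs, kept fixed (assumption)
    return W, b

def value(W, b, x):
    # Network output for a single feature vector x.
    return float(b @ np.maximum(W @ x, 0.0)) / np.sqrt(len(b))

def grad_W(W, b, x):
    # Gradient of value(W, b, x) with respect to the hidden weights W.
    active = (W @ x > 0.0).astype(float)
    return (b * active)[:, None] * x[None, :] / np.sqrt(len(b))

def td_step(W, b, x, reward, x_next, gamma, lr):
    # Semi-gradient TD(0): the bootstrapped target is treated as a constant,
    # so only the prediction at x is differentiated.
    delta = value(W, b, x) - (reward + gamma * value(W, b, x_next))
    return W - lr * delta * grad_W(W, b, x)
```

Calling `td_step` repeatedly on sampled transitions `(x, reward, x_next)` moves the network's prediction toward the bootstrapped target. Because the target is not differentiated, the update is not the gradient of any fixed objective, which is one reason convergence of this scheme is not obvious.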

Funding: Z. Yang acknowledges the Theory of Reinforcement Learning program at Simons Institute. J. D. Lee acknowledges support of the ARO under MURI Award W911NF-11-1-0304, the Sloan Research Fellowship, NSF CCF 2002272, NSF IIS 2107304, ONR Young Investigator Award, and NSF-CAREER under award #2144994. Z. Wang acknowledges National Science Foundation [Awards 2048075, 2008827, 2015568, 1934931], Simons Institute (Theory of Reinforcement Learning), Amazon, J.P. Morgan, and Two Sigma for their support.

Mathematics of Operations Research

Volume 49, Issue 1

February 2024

Pages 1-651, C2

Article Information

Received: September 12, 2020
Accepted: December 14, 2022
Published Online: April 28, 2023

Cite as

Qi Cai, Zhuoran Yang, Jason D. Lee, Zhaoran Wang (2023) Neural Temporal Difference and Q Learning Provably Converge to Global Optima. Mathematics of Operations Research 49(1):619-651.

https://doi.org/10.1287/moor.2023.1370

Acknowledgments

This work is a generalization of Cai et al. [16].


Neural Temporal Difference and Q Learning Provably Converge to Global Optima
