Open Access

A Diffusion Approximation Theory of Momentum Stochastic Gradient Descent in Nonconvex Optimization

Tianyi Liu
https://orcid.org/0000-0002-5573-5093
School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30318

Zhehui Chen
School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30318

Enlu Zhou
https://orcid.org/0000-0001-5399-6508
School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30318

Tuo Zhao
School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30318

Published Online: 21 Oct 2021
https://doi.org/10.1287/stsy.2021.0083

Volume 11, Issue 4

December 2021

Pages 307-393

Information

Received: October 09, 2019
Accepted: June 10, 2021
Published Online: October 21, 2021

Cite as

Tianyi Liu, Zhehui Chen, Enlu Zhou, Tuo Zhao (2021) A Diffusion Approximation Theory of Momentum Stochastic Gradient Descent in Nonconvex Optimization. Stochastic Systems 11(4):307-323.

https://doi.org/10.1287/stsy.2021.0083
