A Diffusion Approximation Theory of Momentum Stochastic Gradient Descent in Nonconvex Optimization
Published Online: 21 Oct 2021, https://doi.org/10.1287/stsy.2021.0083

