Suboptimal Local Minima Exist for Wide Neural Networks with Smooth Activations

Tian Ding
https://orcid.org/0000-0002-9383-8405
Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong

Dawei Li
https://orcid.org/0000-0003-0374-3101
Department of Industrial and Enterprise Systems Engineering and Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801

Ruoyu Sun
https://orcid.org/0000-0003-2487-5322
Department of Industrial and Enterprise Systems Engineering and Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801

Published Online: 25 Mar 2022
https://doi.org/10.1287/moor.2021.1228


Mathematics of Operations Research

Volume 47, Issue 4

November 2022

Pages 2547-3399, C2

Article Information

Received: April 10, 2020
Accepted: October 5, 2021
Published Online: March 25, 2022

Cite as

Tian Ding, Dawei Li, Ruoyu Sun (2022) Suboptimal Local Minima Exist for Wide Neural Networks with Smooth Activations. Mathematics of Operations Research 47(4):2784-2814.

https://doi.org/10.1287/moor.2021.1228

Acknowledgments

The authors contributed equally to this paper. The work was done while author Ding was visiting the Department of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana-Champaign. The authors would like to thank Constantin Christof for the helpful advice on this paper.
