Suboptimal Local Minima Exist for Wide Neural Networks with Smooth Activations

Published Online: https://doi.org/10.1287/moor.2021.1228
