Suboptimal Local Minima Exist for Wide Neural Networks with Smooth Activations

