Stationary Points of a Shallow Neural Network with Quadratic Activations and the Global Optimality of the Gradient Descent Algorithm

Published Online:https://doi.org/10.1287/moor.2021.0082

References

  • [1] Arora S, Ge R, Neyshabur B, Zhang Y (2018) Stronger generalization bounds for deep nets via a compression approach. Preprint, submitted February 14, https://arxiv.org/abs/1802.05296.Google Scholar
  • [2] Arora S, Du SS, Hu W, Li Z, Wang R (2019) Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. Preprint, submitted January 24, https://arxiv.org/abs/1901.08584.Google Scholar
  • [3] Bai ZD, Yin YQ (1988) Convergence to the semicircle law. Ann. Probab. 16(2):863–875.CrossrefGoogle Scholar
  • [4] Bai ZD, Yin YQ (1993) Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. Ann. Probab. 21(3):1275–1294.CrossrefGoogle Scholar
  • [5] Barron AR (1994) Approximation and estimation bounds for artificial neural networks. Machine Learn. 14(1):115–133.CrossrefGoogle Scholar
  • [6] Bartlett PL, Foster DJ, Telgarsky MJ (2017) Spectrally-normalized margin bounds for neural networks. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 6240–6249.Google Scholar
  • [7] Bartlett PL, Harvey N, Liaw C, Mehrabian A (2019) Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J. Machine Learn. Res. 20(63):1–17.Google Scholar
  • [8] Bhatia R (2013) Matrix Analysis, vol. 169 (Springer Science & Business Media, Berlin, Heidelberg).Google Scholar
  • [9] Blum A, Rivest RL (1989) Training a 3-node neural network is NP-complete. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 494–501.Google Scholar
  • [10] Bölcskei H, Grohs P, Kutyniok G, Petersen P (2019) Optimal approximation with sparsely connected deep neural networks. SIAM J. Math. Data Sci. 1(1):8–45.Google Scholar
  • [11] Brutzkus A, Globerson A (2017) Globally optimal gradient descent for a convnet with gaussian inputs. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 605–614.Google Scholar
  • [12] Brutzkus A, Globerson A, Malach E, Shalev-Shwartz S (2017) SGD learns over-parameterized networks that provably generalize on linearly separable data. Preprint, submitted October 27, https://arxiv.org/abs/1710.10174.Google Scholar
  • [13] Candès EJ, Li X (2014) Solving quadratic equations via phaselift when there are about as many equations as unknowns. Foundations Comput. Math. 14:1017–1026.CrossrefGoogle Scholar
  • [14] Candès EJ, Plan Y (2011) Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Trans. Inform. Theory 57(4):2342–2359.CrossrefGoogle Scholar
  • [15] Candès EJ, Strohmer T, Voroninski V (2013) Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math. 66(8):1241–1274.CrossrefGoogle Scholar
  • [16] Caron R, Traynor T (2005) The zero set of a polynomial. WSMR Report 05-02, University of Windsor, Windsor, Canada.Google Scholar
  • [17] Chen TQ, Rubanova Y, Bettencourt J, Duvenaud DK (2018) Neural ordinary differential equations. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 6571–6583.Google Scholar
  • [18] Chizat L, Bach F (2018) On the global convergence of gradient descent for over-parameterized models using optimal transport. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 3036–3046.Google Scholar
  • [19] Choromanska A, Henaff M, Mathieu M, Ben Arous G, LeCun Y (2015) The loss surfaces of multilayer networks. Artificial Intelligence Statist. (PMLR, New York), 192–204.Google Scholar
  • [20] Collobert R, Weston J (2008) A unified architecture for natural language processing: Deep neural networks with multitask learning. Proc. 25th Internat. Conf. Machine Learn. (ACM, New York), 160–167.Google Scholar
  • [21] De Fauw J, Ledsam JR, Romera-Paredes B, Nikolov S, Tomasev N, Blackwell S, Askham H, et al. (2018) Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine 24(9):1342–1350.CrossrefGoogle Scholar
  • [22] Demanet L, Hand P (2014) Stable optimizationless recovery from phaseless linear measurements. J. Fourier Anal. Appl. 20:199–221.CrossrefGoogle Scholar
  • [23] Deng Y, Li Z, Song Z (2023) An improved sample complexity for rank-1 matrix sensing. Preprint, submitted March 13, https://arxiv.org/abs/2303.06895.Google Scholar
  • [24] Du SS, Lee JD (2018) On the power of over-parametrization in neural networks with quadratic activation, Preprint, submitted March 3, https://arxiv.org/abs/1803.01206.Google Scholar
  • [25] Du SS, Lee JD, Tian Y (2017) When is a convolutional filter easy to learn? Preprint, submitted September 18, https://arxiv.org/abs/1709.06129.Google Scholar
  • [26] Du SS, Zhai X, Poczos B, Singh A (2018) Gradient descent provably optimizes over-parameterized neural networks. Preprint, submitted October 4, https://arxiv.org/abs/1810.02054.Google Scholar
  • [27] Du SS, Lee JD, Li H, Wang L, Zhai X (2018) Gradient descent finds global minima of deep neural networks. Preprint, submitted November 9, https://arxiv.org/abs/1811.03804.Google Scholar
  • [28] Du SS, Lee JD, Tian Y, Poczos B, Singh A (2017) Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. Preprint, submitted December 3, https://arxiv.org/abs/1712.00779.Google Scholar
  • [29] Du SS, Jin C, Lee JD, Jordan MI, Singh A, Poczos B (2017) Gradient descent can take exponential time to escape saddle points. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1067–1077.Google Scholar
  • [30] Dziugaite GK, Roy DM (2017) Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. Preprint, submitted March 31, https://arxiv.org/abs/1703.11008.Google Scholar
  • [31] Eldan R, Shamir O (2016) The power of depth for feedforward neural networks. Conf. Learn. Theory (PMLR, New York), 907–940.Google Scholar
  • [32] Eldar YC, Mendelson S (2014) Phase retrieval: Stability and recovery guarantees. Appl. Comput. Harmonic Anal. 36(3):473–494.Google Scholar
  • [33] Emschwiller M, Gamarnik D, Kızıldağ EC, Zadik I (2020) Neural networks and polynomial regression. Demystifying the overparametrization phenomena. Preprint, submitted March 23, https://arxiv.org/abs/2003.10523.Google Scholar
  • [34] Freeman CD, Bruna J (2016) Topology and geometry of half-rectified network optimization. Preprint, submitted November 4, https://arxiv.org/abs/1611.01540.Google Scholar
  • [35] Fulton W (2000) Eigenvalues, invariant factors, highest weights, and Schubert calculus. Bull. Amer. Math. Soc. 37(3):209–249.CrossrefGoogle Scholar
  • [36] Ge R, Lee JD, Ma T (2017) Learning one-hidden-layer neural networks with landscape design. Preprint, submitted November 1, https://arxiv.org/abs/1711.00501.Google Scholar
  • [37] Ge R, Huang F, Jin C, Yuan Y (2015) Escaping from saddle points—Online stochastic gradient for tensor decomposition. Conf. Learn. Theory (PMLR, New York), 797–842.Google Scholar
  • [38] Goel S, Kanade V, Klivans A, Thaler J (2016) Reliably learning the ReLU in polynomial time. Preprint, submitted November 30, https://arxiv.org/abs/1611.10258.Google Scholar
  • [39] Golowich N, Rakhlin A, Shamir O (2017) Size-independent sample complexity of neural networks. Preprint, submitted December 18, https://arxiv.org/abs/1712.06541.Google Scholar
  • [40] Gonon L, Grigoryeva L, Ortega J-P (2020) Approximation bounds for random neural networks and reservoir systems. Preprint, submitted February 14, https://arxiv.org/abs/2002.05933.Google Scholar
  • [41] Gunasekar S, Woodworth BE, Bhojanapalli S, Neyshabur B, Srebro N (2017) Implicit regularization in matrix factorization. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 30.Google Scholar
  • [42] Haeffele BD, Vidal R (2015) Global optimality in tensor factorization, deep learning, and beyond. Preprint, submitted June 24, https://arxiv.org/abs/1506.07540.Google Scholar
  • [43] Haeffele B, Young E, Vidal R (2014) Structured low-rank matrix factorization: Optimality, algorithm, and applications to image processing. Internat. Conf. Machine Learn. (PMLR, New York), 2007–2015.Google Scholar
  • [44] Hardt M, Ma T (2016) Identity matters in deep learning. Preprint, submitted November 14, https://arxiv.org/abs/1611.04231.Google Scholar
  • [45] Harvey N, Liaw C, Mehrabian A (2017) Nearly-tight VC-dimension bounds for piecewise linear neural networks. Conf. Learn. Theory (PMLR, New York), 1064–1068.Google Scholar
  • [46] He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. Proc. IEEE Conf. Comput. Vision Pattern Recognition (IEEE, Piscataway, NJ), 770–778.Google Scholar
  • [47] Horn RA, Johnson CR (2012) Matrix Analysis (Cambridge University Press, Cambridge, MA).CrossrefGoogle Scholar
  • [48] Huang GB, Zhu QY, Siew CK (2006) Extreme learning machine: Theory and applications. Neurocomputing 70(1–3):489–501.CrossrefGoogle Scholar
  • [49] Janzamin M, Sedghi H, Anandkumar A (2015) Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. Preprint, submitted June 28, https://arxiv.org/abs/1506.08473.Google Scholar
  • [50] Jin C, Ge R, Netrapalli P, Kakade SM, Jordan MI (2017) How to escape saddle points efficiently. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 1724–1732.Google Scholar
  • [51] Kawaguchi K (2016) Deep learning without poor local minima. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 586–594.Google Scholar
  • [52] Khodak M, Tenenholtz N, Mackey L, Fusi N (2021) Initialization and regularization of factorized neural layers. Preprint, submitted May 3, https://arxiv.org/abs/2105.01029.Google Scholar
  • [53] Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1097–1105.Google Scholar
  • [54] Lee JD, Simchowitz M, Jordan MI, Recht B (2016) Gradient descent only converges to minimizers. Conf. Learn. Theory (PMLR, New York), 1246–1257.Google Scholar
  • [55] Levy KY (2016) The power of normalization: Faster evasion of saddle points. Preprint, submitted November 15, https://arxiv.org/abs/1611.04831.Google Scholar
  • [56] Li Y, Yuan Y (2017) Convergence analysis of two-layer neural networks with ReLU activation. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 597–607.Google Scholar
  • [57] Li Y, Ma T, Zhang H (2018) Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. Conf. Learn. Theory (PMLR, New York), 2–47.Google Scholar
  • [58] Liang T, Poggio T, Rakhlin A, Stokes J (2017) Fisher-Rao metric, geometry, and complexity of neural networks. Preprint, submitted November 5, https://arxiv.org/abs/1711.01530.Google Scholar
  • [59] Livni R, Shalev-Shwartz S, Shamir O (2014) On the computational efficiency of training neural networks. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 855–863.Google Scholar
  • [60] Mannelli SS, Vanden-Eijnden E, Zdeborová L (2020) Optimization and generalization of shallow neural networks with quadratic activation functions. Preprint, submitted June 27, https://arxiv.org/abs/2006.15459.Google Scholar
  • [61] Mendelson S (2017) Extending the scope of the small-ball method. Preprint, submitted September 4, https://arxiv.org/abs/1709.00843.Google Scholar
  • [62] Mhaskar H, Liao Q, Poggio T (2016) Learning functions: When is deep better than shallow. Preprint, submitted March 3, https://arxiv.org/abs/1603.00988.Google Scholar
  • [63] Mohamed AR, Dahl GE, Hinton G (2011) Acoustic modeling using deep belief networks. IEEE Trans. Audio Speech Language Processing 20(1):14–22.CrossrefGoogle Scholar
  • [64] Neyshabur B, Bhojanapalli S, Srebro N (2017) A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. Preprint, submitted July 29, https://arxiv.org/abs/1707.09564.Google Scholar
  • [65] Neyshabur B, Tomioka R, Srebro N (2015) Norm-based capacity control in neural networks. Conf. Learn. Theory (PMLR, New York), 1376–1401.Google Scholar
  • [66] Neyshabur B, Bhojanapalli S, McAllester D, Srebro N (2017) Exploring generalization in deep learning. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 5947–5956.Google Scholar
  • [67] Nguyen Q, Hein M (2017) The loss surface of deep and wide neural networks. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 2603–2612.Google Scholar
  • [68] Nguyen Q, Hein M (2018) The loss surface and expressivity of deep convolutional neural networks (OpenReview.net).Google Scholar
  • [69] Pennington J, Worah P (2017) Nonlinear random matrix theory for deep learning. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 2637–2646.Google Scholar
  • [70] Poggio T, Mhaskar H, Rosasco L, Miranda B, Liao Q (2017) Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. Internat. J. Automation Comput. 14(5):503–519.CrossrefGoogle Scholar
  • [71] Poston T, Lee CN, Choie Y, Kwon Y (1991) Local minima and back propagation. IJCNN-91-Seattle Internat. Joint Conf. Neural Networks, vol. 2 (IEEE, Piscataway, NJ), 173–176.Google Scholar
  • [72] Qin L, Song Z, Zhang R (2023) A general algorithm for solving rank-one matrix sensing. Preprint, submitted March 22, https://arxiv.org/abs/2303.12298.Google Scholar
  • [73] Rahimi A, Recht B (2009) Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1313–1320.Google Scholar
  • [74] Recht B, Fazel M, Parrilo PA (2010) Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3):471–501.CrossrefGoogle Scholar
  • [75] Rotskoff GM, Vanden-Eijnden E (2018) Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. Preprint, submitted May 2, https://arxiv.org/abs/1805.00915.Google Scholar
  • [76] Rudin W (1964) Principles of Mathematical Analysis, vol. 3 (McGraw-Hill, New York).Google Scholar
  • [77] Safran I, Shamir O (2017) Spurious local minima are common in two-layer ReLU neural networks. Preprint, submitted December 24, https://arxiv.org/abs/1712.08968.Google Scholar
  • [78] Schmidt-Hieber J (2017) Nonparametric regression using deep neural networks with relu activation function. Preprint, submitted August 22, https://arxiv.org/abs/1708.06633.Google Scholar
  • [79] Sedghi H, Anandkumar A (2014) Provable methods for training neural networks with sparse connectivity. Preprint, submitted December 8, https://arxiv.org/abs/1412.2693.Google Scholar
  • [80] Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, et al. (2017) Mastering the game of go without human knowledge. Nature 550(7676):354–359.CrossrefGoogle Scholar
  • [81] Sirignano J, Spiliopoulos K (2020) Mean field analysis of neural networks: A central limit theorem. Stochastic Processes Appl. 130(3):1820–1852.CrossrefGoogle Scholar
  • [82] Soltanolkotabi M (2017) Learning ReLUs via gradient descent. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 2007–2017.Google Scholar
  • [83] Soltanolkotabi M, Javanmard A, Lee JD (2018) Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Trans. Inform. Theory 65(2):742–769.Google Scholar
  • [84] Song M, Montanari A, Nguyen P (2018) A mean field view of the landscape of two-layers neural networks. Proc. Natl. Acad. Sci. USA 115(33):E7665–E7671.Google Scholar
  • [85] Soudry D, Carmon Y (2016) No bad local minima: Data independent training error guarantees for multilayer neural networks. Preprint, submitted May 26, https://arxiv.org/abs/1605.08361.Google Scholar
  • [86] Soudry D, Hoffer E (2017) Exponentially vanishing sub-optimal local minima in multilayer neural networks. Preprint, submitted February 19, https://arxiv.org/abs/1702.05777.Google Scholar
  • [87] Stöger D, Soltanolkotabi M (2021) Small random initialization is akin to spectral learning: Optimization and generalization guarantees for overparameterized low-rank matrix reconstruction. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), vol. 34, 23831–23843.Google Scholar
  • [88] Telgarsky M (2016) Benefits of depth in neural networks. Preprint, submitted February 14, https://arxiv.org/abs/1602.04485.Google Scholar
  • [89] Tian Y (2017) An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 3404–3413.Google Scholar
  • [90] Venturi L, Bandeira AS, Bruna J (2019) Spurious valleys in one-hidden-layer neural network optimization landscapes. J. Machine Learn. Res. 20(133):1–34.Google Scholar
  • [91] Vershynin R (2010) Introduction to the non-asymptotic analysis of random matrices. Preprint, submitted November 12, https://arxiv.org/abs/1011.3027.Google Scholar
  • [92] Vodrahalli K, Shivanna R, Sathiamoorthy M, Jain S, Chi EH (2022) Nonlinear initialization methods for low-rank neural networks. Preprint, submitted February 2, https://arxiv.org/abs/2202.00834.Google Scholar
  • [93] Wei C, Lee JD, Liu Q, Ma T (2018) On the margin theory of feedforward neural networks. Preprint, submitted October 12, https://arxiv.org/abs/1810.05369.Google Scholar
  • [94] Weinan E, Han J, Jentzen A (2017) Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Comm. Math. Statist. 5(4):349–380.CrossrefGoogle Scholar
  • [95] Wu L, Zhu Z, W E (2017) Toward understanding generalization of deep learning: Perspective of loss landscapes. Preprint, submitted June 30, https://arxiv.org/abs/1706.10239.Google Scholar
  • [96] Zhang C, Bengio S, Hardt M, Recht B, Vinyals O (2016) Understanding deep learning requires rethinking generalization. Preprint, submitted November 10, https://arxiv.org/abs/1611.03530.Google Scholar
  • [97] Zhong K, Jain P, Dhillon IS (2015) Efficient matrix sensing using rank-1 gaussian measurements. Algorithmic Learn. Theory: 26th Internat. Conf. Proc. (Springer, Berlin, Heidelberg), 3–18.Google Scholar
  • [98] Zhong K, Song Z, Dhillon IS (2017) Learning non-overlapping convolutional neural networks with multiple kernels. Preprint, submitted November 8, https://arxiv.org/abs/1711.03440.Google Scholar
  • [99] Zhong K, Song Z, Jain P, Bartlett PL, Dhillon IS (2017) Recovery guarantees for one-hidden-layer neural networks. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 4140–4149.Google Scholar
  • [100] Zhou P, Feng J (2017) The landscape of deep learning algorithms. Preprint, submitted May 19, https://arxiv.org/abs/1705.07038.Google Scholar
  • [101] Zhou Y, Liang Y (2017) Critical points of neural networks: Analytical forms and landscape properties. Preprint, submitted October 30, https://arxiv.org/abs/1710.11205.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.