Stationary Points of a Shallow Neural Network with Quadratic Activations and the Global Optimality of the Gradient Descent Algorithm
References
- [1] (2018) Stronger generalization bounds for deep nets via a compression approach. Preprint, submitted February 14, https://arxiv.org/abs/1802.05296.Google Scholar
- [2] (2019) Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. Preprint, submitted January 24, https://arxiv.org/abs/1901.08584.Google Scholar
- [3] (1988) Convergence to the semicircle law. Ann. Probab. 16(2):863–875.Crossref, Google Scholar
- [4] (1993) Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. Ann. Probab. 21(3):1275–1294.Crossref, Google Scholar
- [5] (1994) Approximation and estimation bounds for artificial neural networks. Machine Learn. 14(1):115–133.Crossref, Google Scholar
- [6] (2017) Spectrally-normalized margin bounds for neural networks. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 6240–6249.Google Scholar
- [7] (2019) Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J. Machine Learn. Res. 20(63):1–17.Google Scholar
- [8] (2013) Matrix Analysis, vol. 169 (Springer Science & Business Media, Berlin, Heidelberg).Google Scholar
- [9] (1989) Training a 3-node neural network is NP-complete. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 494–501.Google Scholar
- [10] (2019) Optimal approximation with sparsely connected deep neural networks. SIAM J. Math. Data Sci. 1(1):8–45.Google Scholar
- [11] (2017) Globally optimal gradient descent for a convnet with gaussian inputs. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 605–614.Google Scholar
- [12] (2017) SGD learns over-parameterized networks that provably generalize on linearly separable data. Preprint, submitted October 27, https://arxiv.org/abs/1710.10174.Google Scholar
- [13] (2014) Solving quadratic equations via phaselift when there are about as many equations as unknowns. Foundations Comput. Math. 14:1017–1026.Crossref, Google Scholar
- [14] (2011) Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Trans. Inform. Theory 57(4):2342–2359.Crossref, Google Scholar
- [15] (2013) Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math. 66(8):1241–1274.Crossref, Google Scholar
- [16] (2005) The zero set of a polynomial. WSMR Report 05-02, University of Windsor, Windsor, Canada.Google Scholar
- [17] (2018) Neural ordinary differential equations. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 6571–6583.Google Scholar
- [18] (2018) On the global convergence of gradient descent for over-parameterized models using optimal transport. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 3036–3046.Google Scholar
- [19] (2015) The loss surfaces of multilayer networks. Artificial Intelligence Statist. (PMLR, New York), 192–204.Google Scholar
- [20] (2008) A unified architecture for natural language processing: Deep neural networks with multitask learning. Proc. 25th Internat. Conf. Machine Learn. (ACM, New York), 160–167.Google Scholar
- [21] (2018) Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine 24(9):1342–1350.Crossref, Google Scholar
- [22] (2014) Stable optimizationless recovery from phaseless linear measurements. J. Fourier Anal. Appl. 20:199–221.Crossref, Google Scholar
- [23] (2023) An improved sample complexity for rank-1 matrix sensing. Preprint, submitted March 13, https://arxiv.org/abs/2303.06895.Google Scholar
- [24] (2018) On the power of over-parametrization in neural networks with quadratic activation, Preprint, submitted March 3, https://arxiv.org/abs/1803.01206.Google Scholar
- [25] (2017) When is a convolutional filter easy to learn? Preprint, submitted September 18, https://arxiv.org/abs/1709.06129.Google Scholar
- [26] (2018) Gradient descent provably optimizes over-parameterized neural networks. Preprint, submitted October 4, https://arxiv.org/abs/1810.02054.Google Scholar
- [27] (2018) Gradient descent finds global minima of deep neural networks. Preprint, submitted November 9, https://arxiv.org/abs/1811.03804.Google Scholar
- [28] (2017) Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. Preprint, submitted December 3, https://arxiv.org/abs/1712.00779.Google Scholar
- [29] (2017) Gradient descent can take exponential time to escape saddle points. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1067–1077.Google Scholar
- [30] (2017) Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. Preprint, submitted March 31, https://arxiv.org/abs/1703.11008.Google Scholar
- [31] (2016) The power of depth for feedforward neural networks. Conf. Learn. Theory (PMLR, New York), 907–940.Google Scholar
- [32] (2014) Phase retrieval: Stability and recovery guarantees. Appl. Comput. Harmonic Anal. 36(3):473–494.Google Scholar
- [33] (2020) Neural networks and polynomial regression. Demystifying the overparametrization phenomena. Preprint, submitted March 23, https://arxiv.org/abs/2003.10523.Google Scholar
- [34] (2016) Topology and geometry of half-rectified network optimization. Preprint, submitted November 4, https://arxiv.org/abs/1611.01540.Google Scholar
- [35] (2000) Eigenvalues, invariant factors, highest weights, and Schubert calculus. Bull. Amer. Math. Soc. 37(3):209–249.Crossref, Google Scholar
- [36] (2017) Learning one-hidden-layer neural networks with landscape design. Preprint, submitted November 1, https://arxiv.org/abs/1711.00501.Google Scholar
- [37] (2015) Escaping from saddle points—Online stochastic gradient for tensor decomposition. Conf. Learn. Theory (PMLR, New York), 797–842.Google Scholar
- [38] (2016) Reliably learning the ReLU in polynomial time. Preprint, submitted November 30, https://arxiv.org/abs/1611.10258.Google Scholar
- [39] (2017) Size-independent sample complexity of neural networks. Preprint, submitted December 18, https://arxiv.org/abs/1712.06541.Google Scholar
- [40] (2020) Approximation bounds for random neural networks and reservoir systems. Preprint, submitted February 14, https://arxiv.org/abs/2002.05933.Google Scholar
- [41] (2017) Implicit regularization in matrix factorization. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 30.Google Scholar
- [42] (2015) Global optimality in tensor factorization, deep learning, and beyond. Preprint, submitted June 24, https://arxiv.org/abs/1506.07540.Google Scholar
- [43] (2014) Structured low-rank matrix factorization: Optimality, algorithm, and applications to image processing. Internat. Conf. Machine Learn. (PMLR, New York), 2007–2015.Google Scholar
- [44] (2016) Identity matters in deep learning. Preprint, submitted November 14, https://arxiv.org/abs/1611.04231.Google Scholar
- [45] (2017) Nearly-tight VC-dimension bounds for piecewise linear neural networks. Conf. Learn. Theory (PMLR, New York), 1064–1068.Google Scholar
- [46] (2016) Deep residual learning for image recognition. Proc. IEEE Conf. Comput. Vision Pattern Recognition (IEEE, Piscataway, NJ), 770–778.Google Scholar
- [47] (2012) Matrix Analysis (Cambridge University Press, Cambridge, MA).Crossref, Google Scholar
- [48] (2006) Extreme learning machine: Theory and applications. Neurocomputing 70(1–3):489–501.Crossref, Google Scholar
- [49] (2015) Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. Preprint, submitted June 28, https://arxiv.org/abs/1506.08473.Google Scholar
- [50] (2017) How to escape saddle points efficiently. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 1724–1732.Google Scholar
- [51] (2016) Deep learning without poor local minima. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 586–594.Google Scholar
- [52] (2021) Initialization and regularization of factorized neural layers. Preprint, submitted May 3, https://arxiv.org/abs/2105.01029.Google Scholar
- [53] (2012) Imagenet classification with deep convolutional neural networks. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1097–1105.Google Scholar
- [54] (2016) Gradient descent only converges to minimizers. Conf. Learn. Theory (PMLR, New York), 1246–1257.Google Scholar
- [55] (2016) The power of normalization: Faster evasion of saddle points. Preprint, submitted November 15, https://arxiv.org/abs/1611.04831.Google Scholar
- [56] (2017) Convergence analysis of two-layer neural networks with ReLU activation. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 597–607.Google Scholar
- [57] (2018) Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. Conf. Learn. Theory (PMLR, New York), 2–47.Google Scholar
- [58] (2017) Fisher-Rao metric, geometry, and complexity of neural networks. Preprint, submitted November 5, https://arxiv.org/abs/1711.01530.Google Scholar
- [59] (2014) On the computational efficiency of training neural networks. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 855–863.Google Scholar
- [60] (2020) Optimization and generalization of shallow neural networks with quadratic activation functions. Preprint, submitted June 27, https://arxiv.org/abs/2006.15459.Google Scholar
- [61] (2017) Extending the scope of the small-ball method. Preprint, submitted September 4, https://arxiv.org/abs/1709.00843.Google Scholar
- [62] (2016) Learning functions: When is deep better than shallow. Preprint, submitted March 3, https://arxiv.org/abs/1603.00988.Google Scholar
- [63] (2011) Acoustic modeling using deep belief networks. IEEE Trans. Audio Speech Language Processing 20(1):14–22.Crossref, Google Scholar
- [64] (2017) A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. Preprint, submitted July 29, https://arxiv.org/abs/1707.09564.Google Scholar
- [65] (2015) Norm-based capacity control in neural networks. Conf. Learn. Theory (PMLR, New York), 1376–1401.Google Scholar
- [66] (2017) Exploring generalization in deep learning. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 5947–5956.Google Scholar
- [67] (2017) The loss surface of deep and wide neural networks. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 2603–2612.Google Scholar
- [68] (2018) The loss surface and expressivity of deep convolutional neural networks (OpenReview.net).Google Scholar
- [69] (2017) Nonlinear random matrix theory for deep learning. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 2637–2646.Google Scholar
- [70] (2017) Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. Internat. J. Automation Comput. 14(5):503–519.Crossref, Google Scholar
- [71] (1991) Local minima and back propagation. IJCNN-91-Seattle Internat. Joint Conf. Neural Networks, vol. 2 (IEEE, Piscataway, NJ), 173–176.Google Scholar
- [72] (2023) A general algorithm for solving rank-one matrix sensing. Preprint, submitted March 22, https://arxiv.org/abs/2303.12298.Google Scholar
- [73] (2009) Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1313–1320.Google Scholar
- [74] (2010) Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3):471–501.Crossref, Google Scholar
- [75] (2018) Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. Preprint, submitted May 2, https://arxiv.org/abs/1805.00915.Google Scholar
- [76] (1964) Principles of Mathematical Analysis, vol. 3 (McGraw-Hill, New York).Google Scholar
- [77] (2017) Spurious local minima are common in two-layer ReLU neural networks. Preprint, submitted December 24, https://arxiv.org/abs/1712.08968.Google Scholar
- [78] (2017) Nonparametric regression using deep neural networks with relu activation function. Preprint, submitted August 22, https://arxiv.org/abs/1708.06633.Google Scholar
- [79] (2014) Provable methods for training neural networks with sparse connectivity. Preprint, submitted December 8, https://arxiv.org/abs/1412.2693.Google Scholar
- [80] (2017) Mastering the game of go without human knowledge. Nature 550(7676):354–359.Crossref, Google Scholar
- [81] (2020) Mean field analysis of neural networks: A central limit theorem. Stochastic Processes Appl. 130(3):1820–1852.Crossref, Google Scholar
- [82] (2017) Learning ReLUs via gradient descent. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 2007–2017.Google Scholar
- [83] (2018) Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Trans. Inform. Theory 65(2):742–769.Google Scholar
- [84] (2018) A mean field view of the landscape of two-layers neural networks. Proc. Natl. Acad. Sci. USA 115(33):E7665–E7671.Google Scholar
- [85] (2016) No bad local minima: Data independent training error guarantees for multilayer neural networks. Preprint, submitted May 26, https://arxiv.org/abs/1605.08361.Google Scholar
- [86] (2017) Exponentially vanishing sub-optimal local minima in multilayer neural networks. Preprint, submitted February 19, https://arxiv.org/abs/1702.05777.Google Scholar
- [87] (2021) Small random initialization is akin to spectral learning: Optimization and generalization guarantees for overparameterized low-rank matrix reconstruction. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), vol. 34, 23831–23843.Google Scholar
- [88] (2016) Benefits of depth in neural networks. Preprint, submitted February 14, https://arxiv.org/abs/1602.04485.Google Scholar
- [89] (2017) An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 3404–3413.Google Scholar
- [90] (2019) Spurious valleys in one-hidden-layer neural network optimization landscapes. J. Machine Learn. Res. 20(133):1–34.Google Scholar
- [91] (2010) Introduction to the non-asymptotic analysis of random matrices. Preprint, submitted November 12, https://arxiv.org/abs/1011.3027.Google Scholar
- [92] (2022) Nonlinear initialization methods for low-rank neural networks. Preprint, submitted February 2, https://arxiv.org/abs/2202.00834.Google Scholar
- [93] (2018) On the margin theory of feedforward neural networks. Preprint, submitted October 12, https://arxiv.org/abs/1810.05369.Google Scholar
- [94] (2017) Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Comm. Math. Statist. 5(4):349–380.Crossref, Google Scholar
- [95] (2017) Toward understanding generalization of deep learning: Perspective of loss landscapes. Preprint, submitted June 30, https://arxiv.org/abs/1706.10239.Google Scholar
- [96] (2016) Understanding deep learning requires rethinking generalization. Preprint, submitted November 10, https://arxiv.org/abs/1611.03530.Google Scholar
- [97] (2015) Efficient matrix sensing using rank-1 gaussian measurements. Algorithmic Learn. Theory: 26th Internat. Conf. Proc. (Springer, Berlin, Heidelberg), 3–18.Google Scholar
- [98] (2017) Learning non-overlapping convolutional neural networks with multiple kernels. Preprint, submitted November 8, https://arxiv.org/abs/1711.03440.Google Scholar
- [99] (2017) Recovery guarantees for one-hidden-layer neural networks. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 4140–4149.Google Scholar
- [100] (2017) The landscape of deep learning algorithms. Preprint, submitted May 19, https://arxiv.org/abs/1705.07038.Google Scholar
- [101] (2017) Critical points of neural networks: Analytical forms and landscape properties. Preprint, submitted October 30, https://arxiv.org/abs/1710.11205.Google Scholar

