Inexact Nonconvex Newton-Type Methods

Published Online:https://doi.org/10.1287/ijoo.2019.0043

References

  • Ascher U, Greif C (2011) A First Course on Numerical Methods. Computational Science and Engineering (Society for Industrial and Applied Mathematics).Google Scholar
  • Bandeira AS, Scheinberg K, Vicente LN (2014) Convergence of trust-region methods based on probabilistic models. SIAM J. Optim. 24(3):1238–1264.Google Scholar
  • Beck A (2017) First-Order Methods in Optimization. MOS-SIAM Series on Optimization (Society for Industrial and Applied Mathematics, Philadelphia).Google Scholar
  • Berahas AS, Bollapragada R, Nocedal J (2017) An investigation of Newton-sketch and subsampled Newton methods. Optimization Methods and Software:1–20.Google Scholar
  • Blanchet J, Cartis C, Menickelly M, Scheinberg K (2019) Convergence rate analysis of a stochastic trust-region method via supermartingales. INFORMS J. Optim. 1(2):92–119.LinkGoogle Scholar
  • Bollapragada R, Byrd RH, Nocedal J (2018) Exact and inexact subsampled Newton methods for optimization. IMA J. Numerical Anal. 39(2):545–578.Google Scholar
  • Carmon Y, Duchi JC (2016) Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv preprint arXiv:1612.00547.Google Scholar
  • Cartis C, Scheinberg K (2018) Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Math. Programming 169(2):337–375.Google Scholar
  • Cartis C, Gould NI, Toint PL (2010) On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization problems. SIAM J. Optim. 20(6):2833–2852.Google Scholar
  • Cartis C, Gould NI, Toint PL (2011a) Adaptive cubic regularisation methods for unconstrained optimization. Part I: Motivation, convergence and numerical results. Math. Programming 127(2):245–295.Google Scholar
  • Cartis C, Gould NI, Toint PL (2011b) Adaptive cubic regularisation methods for unconstrained optimization. Part II: Worst-case function-and derivative-evaluation complexity. Math. Programming 130(2):295–319.Google Scholar
  • Cartis C, Gould NI, Toint PL (2011c) Optimal Newton-type methods for nonconvex smooth optimization problems. ERGO technical report 11-009, School of Mathematics, University of Edinburgh, Scotland.Google Scholar
  • Cartis C, Gould NI, Toint PL (2012) Complexity bounds for second-order optimality in unconstrained optimization. J. Complexity 28(1):93–108.Google Scholar
  • Chang CC, Lin CJ (2011) LIBSVM: A library for support vector machines. ACM Trans. Intelligent Systems Tech. 2(27):1–27.Google Scholar
  • Chen R, Menickelly M, Scheinberg K (2015) Stochastic optimization using a trust-region method and random models. Math. Program. 169(2):447–487.Google Scholar
  • Chen X, Jiang B, Lin T, Zhang S (2018) Adaptively Accelerating Cubic Regularized Newton's Methods for Convex Optimization via Random Sampling. arXiv preprint arXiv:1802.05426.Google Scholar
  • Choromanska A, Henaff M, Mathieu M, Arous GB, LeCun Y (2015) The loss surfaces of multilayer networks. Artificial Intelligence Statist., 192–204.Google Scholar
  • Conn AR, Gould NI, Toint PL (2000) Trust Region Methods (SIAM, Philadelphia).Google Scholar
  • Conn AR, Scheinberg K, Vicente LN (2009) Global convergence of general derivative-free trust-region algorithms to first-and second-order critical points. SIAM J. Optim. 20(1):387–415.Google Scholar
  • Dauphin YN, Pascanu R, Gulcehre C, Cho K, Ganguli S, Bengio Y (2014) Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Adv. Neural Inform. Processing Systems 27:2933–2941.Google Scholar
  • Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J. Maching Learn. Res. 12(July):2121–2159.Google Scholar
  • Fazlyab M, Robey A, Hassani H, Morari M, Pappas G (2019) Efficient and accurate estimation of Lipschitz constants for deep neural networks. Adv. Neural Inform. Processing Systems 33:11427–11438.Google Scholar
  • Fukumizu K, Amari S (2000) Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks 13(3):317–327.Google Scholar
  • Ge R, Huang F, Jin C, Yuan Y (2015) Escaping from saddle points-online stochastic gradient for tensor decomposition. COLT, 797–842.Google Scholar
  • Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier neural networks. Proc. 14th Internat. Conf. Artificial Intelligence Statist., 315–323.Google Scholar
  • Gratton S, Royer CW, Vicente LN, Zhang Z (2017) Complexity and global rates of trust-region methods based on probabilistic models. IMA J. Numerical Anal. 38(3):1579–1597.Google Scholar
  • Griewank A (1993) Some bounds on the complexity of gradients, Jacobians, and Hessians. Complexity in Numerical Optimization (World Scientific), 128–162.Google Scholar
  • He K, Zhang X, Ren S, Sun J (2016a) Deep residual learning for image recognition. Proc. IEEE Conf. Comput. Vision Pattern Recognition, 770–778.Google Scholar
  • He X, Mudigere D, Smelyanskiy M, Takác M (2016b) Large scale distributed Hessian-free optimization for deep neural network. arXiv preprint arXiv:1606.00511.Google Scholar
  • Hillar CJ, Lim LH (2013) Most tensor problems are NP-hard. J. ACM 60(6):1–39.Google Scholar
  • Jin C, Ge R, Netrapalli P, Kakade SM, Jordan MI (2017) How to escape saddle points efficiently. Proc. 34th Internat. Conf. Machine Learn., vol. 70, 1724–1732.Google Scholar
  • Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.Google Scholar
  • Kiros R (2013) Training neural networks with stochastic Hessian-free optimization. arXiv:1301.3641.Google Scholar
  • Kohler JM, Lucchi A (2017) Sub-sampled cubic regularization for non-convex optimization. Proc. 34th Internat. Conf. Machine Learn., vol. 70, 1895–1904.Google Scholar
  • Kylasa S, Roosta F, Mahoney MW, Grama A (2019) GPU accelerated sub-sampled Newton’s method for convex classification problems. Proc. 2019 SIAM Internat. Conf. Data Mining, 702–710.Google Scholar
  • Lan G (2020) First-order and Stochastic Optimization Methods for Machine Learning. Springer Series in the Data Sciences (Springer International Publishing, Springer, Cham).Google Scholar
  • Larson J, Billups SC (2016) Stochastic derivative-free optimization using a trust region framework. Comput. Optim. Appl. 64(3):619–645.Google Scholar
  • LeCun YA, Bottou L, Orr GB, Müller KR (2012) Efficient backprop. Neural Networks: Tricks of the Trade (Springer, Berlin, Heidelberg), 9–48.Google Scholar
  • Levy KY (2016) The power of normalization: Faster evasion of saddle points. arXiv:1611.04831.Google Scholar
  • Lin Z, Li H, Fang C (2020) Accelerated Optimization for Machine Learning: First-Order Algorithms (Springer, Singapore).Google Scholar
  • Martens J (2010) Deep learning via Hessian-free optimization. Internat. Conf. Machine Learn. (Vol. 27, pp. 735–742).Google Scholar
  • Murty KG, Kabadi SN (1987) Some NP-complete problems in quadratic and nonlinear programming. Math. Programming 39(2):117–129.Google Scholar
  • Nesterov Y, Polyak BT (2006) Cubic regularization of Newton method and its global performance. Math. Programming 108(1):177–205.Google Scholar
  • Nocedal J, Wright S (2006) Numerical Optimization (Springer Science & Business Media, New York).Google Scholar
  • Pearlmutter BA (1994) Fast exact multiplication by the Hessian. Neural Comput. 6(1):147–160.Google Scholar
  • Pilanci M, Wainwright MJ (2015) Newton sketch: A linear-time optimization algorithm with linear-quadratic convergence. SIAM J. Optimization 27(1):205–245.Google Scholar
  • Regier J, Jordan MI, McAuliffe J (2017) Fast black-box variational inference through stochastic trust-region optimization. Advances in Neural Information Processing Systems (pp. 2402–2411).Google Scholar
  • Roosta F, Mahoney MW (2019) Sub-sampled Newton methods. Math. Programming 174(1–2):293–326.Google Scholar
  • Saxe AM, McClelland JL, Ganguli S (2013) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120.Google Scholar
  • Shashaani S, Hashemi FS, Pasupathy R (2018) ASTRO-DF: A class of adaptive sampling trust-region algorithms for derivative-free stochastic optimization. SIAM J. Optim. 28(4):3145–3176.Google Scholar
  • Steihaug T (1983) The conjugate gradient method and trust regions in large scale optimization. SIAM J. Numerical Anal. 20(3):626–637.Google Scholar
  • Sutskever I, Martens J, Dahl G, Hinton G (2013) On the importance of initialization and momentum in deep learning. Internat. Conf. Machine Learn., 1139–1147.Google Scholar
  • Swirszcz G, Czarnecki WM, Pascanu R (2016) Local minima in training of deep networks. arXiv:1611.06310.Google Scholar
  • Tripuraneni N, Stern M, Jin C, Regier J, Jordan MI (2018) Stochastic cubic regularization for fast nonconvex optimization. Adv. Neural Inform. Processing Systems 32:2899–2908.Google Scholar
  • Vinyals O, Povey D (2012) Krylov subspace descent for deep learning. AISTATS, 1261–1268.Google Scholar
  • Wiesler S, Li J, Xue J (2013) Investigations on Hessian-free optimization for cross-entropy training of deep neural networks. INTERSPEECH, 3317–3321.Google Scholar
  • Xu P, Roosta F, Mahoney MW (2019) Newton-type methods for non-convex optimization under inexact Hessian information. Math. Programming 184:35–70.Google Scholar
  • Xu P, Roosta F, Mahoney MW (2020) Second-order optimization for non-convex machine learning: An empirical study. Proc. 2020 SIAM Internat. Conf. Data Mining (SIAM).Google Scholar
  • Xu P, Yang J, Roosta F, Ré C, Mahoney MW (2016) Sub-sampled Newton methods with non-uniform sampling. Adv. Neural Inform. Processing Systems 29:3000–3008.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.