Free Access

Inexact Nonconvex Newton-Type Methods

Zhewei Yao
Zhewei Yao
[email protected]
https://orcid.org/0000-0001-7678-4321
Department of Mathematics, University of California at Berkeley, Berkeley, California 94720;
Search for more papers by this author
,
Peng Xu
Peng Xu
[email protected]
Institute for Computational and Mathematical Engineering, Stanford University, Stanford, California 94305;
Search for more papers by this author
,
Fred Roosta
Fred Roosta
[email protected]
https://orcid.org/0000-0002-6920-7072
School of Mathematics and Physics, University of Queensland, Brisbane, QLD 4072, Australia;International Computer Science Institute, Berkeley, California 94704;
Search for more papers by this author
,
Michael W. Mahoney
Michael W. Mahoney
[email protected]
International Computer Science Institute, Berkeley, California 94704;Department of Statistics, University of California at Berkeley, Berkeley, California 94720
Search for more papers by this author

Zhewei Yao

[email protected]

https://orcid.org/0000-0001-7678-4321

Department of Mathematics, University of California at Berkeley, Berkeley, California 94720;

Search for more papers by this author

Peng Xu

[email protected]

Institute for Computational and Mathematical Engineering, Stanford University, Stanford, California 94305;

Search for more papers by this author

Fred Roosta

[email protected]

https://orcid.org/0000-0002-6920-7072

School of Mathematics and Physics, University of Queensland, Brisbane, QLD 4072, Australia;International Computer Science Institute, Berkeley, California 94704;

Search for more papers by this author

Michael W. Mahoney

[email protected]

International Computer Science Institute, Berkeley, California 94704;Department of Statistics, University of California at Berkeley, Berkeley, California 94720

Search for more papers by this author

Published Online:4 Mar 2021https://doi.org/10.1287/ijoo.2019.0043

References

Ascher U, Greif C (2011) A First Course on Numerical Methods. Computational Science and Engineering (Society for Industrial and Applied Mathematics).Google Scholar
Bandeira AS, Scheinberg K, Vicente LN (2014) Convergence of trust-region methods based on probabilistic models. SIAM J. Optim. 24(3):1238–1264.Google Scholar
Beck A (2017) First-Order Methods in Optimization. MOS-SIAM Series on Optimization (Society for Industrial and Applied Mathematics, Philadelphia).Google Scholar
Berahas AS, Bollapragada R, Nocedal J (2017) An investigation of Newton-sketch and subsampled Newton methods. Optimization Methods and Software:1–20.Google Scholar
Blanchet J, Cartis C, Menickelly M, Scheinberg K (2019) Convergence rate analysis of a stochastic trust-region method via supermartingales. INFORMS J. Optim. 1(2):92–119.Link, Google Scholar
Bollapragada R, Byrd RH, Nocedal J (2018) Exact and inexact subsampled Newton methods for optimization. IMA J. Numerical Anal. 39(2):545–578.Google Scholar
Carmon Y, Duchi JC (2016) Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv preprint arXiv:1612.00547.Google Scholar
Cartis C, Scheinberg K (2018) Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Math. Programming 169(2):337–375.Google Scholar
Cartis C, Gould NI, Toint PL (2010) On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization problems. SIAM J. Optim. 20(6):2833–2852.Google Scholar
Cartis C, Gould NI, Toint PL (2011a) Adaptive cubic regularisation methods for unconstrained optimization. Part I: Motivation, convergence and numerical results. Math. Programming 127(2):245–295.Google Scholar
Cartis C, Gould NI, Toint PL (2011b) Adaptive cubic regularisation methods for unconstrained optimization. Part II: Worst-case function-and derivative-evaluation complexity. Math. Programming 130(2):295–319.Google Scholar
Cartis C, Gould NI, Toint PL (2011c) Optimal Newton-type methods for nonconvex smooth optimization problems. ERGO technical report 11-009, School of Mathematics, University of Edinburgh, Scotland.Google Scholar
Cartis C, Gould NI, Toint PL (2012) Complexity bounds for second-order optimality in unconstrained optimization. J. Complexity 28(1):93–108.Google Scholar
Chang CC, Lin CJ (2011) LIBSVM: A library for support vector machines. ACM Trans. Intelligent Systems Tech. 2(27):1–27.Google Scholar
Chen R, Menickelly M, Scheinberg K (2015) Stochastic optimization using a trust-region method and random models. Math. Program. 169(2):447–487.Google Scholar
Chen X, Jiang B, Lin T, Zhang S (2018) Adaptively Accelerating Cubic Regularized Newton's Methods for Convex Optimization via Random Sampling. arXiv preprint arXiv:1802.05426.Google Scholar
Choromanska A, Henaff M, Mathieu M, Arous GB, LeCun Y (2015) The loss surfaces of multilayer networks. Artificial Intelligence Statist., 192–204.Google Scholar
Conn AR, Gould NI, Toint PL (2000) Trust Region Methods (SIAM, Philadelphia).Google Scholar
Conn AR, Scheinberg K, Vicente LN (2009) Global convergence of general derivative-free trust-region algorithms to first-and second-order critical points. SIAM J. Optim. 20(1):387–415.Google Scholar
Dauphin YN, Pascanu R, Gulcehre C, Cho K, Ganguli S, Bengio Y (2014) Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Adv. Neural Inform. Processing Systems 27:2933–2941.Google Scholar
Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J. Maching Learn. Res. 12(July):2121–2159.Google Scholar
Fazlyab M, Robey A, Hassani H, Morari M, Pappas G (2019) Efficient and accurate estimation of Lipschitz constants for deep neural networks. Adv. Neural Inform. Processing Systems 33:11427–11438.Google Scholar
Fukumizu K, Amari S (2000) Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks 13(3):317–327.Google Scholar
Ge R, Huang F, Jin C, Yuan Y (2015) Escaping from saddle points-online stochastic gradient for tensor decomposition. COLT, 797–842.Google Scholar
Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier neural networks. Proc. 14th Internat. Conf. Artificial Intelligence Statist., 315–323.Google Scholar
Gratton S, Royer CW, Vicente LN, Zhang Z (2017) Complexity and global rates of trust-region methods based on probabilistic models. IMA J. Numerical Anal. 38(3):1579–1597.Google Scholar
Griewank A (1993) Some bounds on the complexity of gradients, Jacobians, and Hessians. Complexity in Numerical Optimization (World Scientific), 128–162.Google Scholar
He K, Zhang X, Ren S, Sun J (2016a) Deep residual learning for image recognition. Proc. IEEE Conf. Comput. Vision Pattern Recognition, 770–778.Google Scholar
He X, Mudigere D, Smelyanskiy M, Takác M (2016b) Large scale distributed Hessian-free optimization for deep neural network. arXiv preprint arXiv:1606.00511.Google Scholar
Hillar CJ, Lim LH (2013) Most tensor problems are NP-hard. J. ACM 60(6):1–39.Google Scholar
Jin C, Ge R, Netrapalli P, Kakade SM, Jordan MI (2017) How to escape saddle points efficiently. Proc. 34th Internat. Conf. Machine Learn., vol. 70, 1724–1732.Google Scholar
Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.Google Scholar
Kiros R (2013) Training neural networks with stochastic Hessian-free optimization. arXiv:1301.3641.Google Scholar
Kohler JM, Lucchi A (2017) Sub-sampled cubic regularization for non-convex optimization. Proc. 34th Internat. Conf. Machine Learn., vol. 70, 1895–1904.Google Scholar
Kylasa S, Roosta F, Mahoney MW, Grama A (2019) GPU accelerated sub-sampled Newton’s method for convex classification problems. Proc. 2019 SIAM Internat. Conf. Data Mining, 702–710.Google Scholar
Lan G (2020) First-order and Stochastic Optimization Methods for Machine Learning. Springer Series in the Data Sciences (Springer International Publishing, Springer, Cham).Google Scholar
Larson J, Billups SC (2016) Stochastic derivative-free optimization using a trust region framework. Comput. Optim. Appl. 64(3):619–645.Google Scholar
LeCun YA, Bottou L, Orr GB, Müller KR (2012) Efficient backprop. Neural Networks: Tricks of the Trade (Springer, Berlin, Heidelberg), 9–48.Google Scholar
Levy KY (2016) The power of normalization: Faster evasion of saddle points. arXiv:1611.04831.Google Scholar
Lin Z, Li H, Fang C (2020) Accelerated Optimization for Machine Learning: First-Order Algorithms (Springer, Singapore).Google Scholar
Martens J (2010) Deep learning via Hessian-free optimization. Internat. Conf. Machine Learn. (Vol. 27, pp. 735–742).Google Scholar
Murty KG, Kabadi SN (1987) Some NP-complete problems in quadratic and nonlinear programming. Math. Programming 39(2):117–129.Google Scholar
Nesterov Y, Polyak BT (2006) Cubic regularization of Newton method and its global performance. Math. Programming 108(1):177–205.Google Scholar
Nocedal J, Wright S (2006) Numerical Optimization (Springer Science & Business Media, New York).Google Scholar
Pearlmutter BA (1994) Fast exact multiplication by the Hessian. Neural Comput. 6(1):147–160.Google Scholar
Pilanci M, Wainwright MJ (2015) Newton sketch: A linear-time optimization algorithm with linear-quadratic convergence. SIAM J. Optimization 27(1):205–245.Google Scholar
Regier J, Jordan MI, McAuliffe J (2017) Fast black-box variational inference through stochastic trust-region optimization. Advances in Neural Information Processing Systems (pp. 2402–2411).Google Scholar
Roosta F, Mahoney MW (2019) Sub-sampled Newton methods. Math. Programming 174(1–2):293–326.Google Scholar
Saxe AM, McClelland JL, Ganguli S (2013) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120.Google Scholar
Shashaani S, Hashemi FS, Pasupathy R (2018) ASTRO-DF: A class of adaptive sampling trust-region algorithms for derivative-free stochastic optimization. SIAM J. Optim. 28(4):3145–3176.Google Scholar
Steihaug T (1983) The conjugate gradient method and trust regions in large scale optimization. SIAM J. Numerical Anal. 20(3):626–637.Google Scholar
Sutskever I, Martens J, Dahl G, Hinton G (2013) On the importance of initialization and momentum in deep learning. Internat. Conf. Machine Learn., 1139–1147.Google Scholar
Swirszcz G, Czarnecki WM, Pascanu R (2016) Local minima in training of deep networks. arXiv:1611.06310.Google Scholar
Tripuraneni N, Stern M, Jin C, Regier J, Jordan MI (2018) Stochastic cubic regularization for fast nonconvex optimization. Adv. Neural Inform. Processing Systems 32:2899–2908.Google Scholar
Vinyals O, Povey D (2012) Krylov subspace descent for deep learning. AISTATS, 1261–1268.Google Scholar
Wiesler S, Li J, Xue J (2013) Investigations on Hessian-free optimization for cross-entropy training of deep neural networks. INTERSPEECH, 3317–3321.Google Scholar
Xu P, Roosta F, Mahoney MW (2019) Newton-type methods for non-convex optimization under inexact Hessian information. Math. Programming 184:35–70.Google Scholar
Xu P, Roosta F, Mahoney MW (2020) Second-order optimization for non-convex machine learning: An empirical study. Proc. 2020 SIAM Internat. Conf. Data Mining (SIAM).Google Scholar
Xu P, Yang J, Roosta F, Ré C, Mahoney MW (2016) Sub-sampled Newton methods with non-uniform sampling. Adv. Neural Inform. Processing Systems 29:3000–3008.Google Scholar

cover image INFORMS Journal on Optimization

Volume 3, Issue 2

Spring 2021

Pages 119-226, C3

Article Information

Metrics

Information

Published Online:March 04, 2021

Cite as

Zhewei Yao, Peng Xu,Fred Roosta, Michael W. Mahoney (2021) Inexact Nonconvex Newton-Type Methods. INFORMS Journal on Optimization 3(2):154-182.

https://doi.org/10.1287/ijoo.2019.0043

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Inexact Nonconvex Newton-Type Methods

References

Volume 3, Issue 2

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News