Global Convergence of Stochastic Gradient Hamiltonian Monte Carlo for Nonconvex Stochastic Optimization: Nonasymptotic Performance Bounds and Momentum-Based Acceleration

Xuefeng Gao
Xuefeng Gao
[email protected]
https://orcid.org/0000-0003-2424-8257
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong;
Search for more papers by this author
,
Mert Gürbüzbalaban
Mert Gürbüzbalaban
[email protected]
https://orcid.org/0000-0002-0575-2450
Department of Management Science and Information Systems, Rutgers Business School, Piscataway, New Jersey 08854;
Search for more papers by this author
,
Lingjiong Zhu
Lingjiong Zhu
[email protected]
Department of Mathematics, Florida State University, Tallahassee, Florida 32306
Search for more papers by this author

Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong;

Search for more papers by this author

Mert Gürbüzbalaban

[email protected]

https://orcid.org/0000-0002-0575-2450

Department of Management Science and Information Systems, Rutgers Business School, Piscataway, New Jersey 08854;

Search for more papers by this author

Lingjiong Zhu

[email protected]

Department of Mathematics, Florida State University, Tallahassee, Florida 32306

Search for more papers by this author

Published Online:22 Oct 2021https://doi.org/10.1287/opre.2021.2162

References

Ahn S, Korattikara A, Welling M (2012) Bayesian posterior sampling via stochastic gradient Fisher scoring. Internat. Conf. Machine Learn., 1771–1778.Google Scholar
Allen-Zhu Z, Hazan E (2016) Variance reduction for faster non-convex optimization. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn., vol. 48 (PMLR, New York), 699–707.Google Scholar
Anil C, Lucas J, Grosse R (2019) Sorting out Lipschitz function approximation. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., vol. 97 (PMLR, New York), 291–301.Google Scholar
Belloni A, Liang T, Narayanan H, Rakhlin A (2015) Escaping the local minima via simulated annealing: Optimization of approximately convex functions. Conf. Learn. Theory, 240–265.Google Scholar
Bertsimas D, Tsitsiklis J (1993) Simulated annealing. Statist. Sci. 8(1):10–15.Crossref, Google Scholar
Betancourt M (2017) A conceptual introduction to Hamiltonian Monte Carlo. Preprint, submitted January 10, https://arxiv.org/abs/1701.02434.Google Scholar
Betancourt M, Byrne S, Girolami M (2014) Optimizing the integrator step size for Hamiltonian Monte Carlo. Preprint, submitted November 24, https://arxiv.org/abs/1411.6669.Google Scholar
Betancourt M, Byrne S, Livingstone S, Girolami M (2017) The geometric foundations of Hamiltonian Monte Carlo. Bernoulli 23(4A):2257–2298.Crossref, Google Scholar
Bolley F, Villani C (2005) Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. Annales de la Faculté des Sciences de Toulouse: Mathématiques 14(3):331–352.Crossref, Google Scholar
Borkar VS, Mitter SK (1999) A strong approximation theorem for stochastic recursive algorithms. J. Optim. Theory Appl. 100(3):499–513.Crossref, Google Scholar
Bovier A, Gayrard V, Klein M (2005) Metastability in reversible diffusion processes II: Precise asymptotics for small eigenvalues. J. Eur. Math. Soc. 7(1):69–99.Crossref, Google Scholar
Carmon Y, Duchi J, Hinder O, Sidford A (2018) Accelerated methods for nonconvex optimization. SIAM J. Optim. 28(2):1751–1772.Crossref, Google Scholar
Chapelle O, Do CB, Teo CH, Le QV, Smola AJ (2009) Tighter bounds for structured estimation. Koller D, Schuurmans D, Bengio Y, Bottou L, eds. Advances in Neural Information Processing Systems, vol. 21 (Curran Associates, Inc.), 281–288.Google Scholar
Chatterji NS, Flammarion N, Ma YA, Bartlett PL, Jordan MI (2018) On the theory of variance reduction for stochastic gradient Monte Carlo. Internat. Conf. Machine Learn., 764–773.Google Scholar
Chaudhari P, Choromanska A, Soatto S, LeCun Y, Baldassi C, Borgs C, Chayes J, Sagun L, Zecchina R (2017) Entropy-SGD: Biasing gradient descent into wide valleys. Internat. Conf. Learn. Representations.Google Scholar
Chen C, Ding N, Carin L (2015) On the convergence of stochastic gradient MCMC algorithms with high-order integrators. Proc. 28th Internat. Conf. Neural Inform. Processing Systems, vol. 2, 2278–2286.Google Scholar
Chen T, Fox E, Guestrin C (2014) Stochastic gradient Hamiltonian Monte Carlo. Xing EP, Jebara T, eds. Proc. 31st Internat. Conf. Machine Learn., vol. 32 (PMLR, Beijing), 1683–1691.Google Scholar
Chen C, Carlson D, Gan Z, Li C, Carin L (2016) Bridging the gap between stochastic gradient MCMC and stochastic optimization. Proc. 19th Internat. Conf. Artificial Intelligence Statist., 1051–1060.Google Scholar
Cheng X, Chatterji NS, Bartlett PL, Jordan MI (2018a) Underdamped Langevin MCMC: A non-asymptotic analysis. Proc. 31st Annual Conf. Learn. Theory.Google Scholar
Cheng X, Chatterji NS, Abbasi-Yadkori Y, Bartlett PL, Jordan MI (2018b) Sharp convergence rates for Langevin dynamics in the nonconvex setting. Preprint, submitted May 4, https://arxiv.org/abs/1805.01648.Google Scholar
Chiang TS, Hwang CR, Sheu SJ (1987) Diffusion for global optimization in Rn. SIAM J. Control Optim. 25(3):737–753.Crossref, Google Scholar
Collobert R, Sinz F, Weston J, Bottou L (2006) Trading convexity for scalability. Proc. 23rd Internat. Conf. Machine Learn., 201–208.Google Scholar
Dalalyan AS (2017) Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. Roy. Statist. Soc. Ser. B 79(3):651–676.Crossref, Google Scholar
Dalalyan AS, Karagulyan AG (2019) User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes Their Appl. 129(12):5278–5311.Crossref, Google Scholar
Dalalyan AS, Riou-Durand L (2020) On sampling from a log-concave density using kinetic Langevin diffusions. Bernoulli 26(3):1956–1988.Crossref, Google Scholar
Dalalyan AS, Karagulyan A, Riou-Durand L (2019) Bounding the error of discretized Langevin algorithms for non-strongly log-concave targets. Preprint, submitted June 20, https://arxiv.org/abs/1906.08530.Google Scholar
Ding N, Fang Y, Babbush R, Chen C, Skeel RD, Neven H (2014) Bayesian sampling using stochastic gradient thermostats. Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 27 (Curran Associates, Inc.), 3203–3211.Google Scholar
Du SS, Lee JD, Tian Y, Singh A, Poczos B (2018) Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. Internat. Conf. Machine Learn., 1339–1348.Google Scholar
Duane S, Kennedy AD, Pendleton BJ, Roweth D (1987) Hybrid Monte Carlo. Phys. Lett. B. 195(2):216–222.Crossref, Google Scholar
Eberle A, Guillin A, Zimmer R (2019) Couplings and quantitative contraction rates for Langevin dynamics. Ann. Probab. 47(4):1982–2010.Crossref, Google Scholar
Foster DJ, Sekhari A, Sridharan K (2018) Uniform convergence of gradients for non-convex learning and optimization. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inform. Processing Systems (Curran Associates, Inc.) 31:8745–8756.Google Scholar
Ge R, Lee JD, Ma T (2018) Learning one-hidden-layer neural networks with landscape design. Internat. Conf. Learn. Representations.Google Scholar
Gelfand SB, Mitter SK (1991) Recursive stochastic algorithms for global optimization in Rd. SIAM J. Control Optim. 29(5):999–1018.Crossref, Google Scholar
Ghadimi S, Lan G (2016) Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Programming 156(1):59–99.Crossref, Google Scholar
Gidas B (1985) Nonstationary Markov chains and convergence of the annealing algorithm. J. Statist. Phys. 39(1–2):73–131.Crossref, Google Scholar
Hajek B (1985) A tutorial survey of theory and applications of simulated annealing. 24th IEEE Conf. Decision Control (IEEE), 755–760.Crossref, Google Scholar
Hale J (1988) Asymptotic Behavior of Dissipative Systems, vol. 25 (American Mathematical Society).Google Scholar
Hardt M, Recht B, Singer Y (2016) Train faster, generalize better: stability of stochastic gradient descent. Internat. Conf. Machine Learn., 1225–1234.Google Scholar
Hazan E, Levy KY, Shalev-Shwartz S (2016) On graduated optimization for stochastic non-convex problems. Internat. Conf. Machine Learn., 1833–1841.Google Scholar
Hérau F, Nier F (2004) Isotropic hypoellipticity and trend to equilibrium for the Fokker-Planck equation with a high-degree potential. Arch. Rational Mechanics Anal. 171(2):151–218.Crossref, Google Scholar
Hoerl AE, Kennard RW (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1):55–67.Crossref, Google Scholar
Holley RA, Kusuoka S, Stroock DW (1989) Asymptotics of the spectral gap with applications to the theory of simulated annealing. J. Functional Anal. 83(2):333–347.Crossref, Google Scholar
Hu Y, Li C, Meng K, Qin J, Yang X (2017) Group sparse optimization via ℓp,q regularization. J. Machine Learn. Res. 18(1):960–1011.Google Scholar
James G, Witten D, Hastie T, Tibshirani R (2013) An Introduction to Statistical Learning, vol. 112 (Springer).Crossref, Google Scholar
Jofré A, Thompson P (2019) On variance reduction for stochastic smooth convex optimization with multiplicative noise. Math. Programming 174:253–292.Crossref, Google Scholar
Kilmer ME, O’Leary DP (2001) Choosing regularization parameters in iterative methods for ill-posed problems. SIAM J. Matrix Anal. Appl. 22(4):1204–1221.Crossref, Google Scholar
Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Sci. 220(4598):671–680.Crossref, Google Scholar
Kramers HA (1940) Brownian motion in a field of force and the diffusion model of chemical reactions. Physica 7(4):284–304.Crossref, Google Scholar
Lee YT, Vempala SS (2018) Convergence rate of Riemannian Hamiltonian Monte Carlo and faster polytope volume computation. Proc. 50th Annual ACM SIGACT Sympos. Theory Comput. (ACM), 1115–1121.Google Scholar
Leimkuhler B, Matthews C, Stoltz G (2015) The computation of averages from equilibrium and nonequilibrium Langevin molecular dynamics. IMA J. Numerical Anal. 36(1):13–79.Google Scholar
Liu T, Chen Z, Zhou E, Zhao T (2018) Toward deeper understanding of nonconvex stochastic optimization with momentum using diffusion approximations. Preprint, submitted February 14, https://arxiv.org/abs/1802.05155.Google Scholar
Lu H, Freund RM, Mirrokni V (2018) Accelerating greedy coordinate descent methods. Internat. Conf. Machine Learn., 3257–3266.Google Scholar
Ma YA, Chen T, Fox E. (2015) A complete recipe for stochastic gradient MCMC. Cortes C, Lawrence N, Lee D, Sugiyama M, GarnettR, eds. Advances in Neural Information Processing Systems, vol. 28 (Curran Associates, Inc.), 2917–2925.Google Scholar
Mangoubi O, Smith A (2017) Rapid Mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. Preprint, submitted August 23, https://arxiv.org/abs/1708.07114.Google Scholar
Mangoubi O, Pillai NS, Smith A (2018) Does Hamiltonian Monte Carlo mix faster than a random walk on multimodal densities? Preprint, submitted August 9, https://arxiv.org/abs/1808.03230.Google Scholar
Mattingly JC, Stuart AM, Higham DJ (2002) Ergodicity for SDEs and approximations: Locally Lipschitz vector fields and degenerate noise. Stochastic Processes Their Appl. 101(2):185–232.Crossref, Google Scholar
Mei S, Bai Y, Montanari A (2018) The landscape of empirical risk for nonconvex losses. Ann. Statist. 46(6A):2747–2774.Crossref, Google Scholar
Neal R (2010) MCMC using Hamiltonian dynamics. Brooks S, Gelman A, Jones G, Meng X-L, eds. Handbook of Markov Chain Monte Carlo (Chapman & Hall / CRC press, Boca Raton, Florida), 113–162.Google Scholar
Nesterov YE (1983) A method for solving the convex programming problem with convergence rate o(1/k2). Dokl. Akad. Nauk SSSR. 269:543–547.Google Scholar
Nguyen T, Sanner S (2013) Algorithms for direct 0–1 loss optimization in binary classification. Internat. Conf. Machine Learn., 1085–1093.Google Scholar
O’Neill M, Wright SJ (2019) Behavior of accelerated gradient methods near critical points of nonconvex problems. Math. Programming 176:403–427.Crossref, Google Scholar
Patterson S, Teh YW (2013) Stochastic gradient Riemannian Langevin dynamics on the probability simplex. Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 26 (Curran Associates, Inc.), 3102–3110.Google Scholar
Pavliotis GA (2014) Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations, vol. 60 (Springer).Crossref, Google Scholar
Polyak BT (1987) Introduction to Optimization (Optimization Software, New York).Google Scholar
Raginsky M, Rakhlin A, Telgarsky M (2017) Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. Conf. Learn. Theory, 1674–1703.Google Scholar
Razavi SA, Ollila E, Koivunen V (2012) Robust greedy algorithms for compressed sensing. Proc. 20th Eur. Signal Processing Conf. (IEEE), 969–973.Google Scholar
Shi B, Du SS, Jordan MI, Su WJ (2018) Understanding the acceleration phenomenon via high-resolution differential equations. Preprint, submitted October 21, https://arxiv.org/abs/1810.08907.Google Scholar
Şimşekli U, Badeau R, Cemgil T, Richard G (2016) Stochastic quasi-Newton Langevin Monte Carlo. Internat. Conf. Machine Learn., 642–651.Google Scholar
Şimşekli U, Yildiz Ç, Nguyen TH, Cemgil T, Richard G (2018) Asynchronous stochastic quasi-Newton MCMC for non-convex optimization. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn., vol. 80 (PMLR), 4674–4683.Google Scholar
Su W, Boyd S, Candes E (2014) A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 27 (Curran Associates, Inc.), 2510–2518.Google Scholar
Sun J, Qu Q, Wright J (2016) Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Trans. Inform. Theory 63(2):853–884.Crossref, Google Scholar
Sutskever I, Martens J, Dahl G, Hinton G (2013) On the importance of initialization and momentum in deep learning. Dasgupta S, McAllester D, eds. Proc. 30th Internat. Conf. Machine Learn., vol. 28 (PMLR, Atlanta), 1139–1147.Google Scholar
Teh YW, Thiery AH, Vollmer SJ (2016) Consistency and fluctuations for stochastic gradient Langevin dynamics. J. Machine Learn. Res. 17(1):193–225.Google Scholar
Tzen B, Liang T, Raginsky M (2018) Local optimality and generalization guarantees for the Langevin algorithm via empirical metastability. Conf. Learn. Theory, 857–875.Google Scholar
Villani C (2008) Optimal Transport: Old and New, vol. 338 (Springer Science & Business Media, Springer-Verlag Berlin Heidelberg).Google Scholar
Wang D, Chen C, Xu J (2019) Differentially private empirical risk minimization with non-convex loss functions. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., vol 97 (PMLR), 6526–6535.Google Scholar
Wang X, Jiang Y, Huang M, Zhang H (2013) Robust variable selection with exponential squared loss. J. Amer. Statist. Assoc. 108(502):632–643.Crossref, Google Scholar
Welling M, Teh YW (2011) Bayesian learning via stochastic gradient Langevin dynamics. Proc. 28th Internat. Conf. Machine Learn. (Citeseer), 681–688.Google Scholar
Wibisono A (2018) Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. Bubeck S, Perchet V, Rigollet P, eds. Proc. 31st Annual Conf. Learn. Theory, vol. 75 (PMLR), 2093–3027.Google Scholar
Wilson AC, Recht B, Jordan MI (2016) A Lyapunov analysis of momentum methods in optimization. Preprint, submitted November 8, https://arxiv.org/abs/1611.02635.Google Scholar
Wu Y, Liu Y (2007) Robust truncated hinge loss support vector machines. J. Amer. Statist. Assoc. 102(479):974–983.Crossref, Google Scholar
Xu P, Chen J, Zou D, Gu Q (2018) Global convergence of Langevin dynamics based algorithms for nonconvex optimization. Bengio S, Wallach H, Larochelle H, Grauman K. Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inform. Processing Systems (Curran Associates, Inc.) 31:3122–3133.Google Scholar
Zhang Y, Liang P, Charikar M (2017a) A hitting time analysis of stochastic gradient Langevin dynamics. Kale S, Shamir O, eds. Proc. 2017 Conf. Learning Theory, vol. 65 (PMLR), 1980–2022.Google Scholar
Zhang H, Zhou Y, Liang Y, Chi Y (2017b) A nonconvex approach for phase retrieval: Reshaped Wirtinger flow and incremental algorithms. J. Machine Learn. Res. 18(1):5164–5198.Google Scholar

Volume 70, Issue 5

September-October 2022

Pages iii-vi, 2597-3033, C2-C3

Article Information

Supplemental Material

Metrics

Information

Received:August 21, 2019
Accepted:April 06, 2021
Published Online:October 22, 2021

Cite as

Xuefeng Gao, Mert Gürbüzbalaban, Lingjiong Zhu (2021) Global Convergence of Stochastic Gradient Hamiltonian Monte Carlo for Nonconvex Stochastic Optimization: Nonasymptotic Performance Bounds and Momentum-Based Acceleration. Operations Research 70(5):2931-2947.

https://doi.org/10.1287/opre.2021.2162

Keywords

Acknowledgments

The authors thank the area editor, the associate editor, and an anonymous referee for helpful comments and suggestions. They also thank Agostino Capponi, Xiuli Chao, Wenbin Chen, Jim Dai, Murat A. Erdogdu, Qi Feng, Fuqing Gao, Jianqiang Hu, Jin Ma, Sanjoy Mitter, Asuman Ozdaglar, Pablo Parrilo, Umut Şimşekli, and S. R. S. Varadhan for helpful discussions. The authors are listed in alphabetical order.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Global Convergence of Stochastic Gradient Hamiltonian Monte Carlo for Nonconvex Stochastic Optimization: Nonasymptotic Performance Bounds and Momentum-Based Acceleration

References

Volume 70, Issue 5

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News