Global Convergence of Stochastic Gradient Hamiltonian Monte Carlo for Nonconvex Stochastic Optimization: Nonasymptotic Performance Bounds and Momentum-Based Acceleration

Published Online:https://doi.org/10.1287/opre.2021.2162

References

  • Ahn S, Korattikara A, Welling M (2012) Bayesian posterior sampling via stochastic gradient Fisher scoring. Internat. Conf. Machine Learn., 1771–1778.Google Scholar
  • Allen-Zhu Z, Hazan E (2016) Variance reduction for faster non-convex optimization. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn., vol. 48 (PMLR, New York), 699–707.Google Scholar
  • Anil C, Lucas J, Grosse R (2019) Sorting out Lipschitz function approximation. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., vol. 97 (PMLR, New York), 291–301.Google Scholar
  • Belloni A, Liang T, Narayanan H, Rakhlin A (2015) Escaping the local minima via simulated annealing: Optimization of approximately convex functions. Conf. Learn. Theory, 240–265.Google Scholar
  • Bertsimas D, Tsitsiklis J (1993) Simulated annealing. Statist. Sci. 8(1):10–15.CrossrefGoogle Scholar
  • Betancourt M (2017) A conceptual introduction to Hamiltonian Monte Carlo. Preprint, submitted January 10, https://arxiv.org/abs/1701.02434.Google Scholar
  • Betancourt M, Byrne S, Girolami M (2014) Optimizing the integrator step size for Hamiltonian Monte Carlo. Preprint, submitted November 24, https://arxiv.org/abs/1411.6669.Google Scholar
  • Betancourt M, Byrne S, Livingstone S, Girolami M (2017) The geometric foundations of Hamiltonian Monte Carlo. Bernoulli 23(4A):2257–2298.CrossrefGoogle Scholar
  • Bolley F, Villani C (2005) Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. Annales de la Faculté des Sciences de Toulouse: Mathématiques 14(3):331–352.CrossrefGoogle Scholar
  • Borkar VS, Mitter SK (1999) A strong approximation theorem for stochastic recursive algorithms. J. Optim. Theory Appl. 100(3):499–513.CrossrefGoogle Scholar
  • Bovier A, Gayrard V, Klein M (2005) Metastability in reversible diffusion processes II: Precise asymptotics for small eigenvalues. J. Eur. Math. Soc. 7(1):69–99.CrossrefGoogle Scholar
  • Carmon Y, Duchi J, Hinder O, Sidford A (2018) Accelerated methods for nonconvex optimization. SIAM J. Optim. 28(2):1751–1772.CrossrefGoogle Scholar
  • Chapelle O, Do CB, Teo CH, Le QV, Smola AJ (2009) Tighter bounds for structured estimation. Koller D, Schuurmans D, Bengio Y, Bottou L, eds. Advances in Neural Information Processing Systems, vol. 21 (Curran Associates, Inc.), 281–288.Google Scholar
  • Chatterji NS, Flammarion N, Ma YA, Bartlett PL, Jordan MI (2018) On the theory of variance reduction for stochastic gradient Monte Carlo. Internat. Conf. Machine Learn., 764–773.Google Scholar
  • Chaudhari P, Choromanska A, Soatto S, LeCun Y, Baldassi C, Borgs C, Chayes J, Sagun L, Zecchina R (2017) Entropy-SGD: Biasing gradient descent into wide valleys. Internat. Conf. Learn. Representations.Google Scholar
  • Chen C, Ding N, Carin L (2015) On the convergence of stochastic gradient MCMC algorithms with high-order integrators. Proc. 28th Internat. Conf. Neural Inform. Processing Systems, vol. 2, 2278–2286.Google Scholar
  • Chen T, Fox E, Guestrin C (2014) Stochastic gradient Hamiltonian Monte Carlo. Xing EP, Jebara T, eds. Proc. 31st Internat. Conf. Machine Learn., vol. 32 (PMLR, Beijing), 1683–1691.Google Scholar
  • Chen C, Carlson D, Gan Z, Li C, Carin L (2016) Bridging the gap between stochastic gradient MCMC and stochastic optimization. Proc. 19th Internat. Conf. Artificial Intelligence Statist., 1051–1060.Google Scholar
  • Cheng X, Chatterji NS, Bartlett PL, Jordan MI (2018a) Underdamped Langevin MCMC: A non-asymptotic analysis. Proc. 31st Annual Conf. Learn. Theory.Google Scholar
  • Cheng X, Chatterji NS, Abbasi-Yadkori Y, Bartlett PL, Jordan MI (2018b) Sharp convergence rates for Langevin dynamics in the nonconvex setting. Preprint, submitted May 4, https://arxiv.org/abs/1805.01648.Google Scholar
  • Chiang TS, Hwang CR, Sheu SJ (1987) Diffusion for global optimization in Rn. SIAM J. Control Optim. 25(3):737–753.CrossrefGoogle Scholar
  • Collobert R, Sinz F, Weston J, Bottou L (2006) Trading convexity for scalability. Proc. 23rd Internat. Conf. Machine Learn., 201–208.Google Scholar
  • Dalalyan AS (2017) Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. Roy. Statist. Soc. Ser. B 79(3):651–676.CrossrefGoogle Scholar
  • Dalalyan AS, Karagulyan AG (2019) User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes Their Appl. 129(12):5278–5311.CrossrefGoogle Scholar
  • Dalalyan AS, Riou-Durand L (2020) On sampling from a log-concave density using kinetic Langevin diffusions. Bernoulli 26(3):1956–1988.CrossrefGoogle Scholar
  • Dalalyan AS, Karagulyan A, Riou-Durand L (2019) Bounding the error of discretized Langevin algorithms for non-strongly log-concave targets. Preprint, submitted June 20, https://arxiv.org/abs/1906.08530.Google Scholar
  • Ding N, Fang Y, Babbush R, Chen C, Skeel RD, Neven H (2014) Bayesian sampling using stochastic gradient thermostats. Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 27 (Curran Associates, Inc.), 3203–3211.Google Scholar
  • Du SS, Lee JD, Tian Y, Singh A, Poczos B (2018) Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. Internat. Conf. Machine Learn., 1339–1348.Google Scholar
  • Duane S, Kennedy AD, Pendleton BJ, Roweth D (1987) Hybrid Monte Carlo. Phys. Lett. B. 195(2):216–222.CrossrefGoogle Scholar
  • Eberle A, Guillin A, Zimmer R (2019) Couplings and quantitative contraction rates for Langevin dynamics. Ann. Probab. 47(4):1982–2010.CrossrefGoogle Scholar
  • Foster DJ, Sekhari A, Sridharan K (2018) Uniform convergence of gradients for non-convex learning and optimization. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inform. Processing Systems (Curran Associates, Inc.) 31:8745–8756.Google Scholar
  • Ge R, Lee JD, Ma T (2018) Learning one-hidden-layer neural networks with landscape design. Internat. Conf. Learn. Representations.Google Scholar
  • Gelfand SB, Mitter SK (1991) Recursive stochastic algorithms for global optimization in Rd. SIAM J. Control Optim. 29(5):999–1018.CrossrefGoogle Scholar
  • Ghadimi S, Lan G (2016) Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Programming 156(1):59–99.CrossrefGoogle Scholar
  • Gidas B (1985) Nonstationary Markov chains and convergence of the annealing algorithm. J. Statist. Phys. 39(1–2):73–131.CrossrefGoogle Scholar
  • Hajek B (1985) A tutorial survey of theory and applications of simulated annealing. 24th IEEE Conf. Decision Control (IEEE), 755–760.CrossrefGoogle Scholar
  • Hale J (1988) Asymptotic Behavior of Dissipative Systems, vol. 25 (American Mathematical Society).Google Scholar
  • Hardt M, Recht B, Singer Y (2016) Train faster, generalize better: stability of stochastic gradient descent. Internat. Conf. Machine Learn., 1225–1234.Google Scholar
  • Hazan E, Levy KY, Shalev-Shwartz S (2016) On graduated optimization for stochastic non-convex problems. Internat. Conf. Machine Learn., 1833–1841.Google Scholar
  • Hérau F, Nier F (2004) Isotropic hypoellipticity and trend to equilibrium for the Fokker-Planck equation with a high-degree potential. Arch. Rational Mechanics Anal. 171(2):151–218.CrossrefGoogle Scholar
  • Hoerl AE, Kennard RW (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1):55–67.CrossrefGoogle Scholar
  • Holley RA, Kusuoka S, Stroock DW (1989) Asymptotics of the spectral gap with applications to the theory of simulated annealing. J. Functional Anal. 83(2):333–347.CrossrefGoogle Scholar
  • Hu Y, Li C, Meng K, Qin J, Yang X (2017) Group sparse optimization via ℓp,q regularization. J. Machine Learn. Res. 18(1):960–1011.Google Scholar
  • James G, Witten D, Hastie T, Tibshirani R (2013) An Introduction to Statistical Learning, vol. 112 (Springer).CrossrefGoogle Scholar
  • Jofré A, Thompson P (2019) On variance reduction for stochastic smooth convex optimization with multiplicative noise. Math. Programming 174:253–292.CrossrefGoogle Scholar
  • Kilmer ME, O’Leary DP (2001) Choosing regularization parameters in iterative methods for ill-posed problems. SIAM J. Matrix Anal. Appl. 22(4):1204–1221.CrossrefGoogle Scholar
  • Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Sci. 220(4598):671–680.CrossrefGoogle Scholar
  • Kramers HA (1940) Brownian motion in a field of force and the diffusion model of chemical reactions. Physica 7(4):284–304.CrossrefGoogle Scholar
  • Lee YT, Vempala SS (2018) Convergence rate of Riemannian Hamiltonian Monte Carlo and faster polytope volume computation. Proc. 50th Annual ACM SIGACT Sympos. Theory Comput. (ACM), 1115–1121.Google Scholar
  • Leimkuhler B, Matthews C, Stoltz G (2015) The computation of averages from equilibrium and nonequilibrium Langevin molecular dynamics. IMA J. Numerical Anal. 36(1):13–79.Google Scholar
  • Liu T, Chen Z, Zhou E, Zhao T (2018) Toward deeper understanding of nonconvex stochastic optimization with momentum using diffusion approximations. Preprint, submitted February 14, https://arxiv.org/abs/1802.05155.Google Scholar
  • Lu H, Freund RM, Mirrokni V (2018) Accelerating greedy coordinate descent methods. Internat. Conf. Machine Learn., 3257–3266.Google Scholar
  • Ma YA, Chen T, Fox E. (2015) A complete recipe for stochastic gradient MCMC. Cortes C, Lawrence N, Lee D, Sugiyama M, GarnettR, eds. Advances in Neural Information Processing Systems, vol. 28 (Curran Associates, Inc.), 2917–2925.Google Scholar
  • Mangoubi O, Smith A (2017) Rapid Mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. Preprint, submitted August 23, https://arxiv.org/abs/1708.07114.Google Scholar
  • Mangoubi O, Pillai NS, Smith A (2018) Does Hamiltonian Monte Carlo mix faster than a random walk on multimodal densities? Preprint, submitted August 9, https://arxiv.org/abs/1808.03230.Google Scholar
  • Mattingly JC, Stuart AM, Higham DJ (2002) Ergodicity for SDEs and approximations: Locally Lipschitz vector fields and degenerate noise. Stochastic Processes Their Appl. 101(2):185–232.CrossrefGoogle Scholar
  • Mei S, Bai Y, Montanari A (2018) The landscape of empirical risk for nonconvex losses. Ann. Statist. 46(6A):2747–2774.CrossrefGoogle Scholar
  • Neal R (2010) MCMC using Hamiltonian dynamics. Brooks S, Gelman A, Jones G, Meng X-L, eds. Handbook of Markov Chain Monte Carlo (Chapman & Hall / CRC press, Boca Raton, Florida), 113–162.Google Scholar
  • Nesterov YE (1983) A method for solving the convex programming problem with convergence rate o(1/k2). Dokl. Akad. Nauk SSSR. 269:543–547.Google Scholar
  • Nguyen T, Sanner S (2013) Algorithms for direct 0–1 loss optimization in binary classification. Internat. Conf. Machine Learn., 1085–1093.Google Scholar
  • O’Neill M, Wright SJ (2019) Behavior of accelerated gradient methods near critical points of nonconvex problems. Math. Programming 176:403–427.CrossrefGoogle Scholar
  • Patterson S, Teh YW (2013) Stochastic gradient Riemannian Langevin dynamics on the probability simplex. Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 26 (Curran Associates, Inc.), 3102–3110.Google Scholar
  • Pavliotis GA (2014) Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations, vol. 60 (Springer).CrossrefGoogle Scholar
  • Polyak BT (1987) Introduction to Optimization (Optimization Software, New York).Google Scholar
  • Raginsky M, Rakhlin A, Telgarsky M (2017) Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. Conf. Learn. Theory, 1674–1703.Google Scholar
  • Razavi SA, Ollila E, Koivunen V (2012) Robust greedy algorithms for compressed sensing. Proc. 20th Eur. Signal Processing Conf. (IEEE), 969–973.Google Scholar
  • Shi B, Du SS, Jordan MI, Su WJ (2018) Understanding the acceleration phenomenon via high-resolution differential equations. Preprint, submitted October 21, https://arxiv.org/abs/1810.08907.Google Scholar
  • Şimşekli U, Badeau R, Cemgil T, Richard G (2016) Stochastic quasi-Newton Langevin Monte Carlo. Internat. Conf. Machine Learn., 642–651.Google Scholar
  • Şimşekli U, Yildiz Ç, Nguyen TH, Cemgil T, Richard G (2018) Asynchronous stochastic quasi-Newton MCMC for non-convex optimization. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn., vol. 80 (PMLR), 4674–4683.Google Scholar
  • Su W, Boyd S, Candes E (2014) A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 27 (Curran Associates, Inc.), 2510–2518.Google Scholar
  • Sun J, Qu Q, Wright J (2016) Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Trans. Inform. Theory 63(2):853–884.CrossrefGoogle Scholar
  • Sutskever I, Martens J, Dahl G, Hinton G (2013) On the importance of initialization and momentum in deep learning. Dasgupta S, McAllester D, eds. Proc. 30th Internat. Conf. Machine Learn., vol. 28 (PMLR, Atlanta), 1139–1147.Google Scholar
  • Teh YW, Thiery AH, Vollmer SJ (2016) Consistency and fluctuations for stochastic gradient Langevin dynamics. J. Machine Learn. Res. 17(1):193–225.Google Scholar
  • Tzen B, Liang T, Raginsky M (2018) Local optimality and generalization guarantees for the Langevin algorithm via empirical metastability. Conf. Learn. Theory, 857–875.Google Scholar
  • Villani C (2008) Optimal Transport: Old and New, vol. 338 (Springer Science & Business Media, Springer-Verlag Berlin Heidelberg).Google Scholar
  • Wang D, Chen C, Xu J (2019) Differentially private empirical risk minimization with non-convex loss functions. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., vol 97 (PMLR), 6526–6535.Google Scholar
  • Wang X, Jiang Y, Huang M, Zhang H (2013) Robust variable selection with exponential squared loss. J. Amer. Statist. Assoc. 108(502):632–643.CrossrefGoogle Scholar
  • Welling M, Teh YW (2011) Bayesian learning via stochastic gradient Langevin dynamics. Proc. 28th Internat. Conf. Machine Learn. (Citeseer), 681–688.Google Scholar
  • Wibisono A (2018) Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. Bubeck S, Perchet V, Rigollet P, eds. Proc. 31st Annual Conf. Learn. Theory, vol. 75 (PMLR), 2093–3027.Google Scholar
  • Wilson AC, Recht B, Jordan MI (2016) A Lyapunov analysis of momentum methods in optimization. Preprint, submitted November 8, https://arxiv.org/abs/1611.02635.Google Scholar
  • Wu Y, Liu Y (2007) Robust truncated hinge loss support vector machines. J. Amer. Statist. Assoc. 102(479):974–983.CrossrefGoogle Scholar
  • Xu P, Chen J, Zou D, Gu Q (2018) Global convergence of Langevin dynamics based algorithms for nonconvex optimization. Bengio S, Wallach H, Larochelle H, Grauman K. Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inform. Processing Systems (Curran Associates, Inc.) 31:3122–3133.Google Scholar
  • Zhang Y, Liang P, Charikar M (2017a) A hitting time analysis of stochastic gradient Langevin dynamics. Kale S, Shamir O, eds. Proc. 2017 Conf. Learning Theory, vol. 65 (PMLR), 1980–2022.Google Scholar
  • Zhang H, Zhou Y, Liang Y, Chi Y (2017b) A nonconvex approach for phase retrieval: Reshaped Wirtinger flow and incremental algorithms. J. Machine Learn. Res. 18(1):5164–5198.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.