Global Convergence of Stochastic Gradient Hamiltonian Monte Carlo for Nonconvex Stochastic Optimization: Nonasymptotic Performance Bounds and Momentum-Based Acceleration
References
- (2012) Bayesian posterior sampling via stochastic gradient Fisher scoring. Internat. Conf. Machine Learn., 1771–1778.Google Scholar
- (2016) Variance reduction for faster non-convex optimization. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn., vol. 48 (PMLR, New York), 699–707.Google Scholar
- (2019) Sorting out Lipschitz function approximation. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., vol. 97 (PMLR, New York), 291–301.Google Scholar
- (2015) Escaping the local minima via simulated annealing: Optimization of approximately convex functions. Conf. Learn. Theory, 240–265.Google Scholar
- (1993) Simulated annealing. Statist. Sci. 8(1):10–15.Crossref, Google Scholar
- (2017) A conceptual introduction to Hamiltonian Monte Carlo. Preprint, submitted January 10, https://arxiv.org/abs/1701.02434.Google Scholar
- (2014) Optimizing the integrator step size for Hamiltonian Monte Carlo. Preprint, submitted November 24, https://arxiv.org/abs/1411.6669.Google Scholar
- (2017) The geometric foundations of Hamiltonian Monte Carlo. Bernoulli 23(4A):2257–2298.Crossref, Google Scholar
- (2005) Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. Annales de la Faculté des Sciences de Toulouse: Mathématiques 14(3):331–352.Crossref, Google Scholar
- (1999) A strong approximation theorem for stochastic recursive algorithms. J. Optim. Theory Appl. 100(3):499–513.Crossref, Google Scholar
- (2005) Metastability in reversible diffusion processes II: Precise asymptotics for small eigenvalues. J. Eur. Math. Soc. 7(1):69–99.Crossref, Google Scholar
- (2018) Accelerated methods for nonconvex optimization. SIAM J. Optim. 28(2):1751–1772.Crossref, Google Scholar
- (2009) Tighter bounds for structured estimation. Koller D, Schuurmans D, Bengio Y, Bottou L, eds. Advances in Neural Information Processing Systems, vol. 21 (Curran Associates, Inc.), 281–288.Google Scholar
- (2018) On the theory of variance reduction for stochastic gradient Monte Carlo. Internat. Conf. Machine Learn., 764–773.Google Scholar
- (2017) Entropy-SGD: Biasing gradient descent into wide valleys. Internat. Conf. Learn. Representations.Google Scholar
- (2015) On the convergence of stochastic gradient MCMC algorithms with high-order integrators. Proc. 28th Internat. Conf. Neural Inform. Processing Systems, vol. 2, 2278–2286.Google Scholar
- (2014) Stochastic gradient Hamiltonian Monte Carlo. Xing EP, Jebara T, eds. Proc. 31st Internat. Conf. Machine Learn., vol. 32 (PMLR, Beijing), 1683–1691.Google Scholar
- (2016) Bridging the gap between stochastic gradient MCMC and stochastic optimization. Proc. 19th Internat. Conf. Artificial Intelligence Statist., 1051–1060.Google Scholar
- (2018a) Underdamped Langevin MCMC: A non-asymptotic analysis. Proc. 31st Annual Conf. Learn. Theory.Google Scholar
- (2018b) Sharp convergence rates for Langevin dynamics in the nonconvex setting. Preprint, submitted May 4, https://arxiv.org/abs/1805.01648.Google Scholar
- (1987) Diffusion for global optimization in Rn. SIAM J. Control Optim. 25(3):737–753.Crossref, Google Scholar
- (2006) Trading convexity for scalability. Proc. 23rd Internat. Conf. Machine Learn., 201–208.Google Scholar
- (2017) Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. Roy. Statist. Soc. Ser. B 79(3):651–676.Crossref, Google Scholar
- (2019) User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes Their Appl. 129(12):5278–5311.Crossref, Google Scholar
- (2020) On sampling from a log-concave density using kinetic Langevin diffusions. Bernoulli 26(3):1956–1988.Crossref, Google Scholar
- (2019) Bounding the error of discretized Langevin algorithms for non-strongly log-concave targets. Preprint, submitted June 20, https://arxiv.org/abs/1906.08530.Google Scholar
- (2014) Bayesian sampling using stochastic gradient thermostats. Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 27 (Curran Associates, Inc.), 3203–3211.Google Scholar
- (2018) Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. Internat. Conf. Machine Learn., 1339–1348.Google Scholar
- (1987) Hybrid Monte Carlo. Phys. Lett. B. 195(2):216–222.Crossref, Google Scholar
- (2019) Couplings and quantitative contraction rates for Langevin dynamics. Ann. Probab. 47(4):1982–2010.Crossref, Google Scholar
- (2018) Uniform convergence of gradients for non-convex learning and optimization. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inform. Processing Systems (Curran Associates, Inc.) 31:8745–8756.Google Scholar
- (2018) Learning one-hidden-layer neural networks with landscape design. Internat. Conf. Learn. Representations.Google Scholar
- (1991) Recursive stochastic algorithms for global optimization in Rd. SIAM J. Control Optim. 29(5):999–1018.Crossref, Google Scholar
- (2016) Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Programming 156(1):59–99.Crossref, Google Scholar
- (1985) Nonstationary Markov chains and convergence of the annealing algorithm. J. Statist. Phys. 39(1–2):73–131.Crossref, Google Scholar
- (1985) A tutorial survey of theory and applications of simulated annealing. 24th IEEE Conf. Decision Control (IEEE), 755–760.Crossref, Google Scholar
- (1988) Asymptotic Behavior of Dissipative Systems, vol. 25 (American Mathematical Society).Google Scholar
- (2016) Train faster, generalize better: stability of stochastic gradient descent. Internat. Conf. Machine Learn., 1225–1234.Google Scholar
- (2016) On graduated optimization for stochastic non-convex problems. Internat. Conf. Machine Learn., 1833–1841.Google Scholar
- (2004) Isotropic hypoellipticity and trend to equilibrium for the Fokker-Planck equation with a high-degree potential. Arch. Rational Mechanics Anal. 171(2):151–218.Crossref, Google Scholar
- (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1):55–67.Crossref, Google Scholar
- (1989) Asymptotics of the spectral gap with applications to the theory of simulated annealing. J. Functional Anal. 83(2):333–347.Crossref, Google Scholar
- (2017) Group sparse optimization via ℓp,q regularization. J. Machine Learn. Res. 18(1):960–1011.Google Scholar
- (2013) An Introduction to Statistical Learning, vol. 112 (Springer).Crossref, Google Scholar
- (2019) On variance reduction for stochastic smooth convex optimization with multiplicative noise. Math. Programming 174:253–292.Crossref, Google Scholar
- (2001) Choosing regularization parameters in iterative methods for ill-posed problems. SIAM J. Matrix Anal. Appl. 22(4):1204–1221.Crossref, Google Scholar
- (1983) Optimization by simulated annealing. Sci. 220(4598):671–680.Crossref, Google Scholar
- (1940) Brownian motion in a field of force and the diffusion model of chemical reactions. Physica 7(4):284–304.Crossref, Google Scholar
- (2018) Convergence rate of Riemannian Hamiltonian Monte Carlo and faster polytope volume computation. Proc. 50th Annual ACM SIGACT Sympos. Theory Comput. (ACM), 1115–1121.Google Scholar
- (2015) The computation of averages from equilibrium and nonequilibrium Langevin molecular dynamics. IMA J. Numerical Anal. 36(1):13–79.Google Scholar
- (2018) Toward deeper understanding of nonconvex stochastic optimization with momentum using diffusion approximations. Preprint, submitted February 14, https://arxiv.org/abs/1802.05155.Google Scholar
- (2018) Accelerating greedy coordinate descent methods. Internat. Conf. Machine Learn., 3257–3266.Google Scholar
- (2015) A complete recipe for stochastic gradient MCMC. Cortes C, Lawrence N, Lee D, Sugiyama M, GarnettR, eds. Advances in Neural Information Processing Systems, vol. 28 (Curran Associates, Inc.), 2917–2925.Google Scholar
- (2017) Rapid Mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. Preprint, submitted August 23, https://arxiv.org/abs/1708.07114.Google Scholar
- (2018) Does Hamiltonian Monte Carlo mix faster than a random walk on multimodal densities? Preprint, submitted August 9, https://arxiv.org/abs/1808.03230.Google Scholar
- (2002) Ergodicity for SDEs and approximations: Locally Lipschitz vector fields and degenerate noise. Stochastic Processes Their Appl. 101(2):185–232.Crossref, Google Scholar
- (2018) The landscape of empirical risk for nonconvex losses. Ann. Statist. 46(6A):2747–2774.Crossref, Google Scholar
- (2010) MCMC using Hamiltonian dynamics. Brooks S, Gelman A, Jones G, Meng X-L, eds. Handbook of Markov Chain Monte Carlo (Chapman & Hall / CRC press, Boca Raton, Florida), 113–162.Google Scholar
- (1983) A method for solving the convex programming problem with convergence rate o(1/k2). Dokl. Akad. Nauk SSSR. 269:543–547.Google Scholar
- (2013) Algorithms for direct 0–1 loss optimization in binary classification. Internat. Conf. Machine Learn., 1085–1093.Google Scholar
- (2019) Behavior of accelerated gradient methods near critical points of nonconvex problems. Math. Programming 176:403–427.Crossref, Google Scholar
- (2013) Stochastic gradient Riemannian Langevin dynamics on the probability simplex. Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 26 (Curran Associates, Inc.), 3102–3110.Google Scholar
- (2014) Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations, vol. 60 (Springer).Crossref, Google Scholar
- (1987) Introduction to Optimization (Optimization Software, New York).Google Scholar
- (2017) Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. Conf. Learn. Theory, 1674–1703.Google Scholar
- (2012) Robust greedy algorithms for compressed sensing. Proc. 20th Eur. Signal Processing Conf. (IEEE), 969–973.Google Scholar
- (2018) Understanding the acceleration phenomenon via high-resolution differential equations. Preprint, submitted October 21, https://arxiv.org/abs/1810.08907.Google Scholar
- (2016) Stochastic quasi-Newton Langevin Monte Carlo. Internat. Conf. Machine Learn., 642–651.Google Scholar
- (2018) Asynchronous stochastic quasi-Newton MCMC for non-convex optimization. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn., vol. 80 (PMLR), 4674–4683.Google Scholar
- (2014) A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger KQ, eds. Advances in Neural Information Processing Systems, vol. 27 (Curran Associates, Inc.), 2510–2518.Google Scholar
- (2016) Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Trans. Inform. Theory 63(2):853–884.Crossref, Google Scholar
- (2013) On the importance of initialization and momentum in deep learning. Dasgupta S, McAllester D, eds. Proc. 30th Internat. Conf. Machine Learn., vol. 28 (PMLR, Atlanta), 1139–1147.Google Scholar
- (2016) Consistency and fluctuations for stochastic gradient Langevin dynamics. J. Machine Learn. Res. 17(1):193–225.Google Scholar
- (2018) Local optimality and generalization guarantees for the Langevin algorithm via empirical metastability. Conf. Learn. Theory, 857–875.Google Scholar
- (2008) Optimal Transport: Old and New, vol. 338 (Springer Science & Business Media, Springer-Verlag Berlin Heidelberg).Google Scholar
- (2019) Differentially private empirical risk minimization with non-convex loss functions. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., vol 97 (PMLR), 6526–6535.Google Scholar
- (2013) Robust variable selection with exponential squared loss. J. Amer. Statist. Assoc. 108(502):632–643.Crossref, Google Scholar
- (2011) Bayesian learning via stochastic gradient Langevin dynamics. Proc. 28th Internat. Conf. Machine Learn. (Citeseer), 681–688.Google Scholar
- (2018) Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. Bubeck S, Perchet V, Rigollet P, eds. Proc. 31st Annual Conf. Learn. Theory, vol. 75 (PMLR), 2093–3027.Google Scholar
- (2016) A Lyapunov analysis of momentum methods in optimization. Preprint, submitted November 8, https://arxiv.org/abs/1611.02635.Google Scholar
- (2007) Robust truncated hinge loss support vector machines. J. Amer. Statist. Assoc. 102(479):974–983.Crossref, Google Scholar
- (2018) Global convergence of Langevin dynamics based algorithms for nonconvex optimization. Bengio S, Wallach H, Larochelle H, Grauman K. Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inform. Processing Systems (Curran Associates, Inc.) 31:3122–3133.Google Scholar
- (2017a) A hitting time analysis of stochastic gradient Langevin dynamics. Kale S, Shamir O, eds. Proc. 2017 Conf. Learning Theory, vol. 65 (PMLR), 1980–2022.Google Scholar
- (2017b) A nonconvex approach for phase retrieval: Reshaped Wirtinger flow and incremental algorithms. J. Machine Learn. Res. 18(1):5164–5198.Google Scholar

