Convergence Analysis of Accelerated Stochastic Gradient Descent Under the Growth Condition

Published Online:https://doi.org/10.1287/moor.2021.0293

References

  • [1] Assran M, Rabbat M (2020) On the convergence of Nesterov’s accelerated gradient method in stochastic settings. Proc. Machine Learning Res. 119:410–420.Google Scholar
  • [2] Aybat NS, Fallah A, Gurbuzbalaban M, Ozdaglar A (2019) A universally optimal multistage accelerated stochastic gradient method. Adv. Neural Inform. Processing Systems 32.Google Scholar
  • [3] Aybat NS, Fallah A, Gürbüzbalaban M, Ozdaglar A (2020) Robust accelerated gradient methods for smooth strongly convex functions. SIAM J. Optim. 30(1):717–751.CrossrefGoogle Scholar
  • [4] Bauschke HG, Borwein JM (1997) Legendre functions and the method of random Bregman projections. J. Convex Anal. 4(1):27–67.Google Scholar
  • [5] Bottou L, Curtis FE, Nocedal J (2018) Optimization methods for large-scale machine learning. SIAM Rev. 60(2):223–311.CrossrefGoogle Scholar
  • [6] Bubeck S (2015) Convex optimization: Algorithms and complexity. Foundations Trends Machine Learning 8(3–4):231–357.CrossrefGoogle Scholar
  • [7] Cevher V, Vũ BC (2018) On the linear convergence of the stochastic gradient method with constant step-size. Optim. Lett. 13(5):1177–1187.CrossrefGoogle Scholar
  • [8] Chen X, Lin Q, Pena J (2012) Optimal regularized dual averaging methods for stochastic optimization. Adv. Neural Inform. Processing Systems 25.Google Scholar
  • [9] Cohen M, Diakonikolas J, Orecchia L (2018) On acceleration with noise-corrupted gradients. Proc. Machine Learning Res. 80:1019–1028.Google Scholar
  • [10] Cyrus S, Hu B, Scoy BV, Lessard L (2018) A robust accelerated optimization algorithm for strongly convex functions. 2018 Annu. Amer. Control Conf. ACC (IEEE, Piscataway, NJ), 1376–1381.Google Scholar
  • [11] Diakonikolas J, Orecchia L (2018) Accelerated extra-gradient descent: A novel accelerated first-order method. 9th Innovations Theor. Comput. Sci. Conf. ITCS 2018 (Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik GmbH, Wadern/Saarbruecken, Germany), 23:1–23:19.Google Scholar
  • [12] Diakonikolas J, Orecchia L (2019) The approximate duality gap technique: A unified theory of first-order methods. SIAM J. Optim. 29(1):660–689.CrossrefGoogle Scholar
  • [13] Ghadimi S, Lan G (2013) Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization. II: Shrinking procedures and optimal algorithms. SIAM J. Optim. 23(4):2061–2089.CrossrefGoogle Scholar
  • [14] Gower RM, Loizou N, Qian X, Sailanbayev A, Shulgin E, Richtárik P (2019) SGD: General analysis and improved rates. Proc. Machine Learning Res. 97:5200–5209.Google Scholar
  • [15] Grimmer B (2019) Convergence rates for deterministic and stochastic subgradient methods without Lipschitz continuity. SIAM J. Optim. 29(2):1350–1365.CrossrefGoogle Scholar
  • [16] Gürbüzbalaban M, Ozdaglar A, Parrilo P (2015) A globally convergent incremental Newton method. Math. Program. 151(1):283–313.CrossrefGoogle Scholar
  • [17] Hu B, Lessard L (2017) Dissipativity theory for Nesterov’s accelerated method. Proc. Machine Learning Res. 70:1549–1557.Google Scholar
  • [18] Hu B, Seiler P, Lessard L (2020) Analysis of biased stochastic gradient descent using sequential semidefinite programs. Math. Program. 187(1–2):383–408.CrossrefGoogle Scholar
  • [19] Hu B, Wright S, Lessard L (2018) Dissipativity theory for accelerating stochastic variance reduction: A unified analysis of SVRG and Katyusha using semidefinite programs. Proc. Machine Learning Res. 80:2038–2047.Google Scholar
  • [20] Jain P, Kakade SM, Kidambi R, Netrapalli P, Sidford A (2018) Accelerating stochastic gradient descent for least squares regression. Proc. Machine Learning Res. 75:545–604.Google Scholar
  • [21] Jain P, Kakade SM, Kidambi R, Netrapalli P, Sidford A (2018) Parallelizing stochastic gradient descent for least squares regression: Mini-batching, averaging, and model misspecification. J. Machine Learning Res. 18(223):1–42.Google Scholar
  • [22] Jofré A, Thompson P (2018) On variance reduction for stochastic smooth convex optimization with multiplicative noise. Math. Program. 174(1–2):253–292.CrossrefGoogle Scholar
  • [23] Kidambi R, Netrapalli P, Jain P, Kakade S (2018) On the insufficiency of existing momentum schemes for stochastic optimization. Inform. Theory Appl. Workshop ITA (IEEE, Piscataway, NJ), 1–9.Google Scholar
  • [24] Lessard L, Recht B, Packard A (2016) Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1):57–95.CrossrefGoogle Scholar
  • [25] Liu C, Belkin M (2020) Accelerating SGD with momentum for over-parameterized learning. Internat. Conf. Learning Representations. Preprint, submitted October 31, https://arxiv.org/abs/1810.13395.Google Scholar
  • [26] Mourtada J (2022) Exact minimax risk for linear least squares, and the lower tail of sample covariance matrices. Ann. Statist. 50(4):2157–2178.CrossrefGoogle Scholar
  • [27] Nemirovskij AS, Yudin DB (1983) Problem Complexity and Method Efficiency in Optimization (Wiley-Interscience, Chichester, UK).Google Scholar
  • [28] Nesterov YE (1983) A method for solving the convex programming problem with convergence rate o(1/k2). Dokl. Akad. Nauk SSSR 269:543–547.Google Scholar
  • [29] Nesterov Y (2004) Introductory Lectures on Convex Optimization, Applied Optimization, vol. 87 (Springer US, New York).Google Scholar
  • [30] Nguyen LM, Nguyen PH, Richtárik P, Scheinberg K, Takác M, van Dijk M (2019) New convergence aspects of stochastic gradient algorithms. J. Machine Learning Res. 20:176:1–176:49.Google Scholar
  • [31] Polyak B (1964) Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5):1–17.CrossrefGoogle Scholar
  • [32] Rakhlin A, Shamir O, Sridharan K (2012) Making gradient descent optimal for strongly convex stochastic optimization. Proc. 29th Internat. Conf. Machine Learning, 1571–1578.Google Scholar
  • [33] Schmidt M, Roux NL (2013) Fast convergence of stochastic gradient descent under a strong growth condition. Preprint, submitted August 29, https://arxiv.org/abs/1308.6370.Google Scholar
  • [34] Scoy BV, Freeman RA, Lynch KM (2018) The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Systems Lett. 2(1):49–54.CrossrefGoogle Scholar
  • [35] Solodov M (1998) Incremental gradient algorithms with stepsizes bounded away from zero. Comput. Optim. Appl. 11(1):23–35.CrossrefGoogle Scholar
  • [36] Stich SU (2019) Unified optimal analysis of the (stochastic) gradient method. Preprint, submitted July 9, https://arxiv.org/abs/1907.04232.Google Scholar
  • [37] Tseng P (1998) An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM J. Optim. 8(2):506–531.CrossrefGoogle Scholar
  • [38] Vaswani S, Bach FR, Schmidt M (2019) Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. Proc. Machine Learning Res. 89:1195–1204.Google Scholar
  • [39] Xiao L (2010) Dual averaging methods for regularized stochastic learning and online optimization. J. Machine Learning Res. 11:2543–2596.Google Scholar
  • [40] Xu P, Wang T, Gu Q (2018) Continuous and discrete-time accelerated stochastic mirror descent for strongly convex functions. Proc. Machine Learning Res. 80:5492–5501.Google Scholar
  • [41] Zhou X (2018) On the Fenchel duality between strong convexity and Lipschitz continuous gradient. Preprint, submitted March 17, https://arxiv.org/abs/1803.06573.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.