Variance-Reduced Accelerated First-Order Methods: Central Limit Theorems and Confidence Statements

Jinlong Lei
Jinlong Lei
[email protected]
https://orcid.org/0000-0001-5426-0797
Department of Control Science and Engineering, Tongji University, Shanghai 201804, China; and Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai 201804, China
Search for more papers by this author
,
Uday V. Shanbhag
Corresponding Author
Uday V. Shanbhag
[email protected]
https://orcid.org/0000-0001-5869-9561
Department of Industrial and Manufacturing Engineering, Pennsylvania State University, University Park, Pennsylvania 16802
Search for more papers by this author

Department of Control Science and Engineering, Tongji University, Shanghai 201804, China; and Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai 201804, China

Search for more papers by this author

Uday V. Shanbhag

Corresponding Author

Uday V. Shanbhag

[email protected]

https://orcid.org/0000-0001-5869-9561

Department of Industrial and Manufacturing Engineering, Pennsylvania State University, University Park, Pennsylvania 16802

Search for more papers by this author

Published Online:4 Jun 2024https://doi.org/10.1287/moor.2021.0068

References

[1] Beck A (2017) First-Order Methods in Optimization (SIAM, Philadelphia).Crossref, Google Scholar
[2] Bertsekas D, Tsitsiklis J (1989) Parallel and Distributed Computation: Numerical Methods, International ed. (Prentice-Hall International, Hoboken, NJ).Google Scholar
[3] Bickel P, Choi D, Chang X, Zhang H (2013) Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. Ann. Statist. 41(4):1922–1943.Crossref, Google Scholar
[4] Birge JR, Louveaux F (1997) Introduction to Stochastic Programming, Springer Series in Operations Research (Springer, Berlin, Heidelberg).Google Scholar
[5] Blum JR (1954) Multidimensional stochastic approximation methods. Ann. Math. Statist. 25(4):737–744.Crossref, Google Scholar
[6] Bodnar T, Mazur S, Podgórski K (2016) Singular inverse Wishart distribution and its application to portfolio theory. J. Multivariate Anal. 143(1):314–326.Crossref, Google Scholar
[7] Borkar VS (2009) Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48 (Springer, Cham, Switzerland).Google Scholar
[8] Bottou L (1998) On-line learning and stochastic approximations. Saad D, ed. On-Line Learning in Neural Networks (Cambridge University Press, Cambridge, UK), 9–42.Google Scholar
[9] Byrd RH, Chin GM, Nocedal J, Wu Y (2012) Sample size selection in optimization methods for machine learning. Math. Programming 134(1):127–155.Crossref, Google Scholar
[10] Chee J, Toulis P (2018) Convergence diagnostics for stochastic gradient descent with constant learning rate. 21st Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 1476–1485.Google Scholar
[11] Chen HF (2006) Stochastic Approximation and Its Applications, vol. 64 (Springer Science & Business Media, New York).Google Scholar
[12] Chen X, Lee JD, Tong XT, Zhang Y (2020) Statistical inference for model parameters in stochastic gradient descent. Ann. Statist. 48(1):251–273.Crossref, Google Scholar
[13] DasGupta A (2008) Asymptotic Theory of Statistics and Probability (Springer Science & Business Media, New York).Google Scholar
[14] Defazio A, Bach F, Lacoste-Julien S (2014) Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger KQ, eds. Proc. 27th Internat. Conf. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1646–1654.Google Scholar
[15] DeGroot MH, Schervish MJ (2012) Probability and Statistics (Pearson Education, London).Google Scholar
[16] Deng G, Ferris MC (2009) Variable-number sample-path optimization. Math. Programming 117(1–2):81–109.Crossref, Google Scholar
[17] Dieuleveut A, Durmus A, Bach F (2020) Bridging the gap between constant step size stochastic gradient descent and Markov chains. Ann. Statist. 48(3):1348–1382.Crossref, Google Scholar
[18] Fabian V (1968) On asymptotic normality in stochastic approximation. Ann. Math. Statist. 39(4):1327–1332.Crossref, Google Scholar
[19] Friedlander MP, Schmidt M (2012) Hybrid deterministic-stochastic methods for data fitting. SIAM J. Sci. Comput. 34(3):A1380–A1405.Crossref, Google Scholar
[20] Gadat S, Panloup F, Saadane S (2018) Stochastic heavy ball. Electronic J. Statist. 12(1):461–529.Crossref, Google Scholar
[21] Ghadimi S, Lan G (2016) Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Programming 156(1):59–99.Crossref, Google Scholar
[22] Ghadimi E, Feyzmahdavian HR, Johansson M (2015) Global convergence of the heavy-ball method for convex optimization. 2015 Eur. Control Conf. (ECC) (IEEE, Piscataway, NJ), 310–315.Google Scholar
[23] Goujaud B, Taylor A, Dieuleveut A (2023) Provable non-accelerations of the heavy-ball method. Preprint, submitted July 21, https://arxiv.org/abs/2307.11291.Google Scholar
[24] Hall P, Heyde CC (2014) Martingale Limit Theory and Its Application (Academic Press, Cambridge, MA).Google Scholar
[25] Hasminskii RZ, Silver B (1973) Stochastic Approximation and Recursive Estimation, vol. 47 (American Mathematical Society, Providence, RI).Google Scholar
[26] Homem-de-Mello T (2003) Variable-sample methods for stochastic optimization. ACM Trans. Model. Comput. Simulation 13(2):108–133.Crossref, Google Scholar
[27] Hsieh MH, Glynn PW (2002) Recent advances in simulation optimization: Confidence regions for stochastic approximation algorithms. Proc. 34th Conf. Winter Simulation Exploring New Frontiers (Winter Simulation Conference, San Diego), 370–376.Google Scholar
[28] Jalilzadeh A, Shanbhag U, Blanchet J, Glynn PW (2022) Smoothed variable sample-size accelerated proximal methods for nonsmooth stochastic convex programs. Stochastic Systems 12(4):373–410.Link, Google Scholar
[29] Jofré A, Thompson P (2019) On variance reduction for stochastic smooth convex optimization with multiplicative noise. Math. Programming 174(1–2):253–292.Crossref, Google Scholar
[30] Johnson R, Zhang T (2013) Accelerating stochastic gradient descent using predictive variance reduction. Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, eds. Proc. 26th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 315–323.Google Scholar
[31] Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Pereira F, Burges CJC, Bottou L, Weinberger KQ, eds. Proc. 25th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 1097–1105.Google Scholar
[32] Kushner H, Yin GG (2003) Stochastic Approximation and Recursive Algorithms and Applications, vol. 35 (Springer Science & Business Media, New York).Google Scholar
[33] Lamm M, Lu S, Budhiraja A (2017) Individual confidence intervals for solutions to expected value formulations of stochastic variational inequalities. Math. Programming 165(1):151–196.Crossref, Google Scholar
[34] Lei J, Shanbhag UV (2022) Asynchronous variance-reduced block schemes for composite non-convex stochastic optimization: Block-specific steplengths and adapted batch-sizes. Optim. Methods Software 37(1):264–294.Crossref, Google Scholar
[35] Lei J, Shanbhag UV (2022) Distributed variable sample-size gradient-response and best-response schemes for stochastic Nash equilibrium problems. SIAM J. Optim. 32(2):573–603.Crossref, Google Scholar
[36] Lin PE (1972) Some characterizations of the multivariate t distribution. J. Multivariate Anal. 2(3):339–344.Crossref, Google Scholar
[37] Lu S (2014) A new method to build confidence regions for solutions of stochastic variational inequalities. Optimization 63(9):1431–1443.Crossref, Google Scholar
[38] Mandt S, Hoffman MD, Blei DM (2017) Stochastic gradient descent as approximate Bayesian inference. J. Machine Learn. Res. 18(1):4873–4907.Google Scholar
[39] Meyn SP, Tweedie RL (2012) Markov Chains and Stochastic Stability (Springer Science & Business Media, New York).Google Scholar
[40] Nemirovskiĭ A, Yudin D (1983) Problem Complexity and Method Efficiency in Optimization (Wiley, New York).Google Scholar
[41] Nemirovski A, Juditsky A, Lan G, Shapiro A (2009) Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4):1574–1609.Crossref, Google Scholar
[42] Nesterov YE (1983) A method for solving the convex programming problem with convergence rate O(1/k2). Doklady Akademii nauk SSSR 269:543–547.Google Scholar
[43] Nesterov Y (2013) Introductory Lectures on Convex Optimization: A Basic Course, vol. 87 (Springer Science & Business Media, New York).Google Scholar
[44] Nocedal J, Wright SJ (1999) Numerical Optimization, Springer Series in Operations Research (Springer-Verlag, New York).Crossref, Google Scholar
[45] Pasupathy R (2010) On choosing parameters in retrospective-approximation algorithms for stochastic root finding and simulation optimization. Oper. Res. 58(4 Part 1):889–901.Link, Google Scholar
[46] Pasupathy R, Kim S (2011) The stochastic root-finding problem: Overview, solutions, and open questions. ACM Trans. Model. Comput. Simulation 21(3):19.Crossref, Google Scholar
[47] Pasupathy R, Glynn PW, Ghosh S, Hashemi FS (2018) On sampling rates in simulation-based recursions. SIAM J. Optim. 28(1):45–73.Crossref, Google Scholar
[48] Polyak BT (1964) Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5):1–17.Crossref, Google Scholar
[49] Polyak BT (1987) Introduction to Optimization (Optimization Software Publications Division, New York).Google Scholar
[50] Polyak BT, Juditsky AB (1992) Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30(4):838–855.Crossref, Google Scholar
[51] Robbins H, Monro S (1951) A stochastic approximation method. Ann. Math. Statist. 22(3):400–407.Crossref, Google Scholar
[52] Robbins H, Siegmund D (1971) A convergence theorem for non negative almost supermartingales and some applications. Rustagi JS, ed. Optimizing Methods in Statistics (Elsevier, Amsterdam), 233–257.Google Scholar
[53] Schmidt M, Roux NL, Bach FR Convergence rates of inexact proximal-gradient methods for convex optimization. Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, Weinberger KQ, eds. Proc. 24th Internat. Conf. Neural Inform. Processing Systems (Curran Associates Inc., Red Hook, NY), 1458–1466.Google Scholar
[54] Shanbhag UV, Blanchet JH (2015) Budget-constrained stochastic approximation. Proc. 2015 Winter Simulation Conf. (IEEE Press, Piscataway, NJ), 368–379.Google Scholar
[55] Shapiro A, Dentcheva D, Ruszczyński A (2014) Lectures on Stochastic Programming: Modeling and Theory, 2nd ed. (SIAM, Philadelphia).Crossref, Google Scholar
[56] Silvester JR (2000) Determinants of block matrices. Math. Gazette 84(501):460–467.Crossref, Google Scholar
[57] Spall JC (2000) Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Trans. Automatic Control 45(10):1839–1853.Crossref, Google Scholar
[58] Su WJ, Zhu Y (2018) Uncertainty quantification for online learning and stochastic approximation via hierarchical incremental gradient descent. Preprint, submitted February 13, https://arxiv.org/abs/1802.04876.Google Scholar
[59] Sutskever I, Martens J, Dahl G, Hinton G (2013) On the importance of initialization and momentum in deep learning. Dasgupta S, ed. 30th Internat. Conf. Machine Learn. (ICML, San Diego), 1139–1147.Google Scholar
[60] Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. Proc. 2015 IEEE Conf. Comput. Vision Pattern Recognition (CVPR) (IEEE Press, Piscataway, NJ), 1–9.Google Scholar
[61] Toulis P, Airoldi E, Rennie J (2014) Statistical analysis of stochastic gradient methods for generalized linear models. Xing EP, ed. 31th Internat. Conf. Machine Learn. (ICML, San Diego), 667–675.Google Scholar
[62] Van der Vaart AW (2000) Asymptotic Statistics, vol. 3 (Cambridge University Press, Cambridge, UK).Google Scholar
[63] Venter J (1967) An extension of the Robbins-Monro procedure. Ann. Math. Statist. 38(1):181–190.Crossref, Google Scholar
[64] Wei C (1987) Multivariate adaptive stochastic approximation. Ann. Statist. 15(3):1115–1130.Crossref, Google Scholar
[65] Yang T, Lin Q, Li Z (2016) Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. Preprint, submitted April 12, https://arxiv.org/abs/1604.03257.Google Scholar
[66] Yousefian F, Nedic A, Shanbhag UV (2012) On stochastic gradient and subgradient methods with adaptive steplength sequences. Automatica J. IFAC 48(1):56–67.Crossref, Google Scholar
[67] Zhu Y, Dong J (2021) On constructing confidence region for model parameters in stochastic gradient descent via batch means. Proc. 2021 Winter Simulation Conf. (IEEE Press, Piscataway, NJ).Google Scholar

cover image Mathematics of Operations Research

Volume 50, Issue 2

May 2025

Pages iii, 783-1583

Article Information

Supplemental Material

Metrics

Information

Received:March 17, 2021
Accepted:September 10, 2023
Published Online:June 04, 2024

Cite as

Jinlong Lei; , Uday V. Shanbhag (2024) Variance-Reduced Accelerated First-Order Methods: Central Limit Theorems and Confidence Statements. Mathematics of Operations Research 50(2):1364-1397.

https://doi.org/10.1287/moor.2021.0068

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Variance-Reduced Accelerated First-Order Methods: Central Limit Theorems and Confidence Statements

References

Volume 50, Issue 2

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News