Graphical Convergence of Subgradients in Nonconvex Optimization and Learning

Damek Davis
Damek Davis
[email protected]
https://orcid.org/0000-0003-2105-4641
School of Operations Research and Information Engineering, Cornell University, Ithaca, New York 14850
Search for more papers by this author
,
Dmitriy Drusvyatskiy
Dmitriy Drusvyatskiy
[email protected]
Department of Mathematics, University of Washington, Seattle, Washington 98195
Search for more papers by this author

School of Operations Research and Information Engineering, Cornell University, Ithaca, New York 14850

Search for more papers by this author

Dmitriy Drusvyatskiy

[email protected]

Department of Mathematics, University of Washington, Seattle, Washington 98195

Search for more papers by this author

Published Online:20 Apr 2021https://doi.org/10.1287/moor.2021.1126

References

[1] Albano P, Cannarsa P (1999) Singularities of semiconcave functions in Banach spaces.McEneaney WM, Yin GG, Zhang Q, eds. Stochastic analysis, control, optimization and application. Systems & Control: Foundations & Applications (Birkhäuser, Boston), 171–190.Crossref, Google Scholar
[2] Attouch H (1977) Convergence de fonctions convexes, des sous-différentiels et semi-groupes associés. C. R. Acad. Sci. Paris Sér. A-B. 284(10):A539–A542.Google Scholar
[3] Attouch H (1984) Variational Convergence for Functions and Operators. Applicable Mathematics Series (Pitman Advanced Publishing Program, Boston, MA).Google Scholar
[4] Attouch H, Wets RJB (1990) Epigraphical processes: laws of large numbers for random LSC functions. Sém. Anal. Convexe 20. Exp. 13:29.Google Scholar
[5] Bartlett PL, Bousquet O, Mendelson S (2005) Local Rademacher complexities. Ann. Statist. 33(4):1497–1537.Crossref, Google Scholar
[6] Bartlett PL, Mendelson S (2002) Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res. 3(Nov):463–482.Google Scholar
[7] Boucheron S, Lugosi G, Massart P (2013) Concentration Inequalities: A Nonasymptotic Theory of Independence (Oxford University Press, Oxford, UK).Crossref, Google Scholar
[8] Bousquet O, Elisseeff A (2002) Stability and generalization. J. Mach. Learn. Res. 2(3):499–526.Google Scholar
[9] Clarke F (1983) Optimization and Nonsmooth Analysis (Wiley Interscience, New York).Google Scholar
[10] Davis D, Drusvyatskiy D (2018) Uniform graphical convergence of subgradients in nonconvex optimization and learning. Preprint, submitted December 17, https://arxiv.org/pdf/1810.07590.pdf.Google Scholar
[11] Davis D, Drusvyatskiy D (2019) Stochastic model-based minimization of weakly convex functions. SIAM J. Optim. 29(1):207–239.Crossref, Google Scholar
[12] Davis D, Drusvyatskiy D, Kakade S, Lee J (2018) Stochastic subgradient method converges on tame functions. Preprint, submitted April 20, https://arxiv.org/abs/1804.07795.Google Scholar
[13] Davis D, Drusvyatskiy D, MacPhee K (2018) Stochastic model-based minimization under high-order growth. Preprint, submitted July 1, https://arxiv.org/abs/1807.00255.Google Scholar
[14] Davis D, Drusvyatskiy D, MacPhee KJ, Paquette C (2018) Subgradient methods for sharp weakly convex functions. J. Optim. Theory Appl. 179(3):962–982.Google Scholar
[15] Davis D, Drusvyatskiy D, Paquette C (2017) The nonsmooth landscape of phase retrieval. Preprint, submitted November 9, https://arxiv.org/abs/1711.03247.Google Scholar
[16] Davis D, Grimmer B (2019) Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. SIAM J. Optim. 29(3):1908–1930.Crossref, Google Scholar
[17] Drusvyatskiy D (2018) The proximal point method revisited. SIAG/OPT Views News. 26(1):1–8.Google Scholar
[18] Drusvyatskiy D, Paquette C (2019) Efficiency of minimizing compositions of convex functions and smooth maps. Math. Program. 178(1-2):503–558.Crossref, Google Scholar
[19] Duchi JC, Ruan F (2018) Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval. Inform. Inference 8(3):471–529.Crossref, Google Scholar
[20] Duchi JC, Ruan F (2018) Stochastic methods for composite and weakly convex optimization problems. SIAM J. Optim. 28(4):3229–3259.Crossref, Google Scholar
[21] Foster D, Sekhari A, Sridharan K (2018) Uniform convergence of gradients for non-convex learning and optimization. Preprint, submitted October 25, https://arxiv.org/abs/1810.11059.Google Scholar
[22] Geyer CJ (1994) On the asymptotics of constrained M-estimation. Ann. Statist. 22(4):1993–2010.Google Scholar
[23] Ghadimi S, Lan G, Zhang H (2016) Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Programming 155(1-2, Ser. A):267–305.Google Scholar
[24] Grünwald PD, Mehta NA (2016) Fast rates for general unbounded loss functions: from ERM to generalized Bayes. Preprint, submitted May 1, https://arxiv.org/abs/1605.00252.Google Scholar
[25] Hardt M, Recht B, Singer Y (2015) Train faster, generalize better: Stability of stochastic gradient descent. Preprint, submitted September 3, https://arxiv.org/abs/1509.01240.Google Scholar
[26] Kakade SM, Sridharan K, Tewari A (2009) On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. Adv. Neural Inform. Process. Syst. 21:793–800.Google Scholar
[27] Kaniovski YM, King AJ, Wets RJB (1995) Probabilistic bounds (via large deviations) for the solutions of stochastic programming problems. Ann. Oper. Res. 56(1):189–208.Google Scholar
[28] King AJ, Rockafellar RT (1993) Asymptotic theory for solutions in statistical estimation and stochastic programming. Math. Oper. Res. 18(1):148–162.Google Scholar
[29] Koller D, Schuurmans D, Bengio Y, Bottou L, eds. (2009) Fast Rates for Regularized Objectives, vol. 21, Advances in Neural Information Processing Systems (Curran Associates, Inc., Red Hook, NY), 1545–1552.Google Scholar
[30] Kontorovich A (2014) Concentration in unbounded metric spaces and algorithmic stability. Internat. Conf. Mach. Learn. 32(2):28–36.Google Scholar
[31] Li X, Zhihui Z, So AC, Vidal R (2018) Nonconvex robust low-rank matrix recovery. Preprint, submitted September 24, https://arxiv.org/abs/1809.09237.Google Scholar
[32] Liu M, Zhang X, Zhang L, Jin R, Yang T (2018) Fast rates of erm and stochastic approximation: Adaptive to error bound conditions. Preprint, submitted May 11, https://arxiv.org/abs/1805.04577.Google Scholar
[33] Maurer A (2016) A vector-contraction inequality for Rademacher complexities. Ortner R, Simon HU, Zilles S, eds. Algorithmic Learning Theory (Springer International Publishing, Cham, Switzerland), 3–17.Crossref, Google Scholar
[34] McDiarmid C (1989) On the method of bounded differences. Surveys in Combinatorics, 1989 (Norwich, 1989), London Mathematical Society Lecture Note Series, vol. 141 (Cambridge University Press, Cambridge), 148–188.Google Scholar
[35] Mehta NA (2016) Fast rates with high probability in exp-concave statistical learning. Preprint, submitted May 4, https://arxiv.org/abs/1605.01288.Google Scholar
[36] Mehta NA, Williamson RC (2014) From stochastic mixability to fast rates. Proc. 27th Internat. Conf. Neural Inform. Processing Sys. (MIT Press, Cambridge, MA).Google Scholar
[37] Mei S, Bai Y, Montanari A (2018) The landscape of empirical risk for nonconvex losses. Ann. Statist. 46(6A):2747–2774.Google Scholar
[38] Mordukhovich B (2006) Variational Analysis and Generalized Differentiation I: Basic Theory, vol. 330 (Grundlehren der mathematischen Wissenschaften, Springer, Berlin).Crossref, Google Scholar
[39] Mordukhovich BS (2018) Variational Analysis and Applications. Springer Monographs in Mathematics (Springer, Cham, Switzerland).Crossref, Google Scholar
[40] Moreau JJ (1965) Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France 93:273–299.Google Scholar
[41] Nemirovski A, Juditsky A, Lan G, Shapiro A (2008) Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4):1574–1609.Google Scholar
[42] Nurminskii EA (1973) The quasigradient method for the solving of the nonlinear programming problems. Cybernetics 9(1):145–150.Google Scholar
[43] Poliquin R, Rockafellar R (1996) Prox-regular functions in variational analysis. Trans. Amer. Math. Soc. 348:1805–1838.Crossref, Google Scholar
[44] Rachev ST, Römisch W (2002) Quantitative stability in stochastic programming: The method of probability metrics. Math. Oper. Res. 27(4):792–818.Google Scholar
[45] Rakhlin A, Mukherjee S, Poggio T (2005) Stability results in learning theory. Anal. Appl. 3(4):397–417.Google Scholar
[46] Ralph D, Xu H (2011) Convergence of stationary points of sample average two-stage stochastic programs: a generalized equation approach. Math. Oper. Res. 36(3):568–592.Google Scholar
[47] Robinson SM (1996) Analysis of sample-path optimization. Math. Oper. Res. 21(3):513–528.Google Scholar
[48] Rockafellar RT (1970) Convex Analysis. Princeton Mathematical Series, No. 28 (Princeton University Press, Princeton, NJ).Crossref, Google Scholar
[49] Rockafellar R (1982) Favorable classes of Lipschitz-continuous functions in subgradient optimization. Progress in Nondifferentiable Optimization, vol. 8, IIASA Collaborative Proc. Ser. CP-82 (International Institute for Applied Systems Analysis, Laxenburg, Austria), 125–143.Google Scholar
[50] Rockafellar R, Wets RB (1998) Variational Analysis, vol. 317 (Grundlehren der mathematischen Wissenschaften, Springer, Berlin).Crossref, Google Scholar
[51] Rolewicz S (1979) On paraconvex multifunctions. Third Symposium on Operations Research (Univ. Mannheim, Mannheim, 1978), section I, vol. 31, Oper. Res. Verfahren (Hain, Königstein/Ts.), 539–546.Google Scholar
[52] Römisch W, Wets R (2007) Stability of ɛ-approximate solutions to convex stochastic programs. SIAM J. Optim. 18(3):961–979.Google Scholar
[53] Shalev-Shwartz S, Ben-David S (2014) Understanding Machine Learning: From Theory to Algorithms, 1st ed. (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
[54] Shalev-Shwartz S, Shamir O, Srebro N, Sridharan K (2009) Stochastic convex optimization. Proc. Conf. Learn. Theory (COLT).Google Scholar
[55] Shapiro A (2000) On the asymptotics of constrained local M-estimators. Ann. Statist. 28(3):948–960.Google Scholar
[56] Shapiro A (2000) Stochastic Programming by Monte Carlo Simulation Methods (Humboldt-Universität zu Berlin, Mathematisch-Naturwissenschaftliche Fakultät II, Institut für Mathematik, Berlin).Google Scholar
[57] Shapiro A, Homem-de Mello T (2000) On the rate of convergence of optimal solutions of Monte Carlo approximations of stochastic programs. SIAM J. Optim. 11(1):70–86.Google Scholar
[58] Shapiro A, Xu H (2007) Uniform laws of large numbers for set-valued mappings and subdifferentials of random functions. J. Math. Anal. Appl. 325(2):1390–1399.Google Scholar
[59] Shapiro A, Dentcheva D, Ruszczyński A (2014) Lectures on stochastic programming, vol. 9, MOS-SIAM Series on Optimization (Society for Industrial and Applied Mathematics (SIAM), Mathematical Optimization Society, Philadelphia, PA).Google Scholar
[60] Toulis P, Airoldi E (2017) Asymptotic and finite-sample properties of estimators based on stochastic gradients. Ann. Statist. 45(4):1694–1727.Google Scholar
[61] van Erven T, Grünwald PD, Mehta NA, Reid MD, Williamson RC (2015) Fast rates in statistical and online learning. J. Mach. Learn. Res. 16:1793–1861.Google Scholar
[62] Vershynin R (2010) Introduction to the non-asymptotic analysis of random matrices. Preprint, submitted November 12, https://arxiv.org/abs/1011.3027.Google Scholar
[63] Vershynin R (2018) High-Dimensional Probability: An Introduction with Applications in Data Science, vol. 47 (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
[64] Xu H (2010) Uniform exponential convergence of sample average random functions under general sampling with applications in stochastic programming. J. Math. Anal. Appl. 368(2):692–710.Google Scholar
[65] Xu H, Zhang D (2009) Smooth sample average approximation of stationary points in nonsmooth stochastic optimization and applications. Math. Programming 119(2):371–401.Google Scholar
[66] Zemel RS, Culotta A (2010) Smoothness, low noise and fast rates. Lafferty JD,Williams CKI, Shawe-Taylor J, eds. vol. 23, Advances in Neural Information Processing Systems (Curran Associates, Inc., Red Hook, NY), 2199–2207.Google Scholar
[67] Zhang S, He N (2018) On the convergence rate of stochastic mirror descent for nonsmooth nonconvex optimization. Preprint, submitted June 12, https://arxiv.org/abs/1806.04781.Google Scholar
[68] Zinkevich M (2003) Online convex programming and generalized infinitesimal gradient ascent. Proc. 20th Internat. Conf. Mach. Learn. (AAAI Press), 928–935.Google Scholar

cover image Mathematics of Operations Research

Volume 47, Issue 1

February 2022

Pages 1-846, C2

Article Information

Metrics

Information

Received:November 18, 2018
Accepted:September 26, 2020
Published Online:April 20, 2021

Cite as

Damek Davis, Dmitriy Drusvyatskiy (2021) Graphical Convergence of Subgradients in Nonconvex Optimization and Learning. Mathematics of Operations Research 47(1):209-231.

https://doi.org/10.1287/moor.2021.1126

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Graphical Convergence of Subgradients in Nonconvex Optimization and Learning

References

Volume 47, Issue 1

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News