Towards Optimal Problem Dependent Generalization Error Bounds in Statistical Learning Theory

Yunbei Xu
Corresponding Author
Yunbei Xu
[email protected]
https://orcid.org/0000-0003-4530-764X
Decision, Risk, and Operations Division, Graduate School of Business, Columbia University, New York, New York 10027
Search for more papers by this author
,
Assaf Zeevi
Assaf Zeevi
[email protected]
https://orcid.org/0000-0003-1075-6664
Decision, Risk, and Operations Division, Graduate School of Business, Columbia University, New York, New York 10027
Search for more papers by this author

Yunbei Xu

Corresponding Author

Yunbei Xu

[email protected]

https://orcid.org/0000-0003-4530-764X

Decision, Risk, and Operations Division, Graduate School of Business, Columbia University, New York, New York 10027

Search for more papers by this author

Assaf Zeevi

[email protected]

https://orcid.org/0000-0003-1075-6664

Decision, Risk, and Operations Division, Graduate School of Business, Columbia University, New York, New York 10027

Search for more papers by this author

Published Online:19 Jan 2024https://doi.org/10.1287/moor.2021.0076

References

[1] Athey S, Wager S (2017) Efficient policy learning. Preprint, submitted February 9, https://arxiv.org/abs/1702.02896.Google Scholar
[2] Balakrishnan S, Wainwright MJ, Yu B (2017) Statistical guarantees for the EM algorithm: From population to sample-based analysis. Ann. Statist. 45(1):77–120.Crossref, Google Scholar
[3] Bartlett PL, Long PM (2020) Failures of model-dependent generalization bounds for least-norm interpolation. Preprint, submitted October 16, https://arxiv.org/abs/2010.08479.Google Scholar
[4] Bartlett PL, Bousquet O, Mendelson S (2005) Local Rademacher complexities. Ann. Statist. 33(4):1497–1537.Crossref, Google Scholar
[5] Bartlett PL, Foster DJ, Telgarsky MJ (2017) Spectrally-normalized margin bounds for neural networks. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 6240–6249.Google Scholar
[6] Blanchet J, Kang Y, Murthy K (2019) Robust Wasserstein profile inference and applications to machine learning. J. Appl. Probab. 56(3):830–857.Crossref, Google Scholar
[7] Chapelle O, Do CB, Teo CH, Le QV, Smola AJ (2009) Tighter bounds for structured estimation. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 281–288.Google Scholar
[8] Davis D, Drusvyatskiy D (2018) Graphical convergence of subgradients in nonconvex optimization and learning. Preprint, submitted October 17, https://arxiv.org/abs/1810.07590.Google Scholar
[9] Devroye L, Lugosi G (1995) Lower bounds in pattern recognition and learning. Pattern Recognition 28(7):1011–1018.Crossref, Google Scholar
[10] Foster DJ, Syrgkanis V (2019) Orthogonal statistical learning. Preprint, submitted January 25, https://arxiv.org/abs/1901.09036.Google Scholar
[11] Foster DJ, Sekhari A, Sridharan K (2018) Uniform convergence of gradients for non-convex learning and optimization. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 8745–8756.Google Scholar
[12] Ghadimi S, Lan G (2013) Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4):2341–2368.Crossref, Google Scholar
[13] Giné E, Koltchinskii V (2006) Concentration inequalities and asymptotic results for ratio type empirical processes. Ann. Probab. 34(3):1143–1216.Crossref, Google Scholar
[14] Golowich N, Rakhlin A, Shamir O (2018) Size-independent sample complexity of neural networks. Conf. Learn. Theory (PMLR, New York), 297–299.Google Scholar
[15] Györfi L, Kohler M, Krzyzak A, Walk H (2006) A Distribution-Free Theory of Nonparametric Regression (Springer Science & Business Media, New York).Google Scholar
[16] Hájek J (1972) Local asymptotic minimax and admissibility in estimation. Proc. Sixth Berkeley Sympos. Math. Statist. Probab., vol. 1 (University of California Press, California), 175–194.Google Scholar
[17] Hardt M, Ma T (2016) Identity matters in deep learning. Preprint, submitted November 14, https://arxiv.org/abs/1611.04231.Google Scholar
[18] Hardt M, Ma T, Recht B (2018) Gradient descent learns linear dynamical systems. J. Machine Learn. Res. 19(1):1025–1068.Google Scholar
[19] Hinder O, Sidford A, Sohoni N (2020) Near-optimal methods for minimizing star-convex functions and beyond. Conf. Learn. Theory (PMLR, New York), 1894–1938.Google Scholar
[20] Ho N, Khamaru K, Dwivedi R, Wainwright MJ, Jordan MI, Yu B (2020) Instability, computational efficiency and statistical accuracy. Preprint, submitted May 22, https://arxiv.org/abs/2005.11411.Google Scholar
[21] Kakade SM, Tewari A (2009) On the generalization ability of online strongly convex programming algorithms. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 801–808.Google Scholar
[22] Karimi H, Nutini J, Schmidt M (2016) Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. Joint Eur. Conf. Machine Learn. Knowledge Discovery Databases (Springer, Berlin, Heidelberg), 795–811.Google Scholar
[23] Kleinberg R, Li Y, Yuan Y (2018) An alternative view: When does SGD escape local minima? Preprint, submitted February 17, https://arxiv.org/abs/1802.06175.Google Scholar
[24] Koltchinskii V (2011) Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: Ecole d’Eté de Probabilités de Saint-Flour XXXVIII-2008, vol. 2033 (Springer Science & Business Media, New York).Crossref, Google Scholar
[25] Koltchinskii V, Panchenko D (2000) Rademacher processes and bounding the risk of function learning. High Dimensional Probability II (Springer, Berlin, Heidelberg), 443–457.Crossref, Google Scholar
[26] Kwon J, Qian W, Caramanis C, Chen Y, Davis D (2018) Global convergence of EM algorithm for mixtures of two component linear regression. Preprint, submitted October 12, https://arxiv.org/abs/1810.05752.Google Scholar
[27] Laska JN, Baraniuk RG (2012) Regime change: Bit-depth vs. measurement-rate in compressive sensing. IEEE Trans. Signal Processing 60(7):3496–3505.Crossref, Google Scholar
[28] Le Cam L (1972) Limits of experiments. Proc. Sixth Berkeley Sympos. Math. Statist. Probab., vol. 1: Theory Statist. (The Regents of the University of California, California).Google Scholar
[29] Lei Y, Ying Y (2021) Sharper generalization bounds for learning with gradient-dominated objective functions. Internat. Conf. Learning Representations (ICLR, Appleton, WI).Google Scholar
[30] Li Y, Yuan Y (2017) Convergence analysis of two-layer neural networks with ReLU activation. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 597–607.Google Scholar
[31] Li X, Ling S, Strohmer T, Wei K (2019) Rapid, robust, and reliable blind deconvolution via nonconvex optimization. Appl. Comput. Harmonic Anal. 47(3):893–934.Crossref, Google Scholar
[32] Liang T, Rakhlin A, Sridharan K (2015) Learning with square loss: Localization through offset Rademacher complexity. Conf. Learn. Theory (PMLR, New York), 1260–1285.Google Scholar
[33] Litvak AE, Pajor A, Rudelson M, Tomczak-Jaegermann N (2005) Smallest singular value of random matrices and geometry of random polytopes. Adv. Math. 195(2):491–523.Crossref, Google Scholar
[34] Liu H, Su W, So AM-C (2016) Quadratic optimization with orthogonality constraints: Explicit Lojasiewicz exponent and linear convergence of line-search methods. Internat. Conf. Machine Learn. (PMLR, New York), 1158–1167.Google Scholar
[35] Liu M, Zhang X, Zhang L, Jin R, Yang T (2018) Fast rates of ERM and stochastic approximation: Adaptive to error bound conditions. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 4678–4689.Google Scholar
[36] Lugosi G (2002) Pattern classification and learning theory. Principles of Nonparametric Learning (Springer, Berlin, Heidelberg), 1–56.Crossref, Google Scholar
[37] Massart P, Nédélec É (2006) Risk bounds for statistical learning. Ann. Statist. 34(5):2326–2366.Crossref, Google Scholar
[38] Maurer A, Pontil M (2009) Empirical Bernstein bounds and sample variance penalization. Preprint, submitted July 21, https://arxiv.org/abs/0907.3740.Google Scholar
[39] McLachlan GJ, Krishnan T (2007) The EM Algorithm and Extensions, vol. 382 (John Wiley & Sons, Hoboken, NJ).Google Scholar
[40] Mei S, Bai Y, Montanari A (2018) The landscape of empirical risk for nonconvex losses. Ann. Statist. 46(6A):2747–2774.Crossref, Google Scholar
[41] Mendelson S (2014) Learning without concentration. Conf. Learn. Theory (PMLR, New York), 25–39.Google Scholar
[42] Mendelson S (2017) On aggregation for heavy-tailed classes. Probab. Theory Related Fields 168(3–4):641–674.Crossref, Google Scholar
[43] Mendelson S (2018) Learning without concentration for general loss functions. Probab. Theory Related Fields 171(1–2):459–502.Crossref, Google Scholar
[44] Mendelson S, Zhivotovskiy N (2018) Robust covariance estimation under l4−l2 norm equivalence. Preprint, submitted September 27, https://arxiv.org/abs/1809.10462.Google Scholar
[45] Nagarajan V, Zico Kolter J (2019) Uniform convergence may be unable to explain generalization in deep learning. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 11611–11622.Google Scholar
[46] Namkoong H, Duchi JC (2017) Variance-based regularization with convex objectives. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 2971–2980.Google Scholar
[47] Nguyen T, Sanner S (2013) Algorithms for direct 0–1 loss optimization in binary classification. Internat. Conf. Machine Learn. (ACM, New York), 1085–1093.Google Scholar
[48] Pollard D (2012) Convergence of Stochastic Processes (Springer Science & Business Media, New York).Google Scholar
[49] Reddi SJ, Hefny A, Sra S, Póczos B, Smola A (2016) Stochastic variance reduction for nonconvex optimization. Internat. Conf. Machine Learn. (PMLR, New York), 314–323.Google Scholar
[50] Rudin W (1964) Principles of Mathematical Analysis, vol. 3 (McGraw-Hill, New York).Google Scholar
[51] Shalev-Shwartz S (2009) Stochastic convex optimization. Conf. Learn. Theory (PMLR, New York), 5–13.Google Scholar
[52] Shalev-Shwartz S, Shamir O, Srebro N, Sridharan K (2010) Learnability, stability and uniform convergence. J. Machine Learn. Res. 11:2635–2670.Google Scholar
[53] Shapiro A, Dentcheva D, Ruszczyński A (2009) Lectures on Stochastic Programming: Modeling and Theory (SIAM, Philadelphia).Crossref, Google Scholar
[54] Srebro N, Sridharan K (2010) Note on refined Dudley integral covering number bound. Technical report, https://ttic.uchicago.edu/∼karthik/dudley.pdf.Google Scholar
[55] Sun J, Qu Q, Wright J (2018) A geometric analysis of phase retrieval. Foundations Comput. Math. 18(5):1131–1198.Crossref, Google Scholar
[56] Swaminathan A, Joachims T (2015) Counterfactual risk minimization: Learning from logged bandit feedback. Internat. Conf. Machine Learn. (ACM, New York), 814–823.Google Scholar
[57] Tsybakov AB (2003) Optimal rates of aggregation. Learning Theory and Kernel Machines (Springer, Berlin, Heidelberg), 303–313.Crossref, Google Scholar
[58] Van der Vaart AW (1989) On the asymptotic information bound. Ann. Statist. 17(4):1487–1500.Crossref, Google Scholar
[59] Van der Vaart AW (2000) Asymptotic Statistics, vol. 3 (Cambridge University Press, Cambridge, UK).Google Scholar
[60] Van der Vaart AW, Wellner JA (1996) Weak Convergence and Empirical Processes (Springer, Berlin, Heidelberg).Crossref, Google Scholar
[61] Vershynin R (2018) High-Dimensional Probability: An Introduction with Applications in Data Science, vol. 47 (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
[62] Wainwright MJ (2019) High-Dimensional Statistics: A Non-Asymptotic Viewpoint, vol. 48 (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
[63] Wang Z, Gu Q, Ning Y, Liu H (2014) High dimensional expectation-maximization algorithm: Statistical optimization and asymptotic normality. Preprint, submitted December 30, https://arxiv.org/abs/1412.8729.Google Scholar
[64] Xu Y, Zeevi A (2020) Toward problem-dependent optimal learning rates. Adv. Neural Inform. Processing Systems, vol. 33 (The MIT Press, Cambridge, MA), 2196–2206.Google Scholar
[65] Zhang L, Zhou Z-H (2019) Stochastic approximation of smooth and strongly convex functions: Beyond the o(1/t) convergence rate. Preprint, submitted January 27, https://arxiv.org/abs/1901.09344.Google Scholar
[66] Zhang L, Yang T, Jin R (2017) Empirical risk minimization for stochastic convex optimization: o(1/n)-and o(1/n2)-type of risk bounds. Preprint, submitted February 7, https://arxiv.org/abs/1702.02030.Google Scholar
[67] Zhivotovskiy N (2017) Optimal learning via local entropies and sample compression. Preprint, submitted June 4, https://arxiv.org/abs/1706.01124.Google Scholar
[68] Zhivotovskiy N, Hanneke S (2016) Localization of VC classes: Beyond local Rademacher complexities. Internat. Conf. Algorithmic Learn. Theory (Springer, Berlin, Heidelberg), 18–33.Google Scholar
[69] Zhou L, Sutherland DJ, Srebro N (2020) On uniform convergence and low-norm interpolation learning. Preprint, submitted June 10, https://arxiv.org/abs/2006.05942.Google Scholar
[70] Zhou Y, Yang J, Zhang H, Liang Y, Tarokh V (2019) SGD converges to global minimum in deep learning via star-convex path. Preprint, submitted January 2, https://arxiv.org/abs/1901.00451.Google Scholar

cover image Mathematics of Operations Research

Volume 50, Issue 1

February 2025

Pages 1-781 C2

Article Information

Supplemental Material

Metrics

Information

Received:March 23, 2021
Accepted:October 09, 2023
Published Online:January 19, 2024

Cite as

Yunbei Xu, Assaf Zeevi (2024) Towards Optimal Problem Dependent Generalization Error Bounds in Statistical Learning Theory. Mathematics of Operations Research 50(1):40-67.

https://doi.org/10.1287/moor.2021.0076

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Towards Optimal Problem Dependent Generalization Error Bounds in Statistical Learning Theory

References

Volume 50, Issue 1

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News