Towards Optimal Problem Dependent Generalization Error Bounds in Statistical Learning Theory

Published Online:https://doi.org/10.1287/moor.2021.0076

References

  • [1] Athey S, Wager S (2017) Efficient policy learning. Preprint, submitted February 9, https://arxiv.org/abs/1702.02896.Google Scholar
  • [2] Balakrishnan S, Wainwright MJ, Yu B (2017) Statistical guarantees for the EM algorithm: From population to sample-based analysis. Ann. Statist. 45(1):77–120.CrossrefGoogle Scholar
  • [3] Bartlett PL, Long PM (2020) Failures of model-dependent generalization bounds for least-norm interpolation. Preprint, submitted October 16, https://arxiv.org/abs/2010.08479.Google Scholar
  • [4] Bartlett PL, Bousquet O, Mendelson S (2005) Local Rademacher complexities. Ann. Statist. 33(4):1497–1537.CrossrefGoogle Scholar
  • [5] Bartlett PL, Foster DJ, Telgarsky MJ (2017) Spectrally-normalized margin bounds for neural networks. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 6240–6249.Google Scholar
  • [6] Blanchet J, Kang Y, Murthy K (2019) Robust Wasserstein profile inference and applications to machine learning. J. Appl. Probab. 56(3):830–857.CrossrefGoogle Scholar
  • [7] Chapelle O, Do CB, Teo CH, Le QV, Smola AJ (2009) Tighter bounds for structured estimation. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 281–288.Google Scholar
  • [8] Davis D, Drusvyatskiy D (2018) Graphical convergence of subgradients in nonconvex optimization and learning. Preprint, submitted October 17, https://arxiv.org/abs/1810.07590.Google Scholar
  • [9] Devroye L, Lugosi G (1995) Lower bounds in pattern recognition and learning. Pattern Recognition 28(7):1011–1018.CrossrefGoogle Scholar
  • [10] Foster DJ, Syrgkanis V (2019) Orthogonal statistical learning. Preprint, submitted January 25, https://arxiv.org/abs/1901.09036.Google Scholar
  • [11] Foster DJ, Sekhari A, Sridharan K (2018) Uniform convergence of gradients for non-convex learning and optimization. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 8745–8756.Google Scholar
  • [12] Ghadimi S, Lan G (2013) Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4):2341–2368.CrossrefGoogle Scholar
  • [13] Giné E, Koltchinskii V (2006) Concentration inequalities and asymptotic results for ratio type empirical processes. Ann. Probab. 34(3):1143–1216.CrossrefGoogle Scholar
  • [14] Golowich N, Rakhlin A, Shamir O (2018) Size-independent sample complexity of neural networks. Conf. Learn. Theory (PMLR, New York), 297–299.Google Scholar
  • [15] Györfi L, Kohler M, Krzyzak A, Walk H (2006) A Distribution-Free Theory of Nonparametric Regression (Springer Science & Business Media, New York).Google Scholar
  • [16] Hájek J (1972) Local asymptotic minimax and admissibility in estimation. Proc. Sixth Berkeley Sympos. Math. Statist. Probab., vol. 1 (University of California Press, California), 175–194.Google Scholar
  • [17] Hardt M, Ma T (2016) Identity matters in deep learning. Preprint, submitted November 14, https://arxiv.org/abs/1611.04231.Google Scholar
  • [18] Hardt M, Ma T, Recht B (2018) Gradient descent learns linear dynamical systems. J. Machine Learn. Res. 19(1):1025–1068.Google Scholar
  • [19] Hinder O, Sidford A, Sohoni N (2020) Near-optimal methods for minimizing star-convex functions and beyond. Conf. Learn. Theory (PMLR, New York), 1894–1938.Google Scholar
  • [20] Ho N, Khamaru K, Dwivedi R, Wainwright MJ, Jordan MI, Yu B (2020) Instability, computational efficiency and statistical accuracy. Preprint, submitted May 22, https://arxiv.org/abs/2005.11411.Google Scholar
  • [21] Kakade SM, Tewari A (2009) On the generalization ability of online strongly convex programming algorithms. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 801–808.Google Scholar
  • [22] Karimi H, Nutini J, Schmidt M (2016) Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. Joint Eur. Conf. Machine Learn. Knowledge Discovery Databases (Springer, Berlin, Heidelberg), 795–811.Google Scholar
  • [23] Kleinberg R, Li Y, Yuan Y (2018) An alternative view: When does SGD escape local minima? Preprint, submitted February 17, https://arxiv.org/abs/1802.06175.Google Scholar
  • [24] Koltchinskii V (2011) Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: Ecole d’Eté de Probabilités de Saint-Flour XXXVIII-2008, vol. 2033 (Springer Science & Business Media, New York).CrossrefGoogle Scholar
  • [25] Koltchinskii V, Panchenko D (2000) Rademacher processes and bounding the risk of function learning. High Dimensional Probability II (Springer, Berlin, Heidelberg), 443–457.CrossrefGoogle Scholar
  • [26] Kwon J, Qian W, Caramanis C, Chen Y, Davis D (2018) Global convergence of EM algorithm for mixtures of two component linear regression. Preprint, submitted October 12, https://arxiv.org/abs/1810.05752.Google Scholar
  • [27] Laska JN, Baraniuk RG (2012) Regime change: Bit-depth vs. measurement-rate in compressive sensing. IEEE Trans. Signal Processing 60(7):3496–3505.CrossrefGoogle Scholar
  • [28] Le Cam L (1972) Limits of experiments. Proc. Sixth Berkeley Sympos. Math. Statist. Probab., vol. 1: Theory Statist. (The Regents of the University of California, California).Google Scholar
  • [29] Lei Y, Ying Y (2021) Sharper generalization bounds for learning with gradient-dominated objective functions. Internat. Conf. Learning Representations (ICLR, Appleton, WI).Google Scholar
  • [30] Li Y, Yuan Y (2017) Convergence analysis of two-layer neural networks with ReLU activation. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 597–607.Google Scholar
  • [31] Li X, Ling S, Strohmer T, Wei K (2019) Rapid, robust, and reliable blind deconvolution via nonconvex optimization. Appl. Comput. Harmonic Anal. 47(3):893–934.CrossrefGoogle Scholar
  • [32] Liang T, Rakhlin A, Sridharan K (2015) Learning with square loss: Localization through offset Rademacher complexity. Conf. Learn. Theory (PMLR, New York), 1260–1285.Google Scholar
  • [33] Litvak AE, Pajor A, Rudelson M, Tomczak-Jaegermann N (2005) Smallest singular value of random matrices and geometry of random polytopes. Adv. Math. 195(2):491–523.CrossrefGoogle Scholar
  • [34] Liu H, Su W, So AM-C (2016) Quadratic optimization with orthogonality constraints: Explicit Lojasiewicz exponent and linear convergence of line-search methods. Internat. Conf. Machine Learn. (PMLR, New York), 1158–1167.Google Scholar
  • [35] Liu M, Zhang X, Zhang L, Jin R, Yang T (2018) Fast rates of ERM and stochastic approximation: Adaptive to error bound conditions. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 4678–4689.Google Scholar
  • [36] Lugosi G (2002) Pattern classification and learning theory. Principles of Nonparametric Learning (Springer, Berlin, Heidelberg), 1–56.CrossrefGoogle Scholar
  • [37] Massart P, Nédélec É (2006) Risk bounds for statistical learning. Ann. Statist. 34(5):2326–2366.CrossrefGoogle Scholar
  • [38] Maurer A, Pontil M (2009) Empirical Bernstein bounds and sample variance penalization. Preprint, submitted July 21, https://arxiv.org/abs/0907.3740.Google Scholar
  • [39] McLachlan GJ, Krishnan T (2007) The EM Algorithm and Extensions, vol. 382 (John Wiley & Sons, Hoboken, NJ).Google Scholar
  • [40] Mei S, Bai Y, Montanari A (2018) The landscape of empirical risk for nonconvex losses. Ann. Statist. 46(6A):2747–2774.CrossrefGoogle Scholar
  • [41] Mendelson S (2014) Learning without concentration. Conf. Learn. Theory (PMLR, New York), 25–39.Google Scholar
  • [42] Mendelson S (2017) On aggregation for heavy-tailed classes. Probab. Theory Related Fields 168(3–4):641–674.CrossrefGoogle Scholar
  • [43] Mendelson S (2018) Learning without concentration for general loss functions. Probab. Theory Related Fields 171(1–2):459–502.CrossrefGoogle Scholar
  • [44] Mendelson S, Zhivotovskiy N (2018) Robust covariance estimation under l4−l2 norm equivalence. Preprint, submitted September 27, https://arxiv.org/abs/1809.10462.Google Scholar
  • [45] Nagarajan V, Zico Kolter J (2019) Uniform convergence may be unable to explain generalization in deep learning. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 11611–11622.Google Scholar
  • [46] Namkoong H, Duchi JC (2017) Variance-based regularization with convex objectives. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 2971–2980.Google Scholar
  • [47] Nguyen T, Sanner S (2013) Algorithms for direct 0–1 loss optimization in binary classification. Internat. Conf. Machine Learn. (ACM, New York), 1085–1093.Google Scholar
  • [48] Pollard D (2012) Convergence of Stochastic Processes (Springer Science & Business Media, New York).Google Scholar
  • [49] Reddi SJ, Hefny A, Sra S, Póczos B, Smola A (2016) Stochastic variance reduction for nonconvex optimization. Internat. Conf. Machine Learn. (PMLR, New York), 314–323.Google Scholar
  • [50] Rudin W (1964) Principles of Mathematical Analysis, vol. 3 (McGraw-Hill, New York).Google Scholar
  • [51] Shalev-Shwartz S (2009) Stochastic convex optimization. Conf. Learn. Theory (PMLR, New York), 5–13.Google Scholar
  • [52] Shalev-Shwartz S, Shamir O, Srebro N, Sridharan K (2010) Learnability, stability and uniform convergence. J. Machine Learn. Res. 11:2635–2670.Google Scholar
  • [53] Shapiro A, Dentcheva D, Ruszczyński A (2009) Lectures on Stochastic Programming: Modeling and Theory (SIAM, Philadelphia).CrossrefGoogle Scholar
  • [54] Srebro N, Sridharan K (2010) Note on refined Dudley integral covering number bound. Technical report, https://ttic.uchicago.edu/∼karthik/dudley.pdf.Google Scholar
  • [55] Sun J, Qu Q, Wright J (2018) A geometric analysis of phase retrieval. Foundations Comput. Math. 18(5):1131–1198.CrossrefGoogle Scholar
  • [56] Swaminathan A, Joachims T (2015) Counterfactual risk minimization: Learning from logged bandit feedback. Internat. Conf. Machine Learn. (ACM, New York), 814–823.Google Scholar
  • [57] Tsybakov AB (2003) Optimal rates of aggregation. Learning Theory and Kernel Machines (Springer, Berlin, Heidelberg), 303–313.CrossrefGoogle Scholar
  • [58] Van der Vaart AW (1989) On the asymptotic information bound. Ann. Statist. 17(4):1487–1500.CrossrefGoogle Scholar
  • [59] Van der Vaart AW (2000) Asymptotic Statistics, vol. 3 (Cambridge University Press, Cambridge, UK).Google Scholar
  • [60] Van der Vaart AW, Wellner JA (1996) Weak Convergence and Empirical Processes (Springer, Berlin, Heidelberg).CrossrefGoogle Scholar
  • [61] Vershynin R (2018) High-Dimensional Probability: An Introduction with Applications in Data Science, vol. 47 (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • [62] Wainwright MJ (2019) High-Dimensional Statistics: A Non-Asymptotic Viewpoint, vol. 48 (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • [63] Wang Z, Gu Q, Ning Y, Liu H (2014) High dimensional expectation-maximization algorithm: Statistical optimization and asymptotic normality. Preprint, submitted December 30, https://arxiv.org/abs/1412.8729.Google Scholar
  • [64] Xu Y, Zeevi A (2020) Toward problem-dependent optimal learning rates. Adv. Neural Inform. Processing Systems, vol. 33 (The MIT Press, Cambridge, MA), 2196–2206.Google Scholar
  • [65] Zhang L, Zhou Z-H (2019) Stochastic approximation of smooth and strongly convex functions: Beyond the o(1/t) convergence rate. Preprint, submitted January 27, https://arxiv.org/abs/1901.09344.Google Scholar
  • [66] Zhang L, Yang T, Jin R (2017) Empirical risk minimization for stochastic convex optimization: o(1/n)-and o(1/n2)-type of risk bounds. Preprint, submitted February 7, https://arxiv.org/abs/1702.02030.Google Scholar
  • [67] Zhivotovskiy N (2017) Optimal learning via local entropies and sample compression. Preprint, submitted June 4, https://arxiv.org/abs/1706.01124.Google Scholar
  • [68] Zhivotovskiy N, Hanneke S (2016) Localization of VC classes: Beyond local Rademacher complexities. Internat. Conf. Algorithmic Learn. Theory (Springer, Berlin, Heidelberg), 18–33.Google Scholar
  • [69] Zhou L, Sutherland DJ, Srebro N (2020) On uniform convergence and low-norm interpolation learning. Preprint, submitted June 10, https://arxiv.org/abs/2006.05942.Google Scholar
  • [70] Zhou Y, Yang J, Zhang H, Liang Y, Tarokh V (2019) SGD converges to global minimum in deep learning via star-convex path. Preprint, submitted January 2, https://arxiv.org/abs/1901.00451.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.