Towards Optimal Problem Dependent Generalization Error Bounds in Statistical Learning Theory
Published Online:19 Jan 2024https://doi.org/10.1287/moor.2021.0076
References
- [1] (2017) Efficient policy learning. Preprint, submitted February 9, https://arxiv.org/abs/1702.02896.Google Scholar
- [2] (2017) Statistical guarantees for the EM algorithm: From population to sample-based analysis. Ann. Statist. 45(1):77–120.Crossref, Google Scholar
- [3] (2020) Failures of model-dependent generalization bounds for least-norm interpolation. Preprint, submitted October 16, https://arxiv.org/abs/2010.08479.Google Scholar
- [4] (2005) Local Rademacher complexities. Ann. Statist. 33(4):1497–1537.Crossref, Google Scholar
- [5] (2017) Spectrally-normalized margin bounds for neural networks. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 6240–6249.Google Scholar
- [6] (2019) Robust Wasserstein profile inference and applications to machine learning. J. Appl. Probab. 56(3):830–857.Crossref, Google Scholar
- [7] (2009) Tighter bounds for structured estimation. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 281–288.Google Scholar
- [8] (2018) Graphical convergence of subgradients in nonconvex optimization and learning. Preprint, submitted October 17, https://arxiv.org/abs/1810.07590.Google Scholar
- [9] (1995) Lower bounds in pattern recognition and learning. Pattern Recognition 28(7):1011–1018.Crossref, Google Scholar
- [10] (2019) Orthogonal statistical learning. Preprint, submitted January 25, https://arxiv.org/abs/1901.09036.Google Scholar
- [11] (2018) Uniform convergence of gradients for non-convex learning and optimization. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 8745–8756.Google Scholar
- [12] (2013) Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4):2341–2368.Crossref, Google Scholar
- [13] (2006) Concentration inequalities and asymptotic results for ratio type empirical processes. Ann. Probab. 34(3):1143–1216.Crossref, Google Scholar
- [14] (2018) Size-independent sample complexity of neural networks. Conf. Learn. Theory (PMLR, New York), 297–299.Google Scholar
- [15] (2006) A Distribution-Free Theory of Nonparametric Regression (Springer Science & Business Media, New York).Google Scholar
- [16] (1972) Local asymptotic minimax and admissibility in estimation. Proc. Sixth Berkeley Sympos. Math. Statist. Probab., vol. 1 (University of California Press, California), 175–194.Google Scholar
- [17] (2016) Identity matters in deep learning. Preprint, submitted November 14, https://arxiv.org/abs/1611.04231.Google Scholar
- [18] (2018) Gradient descent learns linear dynamical systems. J. Machine Learn. Res. 19(1):1025–1068.Google Scholar
- [19] (2020) Near-optimal methods for minimizing star-convex functions and beyond. Conf. Learn. Theory (PMLR, New York), 1894–1938.Google Scholar
- [20] (2020) Instability, computational efficiency and statistical accuracy. Preprint, submitted May 22, https://arxiv.org/abs/2005.11411.Google Scholar
- [21] (2009) On the generalization ability of online strongly convex programming algorithms. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 801–808.Google Scholar
- [22] (2016) Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. Joint Eur. Conf. Machine Learn. Knowledge Discovery Databases (Springer, Berlin, Heidelberg), 795–811.Google Scholar
- [23] (2018) An alternative view: When does SGD escape local minima? Preprint, submitted February 17, https://arxiv.org/abs/1802.06175.Google Scholar
- [24] (2011) Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: Ecole d’Eté de Probabilités de Saint-Flour XXXVIII-2008, vol. 2033 (Springer Science & Business Media, New York).Crossref, Google Scholar
- [25] (2000) Rademacher processes and bounding the risk of function learning. High Dimensional Probability II (Springer, Berlin, Heidelberg), 443–457.Crossref, Google Scholar
- [26] (2018) Global convergence of EM algorithm for mixtures of two component linear regression. Preprint, submitted October 12, https://arxiv.org/abs/1810.05752.Google Scholar
- [27] (2012) Regime change: Bit-depth vs. measurement-rate in compressive sensing. IEEE Trans. Signal Processing 60(7):3496–3505.Crossref, Google Scholar
- [28] (1972) Limits of experiments. Proc. Sixth Berkeley Sympos. Math. Statist. Probab., vol. 1: Theory Statist. (The Regents of the University of California, California).Google Scholar
- [29] (2021) Sharper generalization bounds for learning with gradient-dominated objective functions. Internat. Conf. Learning Representations (ICLR, Appleton, WI).Google Scholar
- [30] (2017) Convergence analysis of two-layer neural networks with ReLU activation. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 597–607.Google Scholar
- [31] (2019) Rapid, robust, and reliable blind deconvolution via nonconvex optimization. Appl. Comput. Harmonic Anal. 47(3):893–934.Crossref, Google Scholar
- [32] (2015) Learning with square loss: Localization through offset Rademacher complexity. Conf. Learn. Theory (PMLR, New York), 1260–1285.Google Scholar
- [33] (2005) Smallest singular value of random matrices and geometry of random polytopes. Adv. Math. 195(2):491–523.Crossref, Google Scholar
- [34] (2016) Quadratic optimization with orthogonality constraints: Explicit Lojasiewicz exponent and linear convergence of line-search methods. Internat. Conf. Machine Learn. (PMLR, New York), 1158–1167.Google Scholar
- [35] (2018) Fast rates of ERM and stochastic approximation: Adaptive to error bound conditions. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 4678–4689.Google Scholar
- [36] (2002) Pattern classification and learning theory. Principles of Nonparametric Learning (Springer, Berlin, Heidelberg), 1–56.Crossref, Google Scholar
- [37] (2006) Risk bounds for statistical learning. Ann. Statist. 34(5):2326–2366.Crossref, Google Scholar
- [38] (2009) Empirical Bernstein bounds and sample variance penalization. Preprint, submitted July 21, https://arxiv.org/abs/0907.3740.Google Scholar
- [39] (2007) The EM Algorithm and Extensions, vol. 382 (John Wiley & Sons, Hoboken, NJ).Google Scholar
- [40] (2018) The landscape of empirical risk for nonconvex losses. Ann. Statist. 46(6A):2747–2774.Crossref, Google Scholar
- [41] (2014) Learning without concentration. Conf. Learn. Theory (PMLR, New York), 25–39.Google Scholar
- [42] (2017) On aggregation for heavy-tailed classes. Probab. Theory Related Fields 168(3–4):641–674.Crossref, Google Scholar
- [43] (2018) Learning without concentration for general loss functions. Probab. Theory Related Fields 171(1–2):459–502.Crossref, Google Scholar
- [44] (2018) Robust covariance estimation under l4−l2 norm equivalence. Preprint, submitted September 27, https://arxiv.org/abs/1809.10462.Google Scholar
- [45] (2019) Uniform convergence may be unable to explain generalization in deep learning. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 11611–11622.Google Scholar
- [46] (2017) Variance-based regularization with convex objectives. Adv. Neural Inform. Processing Systems (The MIT Press, Cambridge, MA), 2971–2980.Google Scholar
- [47] (2013) Algorithms for direct 0–1 loss optimization in binary classification. Internat. Conf. Machine Learn. (ACM, New York), 1085–1093.Google Scholar
- [48] (2012) Convergence of Stochastic Processes (Springer Science & Business Media, New York).Google Scholar
- [49] (2016) Stochastic variance reduction for nonconvex optimization. Internat. Conf. Machine Learn. (PMLR, New York), 314–323.Google Scholar
- [50] (1964) Principles of Mathematical Analysis, vol. 3 (McGraw-Hill, New York).Google Scholar
- [51] (2009) Stochastic convex optimization. Conf. Learn. Theory (PMLR, New York), 5–13.Google Scholar
- [52] (2010) Learnability, stability and uniform convergence. J. Machine Learn. Res. 11:2635–2670.Google Scholar
- [53] (2009) Lectures on Stochastic Programming: Modeling and Theory (SIAM, Philadelphia).Crossref, Google Scholar
- [54] (2010) Note on refined Dudley integral covering number bound. Technical report, https://ttic.uchicago.edu/∼karthik/dudley.pdf.Google Scholar
- [55] (2018) A geometric analysis of phase retrieval. Foundations Comput. Math. 18(5):1131–1198.Crossref, Google Scholar
- [56] (2015) Counterfactual risk minimization: Learning from logged bandit feedback. Internat. Conf. Machine Learn. (ACM, New York), 814–823.Google Scholar
- [57] (2003) Optimal rates of aggregation. Learning Theory and Kernel Machines (Springer, Berlin, Heidelberg), 303–313.Crossref, Google Scholar
- [58] (1989) On the asymptotic information bound. Ann. Statist. 17(4):1487–1500.Crossref, Google Scholar
- [59] (2000) Asymptotic Statistics, vol. 3 (Cambridge University Press, Cambridge, UK).Google Scholar
- [60] (1996) Weak Convergence and Empirical Processes (Springer, Berlin, Heidelberg).Crossref, Google Scholar
- [61] (2018) High-Dimensional Probability: An Introduction with Applications in Data Science, vol. 47 (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
- [62] (2019) High-Dimensional Statistics: A Non-Asymptotic Viewpoint, vol. 48 (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
- [63] (2014) High dimensional expectation-maximization algorithm: Statistical optimization and asymptotic normality. Preprint, submitted December 30, https://arxiv.org/abs/1412.8729.Google Scholar
- [64] (2020) Toward problem-dependent optimal learning rates. Adv. Neural Inform. Processing Systems, vol. 33 (The MIT Press, Cambridge, MA), 2196–2206.Google Scholar
- [65] (2019) Stochastic approximation of smooth and strongly convex functions: Beyond the o(1/t) convergence rate. Preprint, submitted January 27, https://arxiv.org/abs/1901.09344.Google Scholar
- [66] (2017) Empirical risk minimization for stochastic convex optimization: o(1/n)-and o(1/n2)-type of risk bounds. Preprint, submitted February 7, https://arxiv.org/abs/1702.02030.Google Scholar
- [67] (2017) Optimal learning via local entropies and sample compression. Preprint, submitted June 4, https://arxiv.org/abs/1706.01124.Google Scholar
- [68] (2016) Localization of VC classes: Beyond local Rademacher complexities. Internat. Conf. Algorithmic Learn. Theory (Springer, Berlin, Heidelberg), 18–33.Google Scholar
- [69] (2020) On uniform convergence and low-norm interpolation learning. Preprint, submitted June 10, https://arxiv.org/abs/2006.05942.Google Scholar
- [70] (2019) SGD converges to global minimum in deep learning via star-convex path. Preprint, submitted January 2, https://arxiv.org/abs/1901.00451.Google Scholar

