Proximal Gradient Methods with Adaptive Subspace Sampling

Dmitry Grishchenko
Dmitry Grishchenko
[email protected]
Laboratoire Jean Kuntzmann, Université Grenoble Alpes, 38402 Saint-Martin-d’Heres, France;Laboratoire d’Informatique de Grenoble, Université Grenoble Alpes, 38401 Saint-Martin-d’Heres, France;
Search for more papers by this author
,
Franck Iutzeler
Corresponding Author
Franck Iutzeler
[email protected]
https://orcid.org/0000-0003-2537-380X
Laboratoire Jean Kuntzmann, Université Grenoble Alpes, 38402 Saint-Martin-d’Heres, France;
Search for more papers by this author
,
Jérôme Malick
Jérôme Malick
[email protected]
Laboratoire Jean Kuntzmann, Université Grenoble Alpes, 38402 Saint-Martin-d’Heres, France;Centre National de la Recherche Scientifique, 75016 Paris, France
Search for more papers by this author

Laboratoire Jean Kuntzmann, Université Grenoble Alpes, 38402 Saint-Martin-d’Heres, France;Laboratoire d’Informatique de Grenoble, Université Grenoble Alpes, 38401 Saint-Martin-d’Heres, France;

Search for more papers by this author

Franck Iutzeler

Corresponding Author

Franck Iutzeler

[email protected]

https://orcid.org/0000-0003-2537-380X

Laboratoire Jean Kuntzmann, Université Grenoble Alpes, 38402 Saint-Martin-d’Heres, France;

Search for more papers by this author

Jérôme Malick

[email protected]

Laboratoire Jean Kuntzmann, Université Grenoble Alpes, 38402 Saint-Martin-d’Heres, France;Centre National de la Recherche Scientifique, 75016 Paris, France

Search for more papers by this author

Published Online:23 Feb 2021https://doi.org/10.1287/moor.2020.1092

References

[1] Bach F, Jenatton R, Mairal J, Obozinski G (2012) Optimization with sparsity-inducing penalties. Foundations Trends Machine Learning 4(1):1–106.Google Scholar
[2] Bauschke HH, Combettes PL (2011) Convex Analysis and Monotone Operator Theory in Hilbert Spaces (Springer Science & Business Media, Berlin).Crossref, Google Scholar
[3] Bertsekas D (1976) On the Goldstein-Levitin-Polyak gradient projection method. IEEE Trans. Automatic Control 21(2):174–184.Crossref, Google Scholar
[4] Bianchi P, Hachem W, Iutzeler F (2016) A coordinate descent primal-dual algorithm and application to distributed asynchronous optimization. IEEE Trans. Automatic Control 61(10):2947–2957.Crossref, Google Scholar
[5] Bubeck S (2015) Convex optimization: Algorithms and complexity. Foundations Trends Machine Learn. 8(3-4):231–357.Google Scholar
[6] Burke JV, Moré JJ (1988) On the identification of active constraints. SIAM J. Numerical Anal. 25(5):1197–1211.Crossref, Google Scholar
[7] Candes EJ, Wakin MB, Boyd SP (2008) Enhancing sparsity by reweighted ℓ 1 minimization. J. Fourier Anal. Appl. 14(5-6):877–905.Crossref, Google Scholar
[8] Combettes PL, Pesquet JC (2007) Proximal thresholding algorithm for minimization over orthonormal bases. SIAM J. Optim. 18(4):1351–1376.Crossref, Google Scholar
[9] Combettes PL, Pesquet JC (2011) Proximal splitting methods in signal processing. Bauschke H, Burachik R, Combettes P, Elser V, Luke D, Wolkowicz H, eds. Fixed-Point Algorithms for Inverse Problems in Science and Engineering, Springer Optimization and its Applications, vol. 49 (Springer, New York), 185–212.Crossref, Google Scholar
[10] Condat L (2013) A direct algorithm for 1D total variation denoising. IEEE Signal Processing Lett. 20(11):1054–1057.Crossref, Google Scholar
[11] Dhillon IS, Ravikumar PK, Tewari A (2011) Nearest neighbor based greedy coordinate descent. Adv. Neural Inform Processing Systems 24:2160–2168.Google Scholar
[12] Donoho DL (1995) De-noising by soft-thresholding. IEEE Trans. Inform. Theory 41(3):613–627.Crossref, Google Scholar
[13] Drusvyatskiy D, Lewis AS (2014) Optimality, identifiability, and sensitivity. Math. Programming 147(1-2):467–498.Crossref, Google Scholar
[14] Fadili J, Garrigos G, Malick J, Peyré G (2019) Model consistency for learning with mirror-stratifiable regularizers. Chaudhuri K, Sugiyama M, eds. 22nd Internat. Conf. Artificial Intelligence Statist. (Proc. of Machine Learn. Res.) 89:1236–1244.Google Scholar
[15] Fadili J, Malick J, Peyré G (2018) Sensitivity analysis for mirror-stratifiable convex functions. SIAM J. Optim. 28(4):2975–3000.Crossref, Google Scholar
[16] Fercoq O, Gramfort A, Salmon J (2015) Mind the duality gap: Safer rules for the lasso. Bach F, Blei D, eds. Proc. 32nd Internat. Conf. Machine Learn. (MLResearchPress), 333–342.Google Scholar
[17] Frongillo R, Reid MD (2015) Convergence analysis of prediction markets via randomized subspace descent. Adv. Neural Inform. Processing Systems 28:3034–3042.Google Scholar
[18] Glasmachers T, Dogan U (2013) Accelerated coordinate descent with adaptive coordinate frequencies. Proc. 5th Asian Conf. Machine Learn. (Proc. Machine Learn. Res.), 29:72–86.Google Scholar
[19] Grishchenko D, Iutzeler F, Malick J, Amini MR (2018) Asynchronous distributed learning with sparse communications and identification. Preprint, submitted December 10, https://arxiv.org/abs/1812.03871.Google Scholar
[20] Hanzely F, Mishchenko K, Richtarik P (2018) SEGA: Variance reduction via gradient sketching. Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, eds. Adv. Neural Inform. Processing Systems, (Curran Associates, Montreal, Canada), 2083–2094.Google Scholar
[21] Hare W, Lewis AS (2004) Identifying active constraints via partial smoothness and prox-regularity. J. Convex Anal. 11(2):251–266.Google Scholar
[22] Lewis AS (2002) Active sets, nonsmoothness, and sensitivity. SIAM J. Optim. 13(3):702–725.Crossref, Google Scholar
[23] Lewis AS, Liang J (2018) Partial smoothness and constant rank. Preprint, submitted July 9, https://arxiv.org/abs/1807.03134.Google Scholar
[24] Liang J, Fadili J, Peyré G (2017) Activity identification and local linear convergence of forward–backward-type methods. SIAM J. Optim. 27(1):408–437.Crossref, Google Scholar
[25] Loshchilov I, Schoenauer M, Sebag M (2011) Adaptive coordinate descent. Proc. 13th Annu. Conf. Genetic Evolutionary Comput. (Association for Computing Machinery, New York), 885–892.Google Scholar
[26] Mishchenko K, Iutzeler F, Malick J (2020) A distributed flexible delay-tolerant proximal gradient algorithm. SIAM J. Optim. 30(1):933–959.Crossref, Google Scholar
[27] Namkoong H, Sinha A, Yadlowsky S, Duchi JC (2017) Adaptive sampling probabilities for non-smooth optimization. Proc. 34th Internat. Conf. Machine Learn. (Proc. Machine Learn. Res.), 70:2574–2583.Google Scholar
[28] Necoara I, Patrascu A (2014) A random coordinate descent algorithm for optimization problems with composite objective function and linear coupled constraints. Comput. Optim. Appl. 57(2):307–337.Crossref, Google Scholar
[29] Nesterov Y (2012) Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2):341–362.Crossref, Google Scholar
[30] Nutini J, Laradji I, Schmidt M (2017) Let’s make block coordinate descent go fast: Faster greedy rules, message-passing, active-set complexity, and superlinear convergence. Preprint, submitted December 23, https://arxiv.org/abs/1712.08859.Google Scholar
[31] Nutini J, Schmidt M, Laradji I, Friedlander M, Koepke H (2015) Coordinate descent converges faster with the Gauss-Southwell rule than random selection. Bach F, Blei D, eds. Proc. 32nd Internat. Conf. Machine Learn. (Proc. Machine Learn. Res.), 37:1632–1641.Google Scholar
[32] Ogawa K, Suzuki Y, Takeuchi I (2013) Safe screening of non-support vectors in pathwise SVM computation. Dasgupta S, McAllester D, eds. Proc. 30th Internat. Conf. Machine Learn. (Proc. Machine Learn. Res.), 38:1382–1390.Google Scholar
[33] Perekrestenko D, Cevher V, Jaggi M (2017) Faster coordinate descent via adaptive importance sampling. Singh A, Zhu J, eds. Proc. 20th Internat. Conf. Artificial Intelligence Statist. (Proc. Machine Learn. Res.), 54:869–877.Google Scholar
[34] Poon C, Liang J, Schoenlieb C (2018) Local convergence properties of SAGA/Prox-SVRG and acceleration. Proc. 35th Internat. Conf. Machine Learn. (Proc. Machine Learn. Res.), 80:4124–4132.Google Scholar
[35] Qu Z, Richtárik P (2016) Coordinate descent with arbitrary sampling I: Algorithms and complexity. Optim. Methods Software 31(5):829–857.Crossref, Google Scholar
[36] Richtárik P, Takáč M (2014) Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Programing 144(1-2):1–38.Crossref, Google Scholar
[37] Richtárik P, Takáč M (2016) On optimal probabilities in stochastic coordinate descent methods. Optim. Lett. 10(6):1233–1243.Crossref, Google Scholar
[38] Stich SU, Raj A, Jaggi M (2017) Safe adaptive importance sampling. Adv. Neural Inform. Processing Systems 30:4381–4391.Google Scholar
[39] Sun Y, Jeong H, Nutini J, Schmidt M (2019) Are we there yet? Manifold identification of gradient-related proximal methods. 22nd Internat. Conf. Artificial Intelligence Statist. (Proc. Machine Learn. Res.), 89:1110–1119.Google Scholar
[40] Teboulle M (2018) A simplified view of first order methods for optimization. Math. Programming 40(1):1–30.Google Scholar
[41] Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused lasso. J. Roy. Statist. Soc. Ser. B. Statist. Methodological 67(1):91–108.Crossref, Google Scholar
[42] Tseng P (2001) Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 109(3):475–494.Crossref, Google Scholar
[43] Vaiter S, Golbabaee M, Fadili J, Peyré G (2015) Model selection with low complexity priors. Inform. Inference 4(3):230–287.Google Scholar
[44] Wright SJ (1993) Identifiable surfaces in constrained optimization. SIAM J. Control Optim. 31(4):1063–1079.Crossref, Google Scholar
[45] Wright SJ (2012) Accelerated block-coordinate relaxation for regularized optimization. SIAM J. Optim. 22(1):159–186.Crossref, Google Scholar
[46] Wright SJ (2015) Coordinate descent algorithms. Math. Programming 151(1):3–34.Crossref, Google Scholar
[47] Yuan L, Liu J, Ye J (2011) Efficient methods for overlapping group lasso. Adv. Neural Inform. Processing Systems 24:352–360.Google Scholar
[48] Zhao P, Zhang T (2015) Stochastic optimization with importance sampling for regularized loss minimization. Bach F, Blei D, eds. Proc. 32nd Internat. Conf. Machine Learn. (Proc. Machine Learn. Res.), 37:1–9.Google Scholar

cover image Mathematics of Operations Research

Volume 46, Issue 4

November 2021

Pages 1235-1657, C2

Article Information

Metrics

Information

Received:January 07, 2019
Accepted:April 24, 2020
Published Online:February 23, 2021

Cite as

Dmitry Grishchenko, Franck Iutzeler, Jérôme Malick (2021) Proximal Gradient Methods with Adaptive Subspace Sampling. Mathematics of Operations Research 46(4):1303-1323.

https://doi.org/10.1287/moor.2020.1092

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Proximal Gradient Methods with Adaptive Subspace Sampling

References

Volume 46, Issue 4

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News