Distributed Stochastic Optimization with Large Delays

Zhengyuan Zhou
Zhengyuan Zhou
[email protected]
https://orcid.org/0000-0002-0005-9411
Stern School of Business, New York University, New York, New York 10012;
Search for more papers by this author
,
Panayotis Mertikopoulos
Panayotis Mertikopoulos
[email protected]
https://orcid.org/0000-0003-2026-9616
Univ. Grenoble Alpes, CNRS, Inria, LIG, 38000 Grenoble, France;
Search for more papers by this author
,
Nicholas Bambos
Nicholas Bambos
[email protected]
Department of Management Science and Engineering, Stanford University, Stanford, California 94305
Search for more papers by this author
,
Peter Glynn
Peter Glynn
[email protected]
Department of Management Science and Engineering, Stanford University, Stanford, California 94305
Search for more papers by this author
,
Yinyu Ye
Yinyu Ye
[email protected]
Department of Management Science and Engineering, Stanford University, Stanford, California 94305
Search for more papers by this author

Stern School of Business, New York University, New York, New York 10012;

Search for more papers by this author

Panayotis Mertikopoulos

[email protected]

https://orcid.org/0000-0003-2026-9616

Univ. Grenoble Alpes, CNRS, Inria, LIG, 38000 Grenoble, France;

Search for more papers by this author

Nicholas Bambos

[email protected]

Department of Management Science and Engineering, Stanford University, Stanford, California 94305

Search for more papers by this author

Peter Glynn

[email protected]

Department of Management Science and Engineering, Stanford University, Stanford, California 94305

Search for more papers by this author

Yinyu Ye

[email protected]

Department of Management Science and Engineering, Stanford University, Stanford, California 94305

Search for more papers by this author

Published Online:1 Dec 2021https://doi.org/10.1287/moor.2021.1200

References

[1] Agarwal A, Duchi JC (2011) Distributed delayed stochastic optimization. Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, Weinberger KQ, eds. Advances in Neural Information Processing Systems 24 (Curran Associates), 873–881.Google Scholar
[2] Avron H, Druinsky A, Gupta A (2015) Revisiting asynchronous linear solvers: Provable convergence rate through randomization. J. ACM 62(6):51.Crossref, Google Scholar
[3] Benaïm M (1999) Dynamics of stochastic approximation algorithms. Azéma J, Émery M, Ledoux M, Yor M, eds. Séminaire de Probabilités XXXIII, Lecture Notes in Mathematics, vol. 1709 (Springer, Berlin), 1–68.Crossref, Google Scholar
[4] Benaïm M, Hirsch MW (1996) Asymptotic pseudotrajectories and chain recurrent flows, with applications. J. Dynamic Differential Equations 8(1):141–176.Crossref, Google Scholar
[5] Benaïm M, Schreiber SJ (2000) Ergodic properties of weak asymptotic pseudotrajectories for semiflows. J. Dynamic Differential Equations 12(3):579–598.Crossref, Google Scholar
[6] Bertsekas DP (1995) Nonlinear Programming (Athena Scientific, Nashua, NH).Google Scholar
[7] Bertsekas DP, Tsitsiklis JN (1996) Neuro-Dynamic Programming (Athena Scientific, Nashua, NH).Google Scholar
[8] Bertsekas DP, Tsitsiklis JN (1997) Parallel and Distributed Computation: Numerical Methods (Athena Scientific, Nashua, NH).Google Scholar
[9] Borkar VS (2008) Stochastic Approximation: A Dynamical Systems Viewpoint (Cambridge University Press, Cambridge, UK, and Hindustan Book Agency, New Delhi, India).Crossref, Google Scholar
[10] Carter M (2001) Foundations of Mathematical Economics (MIT Press, Cambridge, MA).Google Scholar
[11] Chaturapruek S, Duchi JC, Ré C (2015) Asynchronous stochastic convex optimization: The noise is in the noise and SGD don’t care. Adv. Neural Inform. Processing Systems 1531–1539.Google Scholar
[12] de Souza PN, Silva JN (2012) Berkeley Problems in Mathematics. Problem Books in Mathematics (Springer, New York).Google Scholar
[13] Dean J, Corrado GS, Monga R, Chen K, Devin M, Le QV, Mao MZ, et al.. (2012) Large scale distributed deep networks. Proc. 25th Internat. Conf. Neural Information Processing Systems, vol. 1 (Curran Associates), 1223–1231.Google Scholar
[14] Dean J, Corrado G, Monga R, Chen K, Devin M, Mao M, Senior A, et al. (2012) Large scale distributed deep networks. Adv. Neural Inform. Processing Systems 1223–1231.Google Scholar
[15] Dey SS, Molinaro M, Wang Q (2017) Analysis of sparse cutting planes for sparse milps with applications to stochastic milps. Math. Oper. Res. 43(1):304–332.Link, Google Scholar
[16] Feng J, Xu H, Yan S (2013) Online robust PCA via stochastic optimization. Adv. Neural Inform. Processing Systems 26(1):404–412.Google Scholar
[17] Fercoq O, Richtárik P (2015) Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4):1997–2023.Crossref, Google Scholar
[18] Feyzmahdavian HR, Aytekin A, Johansson M (2016) An asynchronous mini-batch algorithm for regularized stochastic optimization. IEEE Trans. Automated Control 61(12):3740–3754.Crossref, Google Scholar
[19] Ge R, Huang F, Jin C, Yuan Y (2015) Escaping from saddle points? Online stochastic gradient for tensor decomposition. Proc. Conf. Learn. Theory, 40:797–842.Google Scholar
[20] Ghadimi S, Lan G (2013) Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4):2341–2368.Crossref, Google Scholar
[21] Hall P, Heyde CC (1980) Martingale Limit Theory and Its Application. Probability and Mathematical Statistics (Academic Press, New York).Google Scholar
[22] Héliou A, Mertikopoulos P, Zhou Z (2020) Gradient-free online learning in continuous games with delayed rewards. Proc. 37th Internat. Conf. Machine Learn. 119:4172–4181.Google Scholar
[23] Hong M (2017) A distributed, asynchronous, and incremental algorithm for nonconvex optimization: An ADMM approach. IEEE Trans. Control Network Systems 5(3):935–945.Google Scholar
[24] Hsieh Y-G, Iutzeler F, Malick J, Mertikopoulos P (2020) Multi-agent online optimization with delays: Asynchronicity, adaptivity, and optimism. Preprint, submitted December 21, 2020, https://arxiv.org/abs/2012.11579.Google Scholar
[25] Jin C, Ge R, Netrapalli P, Kakade SM, Jordan MI (2017) How to escape saddle points efficiently. Proc. 34th Internat. Conf. Machine Learn. vol. 70, 1724–1732.Google Scholar
[26] Joulani P, György A, Szepesvári C (2016) Delay-tolerant online convex optimization: Unified analysis and adaptive-gradient algorithms. Proc. 30th Conf. Artificial Intelligence 1744–1750.Google Scholar
[27] Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Pereira F, Burges CJC, Bottou L, Weinberger KQ, eds. Advances in Neural Information Processing Systems 25 (Curran Associates), 1097–1105.Google Scholar
[28] Kushner H, Yin GG (2013) Stochastic Approximation and Recursive Algorithms and Applications. Stochastic Modelling and Applied Probability (Springer, New York).Google Scholar
[29] Lee J, Hasenbein J, Morton D (2011) Stochastic optimization models for rapid detection of viruses in cellphone networks. Oper. Res. Lett. 43(1):59–64.Google Scholar
[30] Lee JD, Simchowitz M, Jordan MI, Recht B (2016) Gradient descent converges to minimizers. 29th Annual Conference on Learning Theory (PMLR) 49:1246–1257.Google Scholar
[31] Lian X, Huang Y, Li Y, Liu J (2015) Asynchronous parallel stochastic gradient for nonconvex optimization. Adv. Neural Inform. Processing Systems 28:2737–2745.Google Scholar
[32] Lian X, Zhang H, Hsieh C-J, Huang Y, Liu J (2016) A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order. Adv. Neural Inform. Processing Systems 29:3054–3062.Google Scholar
[33] Liu J, Wright SJ (2015) Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM J. Optim. 25(1):351–376.Crossref, Google Scholar
[34] Liu J, Wright SJ, Sridhar S (2014) An asynchronous parallel randomized kaczmarz algorithm. Preprint, submitted January 20, 2014, https://arxiv.org/abs/1401.4780.Google Scholar
[35] Mania H, Pan X, Papailiopoulos D, Recht B, Ramchandran K, Jordan MI (2017) Perturbed iterate analysis for asynchronous stochastic optimization. SIAM J. Optim. 27(4):2202–2229.Crossref, Google Scholar
[36] Marecek J, Richtarik P, Takac M (2015) Distributed block coordinate descent for minimizing partially separable functions. Numerical Analysis and Optimization (Springer, Berlin), 261–288.Crossref, Google Scholar
[37] Nesterov Y (2004) Introductory Lectures on Convex Optimization: A Basic Course (Kluwer Academic Publishers, Norwell, MA).Crossref, Google Scholar
[38] Paine T, Jin H, Yang J, Lin Z, Huang T (2013) GPU asynchronous stochastic gradient descent to speed up neural network training. Preprint, submitted December 21, 2013, https://arxiv.org/abs/1312.6186.Google Scholar
[39] Quanrud K, Khashabi D (2015) Online learning with adversarial delays. Proc. 29th Internat. Conf. Neural Information Processing Systems 1270–1278.Google Scholar
[40] Recht B, Re C, Wright S, Niu F (2011) Hogwild: A lock-free approach to parallelizing stochastic gradient descent. Adv. Neural Inform. Processing Systems 24:693–701.Google Scholar
[41] Rosenbrock HH (1960) An automatic method for finding the greatest or least value of a function. Comput. J. 3(3):175–184.Crossref, Google Scholar
[42] Ruszczyński A, Shapiro A (2003) Stochastic programming models. Handbook Oper. Res. Management Sci. 10:1–64.Google Scholar
[43] Shapiro A, Philpott A (2007) A tutorial on stochastic programming.Google Scholar
[44] Tappenden R, Takac M, Richtarik P (2017) On the complexity of parallel coordinate descent. Optim. Methods Software 33(2):1–24.Google Scholar
[45] Tran K, Hosseini S, Xiao L, Finley T, Bilenko M (2015) Scaling up stochastic dual coordinate ascent. Proc. 21th ACM SIGKDD Internat. Conf. Knowledge Discovery Data Mining (ACM, New York), 1185–1194.Google Scholar
[46] Tsitsiklis JN (1984) Problems in decentralized decision making and computation. PhD thesis, MIT, Cambridge, MA.Google Scholar
[47] Tsitsiklis JN, Bertsekas DP, Athans M (1986) Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control 31(9):803–812.Google Scholar
[48] Uryasev S, Pardalos PM (2010) Stochastic Optimization: Algorithms and Applications (Springer, New York).Google Scholar
[49] Wang Y-X, Sadhanala V, Dai W, Neiswanger W, Sra S, Xing E (2016) Parallel and distributed block-coordinate frank-wolfe algorithms. Proc. Internat. Conf. Machine Learn. 48:1548–1557.Google Scholar
[50] Wright SJ (2015) Coordinate descent algorithms. Math. Programming 151(1):3–34.Crossref, Google Scholar
[51] Zhang R, Kwok J (2014) Asynchronous distributed admm for consensus optimization. Proc. Internat. Conf. Machine Learn. 32(2):1701–1709.Google Scholar
[52] Zhang S, Choromanska AE, LeCun Y (2015) Deep learning with elastic averaging sgd. Adv. Neural Inform. Processing Systems 28:685–693.Google Scholar
[53] Zhang S, Zhang C, You Z, Zheng R, Xu B (2013) Asynchronous stochastic gradient descent for DNN training. Proc. IEEE Internat. Conf. Acoustics Speech Signal Processing 6660–6663.Google Scholar
[54] Zhou Z, Mertikopoulos P, Bambos N, Boyd S, Glynn PW (2017) Stochastic mirror descent in variationally coherent optimization problems. Adv. Neural Inform. Processing Systems 7040–7049.Google Scholar
[55] Zhou Z, Mertikopoulos P, Bambos N, Boyd SP, Glynn PW (2020) On the convergence of mirror descent beyond stochastic convex programming. SIAM J. Optim. 30(1):687–716.Crossref, Google Scholar
[56] Zhou Z, Mertikopoulos P, Bambos N, Glynn P, Ye Y, Li L-J, Li F-F (2018) Distributed asynchronous optimization with unbounded delays: How slow can you go? Proc. Internat. Conf. Machine Learn., 5970–5979.Google Scholar

cover image Mathematics of Operations Research

Volume 47, Issue 3

August 2022

Pages 1707-2545, C2

Article Information

Metrics

Information

Received:June 15, 2018
Accepted:April 07, 2021
Published Online:December 01, 2021

Cite as

Zhengyuan Zhou, Panayotis Mertikopoulos, Nicholas Bambos, Peter Glynn, Yinyu Ye (2021) Distributed Stochastic Optimization with Large Delays. Mathematics of Operations Research 47(3):2082-2111.

https://doi.org/10.1287/moor.2021.1200

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Distributed Stochastic Optimization with Large Delays

References

Volume 47, Issue 3

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News