Tackling Arbitrarily Heterogeneous Data in Asynchronous Stochastic Gradient Descent Without Worker Scheduling

Published Online:https://doi.org/10.1287/ijoc.2025.1443

References

  • Agarwal A, Duchi JC (2011) Distributed delayed stochastic optimization. Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems, vol. 24 (Curran Associates, Red Hook, NY), 873–881.Google Scholar
  • Arjevani Y, Shamir O, Srebro N (2020) A tight convergence analysis for stochastic gradient descent with delayed updates. Kontorovich A, Neu G, eds. Proc. 31st Internat. Conf. Algorithmic Learn. Theory, Proceedings of Machine Learning Research, vol. 117 (PMLR, New York), 111–132.Google Scholar
  • Arjevani Y, Carmon Y, Duchi JC, Foster DJ, Srebro N, Woodworth B (2023) Lower bounds for non-convex stochastic optimization. Math. Programming 199(1):165–214.CrossrefGoogle Scholar
  • Assran M, Aytekin A, Feyzmahdavian HR, Johansson M, Rabbat MG (2020) Advances in asynchronous parallel and distributed optimization. Proc. IEEE 108(11):2013–2031.CrossrefGoogle Scholar
  • Avdiukhin D, Kasiviswanathan S (2021) Federated learning under arbitrary communication patterns. Meila M, Zhang T, eds. Proc. 38th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 139 (PMLR, New York), 425–435.Google Scholar
  • Blatt D, Hero AO, Gauchman H (2007) A convergent incremental gradient method with a constant step size. SIAM J. Optim. 18(1):29–51.CrossrefGoogle Scholar
  • Chen Y, Ning Y, Slawski M, Rangwala H (2020) Asynchronous online federated learning for edge devices with non-IID data. Proc. 2020 IEEE Internat. Conf. Big Data (IEEE, Piscataway, NJ), 15–24.Google Scholar
  • Cotter A, Shamir O, Srebro N, Sridharan K (2011) Better mini-batch algorithms via accelerated gradient methods. Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems, vol. 24 (Curran Associates, Red Hook, NY), 1647–1655.Google Scholar
  • De Sa CM, Zhang C, Olukotun K, Ré C (2015) Taming the wild: A unified analysis of Hogwild-style algorithms. Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 28 (Curran Associates, Red Hook, NY), 2674–2682.Google Scholar
  • Defazio A, Bach F, Lacoste-Julien S (2014) SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems, vol. 27 (Curran Associates, Red Hook, NY), 1646–1654.Google Scholar
  • Dekel O, Gilad-Bachrach R, Shamir O, Xiao L (2012) Optimal distributed online prediction using mini-batches. J. Machine Learn. Res. 13(6):165–202.Google Scholar
  • Dutta S, Wang J, Joshi G (2021) Slow and stale gradients can win the race. IEEE J. Selected Areas Inform. Theory 2(3):1012–1024.CrossrefGoogle Scholar
  • Feyzmahdavian HR, Aytekin A, Johansson M (2016) An asynchronous mini-batch algorithm for regularized stochastic optimization. IEEE Trans. Automatic Control 61(12):3740–3754.CrossrefGoogle Scholar
  • Fraboni Y, Vidal R, Kameni L, Lorenzi M (2023) A general theory for federated optimization with asynchronous and heterogeneous clients updates. J. Machine Learn. Res. 24(110):1–43.Google Scholar
  • Gao H, Wu G, Rossi R (2021) Provable distributed stochastic gradient descent with delayed updates. Domeniconi C, Davidson I, eds. Proc. 2021 SIAM Internat. Conf. Data Mining (SIAM, Philadelphia), 441–449.Google Scholar
  • Glasgow MR, Wootters M (2022) Asynchronous distributed optimization with stochastic delays. Camps-Valls G, Ruiz FJR, Valera I, eds. Proc. 25th Internat. Conf. Artificial Intelligence Statist., vol. 151 (PMLR, New York), 9247–9279.Google Scholar
  • Gu X, Huang K, Zhang J, Huang L (2021) Fast federated learning in the presence of arbitrary device unavailability. Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Wortman Vaughan J, eds. Adv. Neural Inform. Processing Systems, vol. 34 (Curran Associates, Red Hook, NY), 12052–12064.Google Scholar
  • Gurbuzbalaban M, Ozdaglar A, Parrilo PA (2017) On the convergence rate of incremental aggregated gradient algorithms. SIAM J. Optim. 27(2):1035–1048.CrossrefGoogle Scholar
  • Islamov R, Safaryan M, Alistarh D (2024) AsGrad: A sharp unified analysis of asynchronous-SGD algorithms. Dasgupta S, Mandt S, Li Y, eds. Proc. 27th Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research, vol. 238 (PMLR, New York), 649–657.Google Scholar
  • Karimireddy SP, Kale S, Mohri M, Reddi S, Stich S, Suresh AT (2020) Scaffold: Stochastic controlled averaging for federated learning. Daumé H III, Singh A, eds. Proc. 37th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 119 (PMLR, New York), 5132–5143.Google Scholar
  • Khaled A, Richtárik P (2023) Better theory for SGD in the nonconvex world. Trans. Machine Learn. Res. https://openreview.net/forum?id=AU4qHN2VkS.Google Scholar
  • Koloskova A, Stich SU, Jaggi M (2022) Sharper convergence guarantees for asynchronous SGD for distributed and federated learning. Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, eds. Adv. Neural Inform. Processing Systems, vol. 35 (Curran Associates, Red Hook, NY), 17202–17215.Google Scholar
  • Koloskova A, Loizou N, Boreiri S, Jaggi M, Stich S (2020) A unified theory of decentralized SGD with changing topology and local updates. Daumé H III, Singh A, eds. Proc. 37th Internat. Conf. Machine Learn., vol. 119 (PMLR, New York), 5381–5393.Google Scholar
  • Krizhevsky A (2009) Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto, Toronto.Google Scholar
  • Leblond R, Pedregosa F, Lacoste-Julien S (2018) Improved asynchronous parallel optimization analysis for stochastic incremental methods. J. Machine Learn. Res. 19(81):1–68.Google Scholar
  • Leconte L, Nguyen VM, Moulines E (2024) FAVANO: Federated averaging with asynchronous nodes. Proc. 2024 IEEE Internat. Conf. Acoustics Speech Signal Processing (IEEE, Piscataway, NJ), 5665–5669.Google Scholar
  • Leconte L, Jonckheere M, Samsonov S, Moulines E (2024) Queuing dynamics of asynchronous federated learning. Dasgupta S, Mandt S, Li Y, eds. Proc. 27th Internat. Conf. Artificial Intelligence Statist., vol. 238 (PMLR, New York), 1711–1719.Google Scholar
  • Li Q, Diao Y, Chen Q, He B (2022) Federated learning on non-IID data silos: An experimental study. Proc. 2022 IEEE 38th Internat. Conf. Data Engrg. (IEEE, Piscataway, NJ), 965–978.Google Scholar
  • Lian X, Huang Y, Li Y, Liu J (2015) Asynchronous parallel stochastic gradient for nonconvex optimization. Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 28 (Curran Associates, Red Hook, NY), 2737–2745.Google Scholar
  • Lian X, Zhang W, Zhang C, Liu J (2018) Asynchronous decentralized parallel stochastic gradient descent. Dy J, Krause A, eds. Proc. 35th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 80 (PMLR, New York), 3043–3052.Google Scholar
  • Mania H, Pan X, Papailiopoulos D, Recht B, Ramchandran K, Jordan MI (2017) Perturbed iterate analysis for asynchronous stochastic optimization. SIAM J. Optim. 27(4):2202–2229.CrossrefGoogle Scholar
  • Mishchenko K, Bach F, Even M, Woodworth BE (2022) Asynchronous SGD beats minibatch SGD under arbitrary delays. Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, eds. Adv. Neural Inform. Processing Systems, vol. 35 (Curran Associates, Red Hook, NY), 420–433.Google Scholar
  • Nedić A, Bertsekas DP, Borkar VS (2001) Distributed asynchronous incremental subgradient methods. Stud. Comput. Math. 8(C):381–407.Google Scholar
  • Nguyen J, Malik K, Zhan H, Yousefpour A, Rabbat M, Malek M, Huba D (2022) Federated learning with buffered asynchronous aggregation. Camps-Valls G, Ruiz FJR, Valera I, eds. Proc. 25th Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research, vol. 151 (PMLR, New York), 3581–3607.Google Scholar
  • Recht B, Re C, Wright S, Niu F (2011) Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems, vol. 24 (Curran Associates, Red Hook, NY), 693–701.Google Scholar
  • Roux N, Schmidt M, Bach F (2012) A stochastic gradient method with an exponential convergence rate for finite training sets. Pereira F, Burges CJC, Bottou L, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems, vol. 25 (Curran Associates, Red Hook, NY), 2672–2680.Google Scholar
  • Schmidt M, Le Roux N, Bach F (2017) Minimizing finite sums with the stochastic average gradient. Math. Programming 162(1–2):83–112.CrossrefGoogle Scholar
  • Stich SU, Karimireddy SP (2020) The error-feedback framework: SGD with delayed gradients. J. Machine Learn. Res. 21(237):1–36.Google Scholar
  • Sun Y, Mao Y, Zhang J (2024) MimiC: Combating client dropouts in federated learning by mimicking central updates. IEEE Trans. Mobile Comput. 23(7):7572–7584.CrossrefGoogle Scholar
  • Toghani MT, Uribe CA (2022) Unbounded gradients in federated learning with buffered asynchronous aggregation. Proc. 2022 58th Annual Allerton Conf. Comm. Control Comput. (IEEE, Piscataway, NJ), 1–8.Google Scholar
  • Tyurin A, Richtárik P (2023) Optimal time complexities of parallel stochastic optimization methods under a fixed computation model. Oh A, Naumann T, Globerson A, Saenko K, Hardt M, Levine S, eds. Adv. Neural Inform. Processing Systems, vol. 36 (Curran Associates, Red Hook, NY), 16515–16577.Google Scholar
  • Vanli ND, Gürbüzbalaban M, Ozdaglar A (2018) Global convergence rate of proximal incremental aggregated gradient methods. SIAM J. Optim. 28(2):1282–1300.CrossrefGoogle Scholar
  • Verbraeken J, Wolting M, Katzy J, Kloppenburg J, Verbelen T, Rellermeyer JS (2020) A survey on distributed machine learning. ACM Comput. Surveys 53(2):30.Google Scholar
  • Wang X, Jin C, Wai HT, Gu Y (2023a) Linear speedup of incremental aggregated gradient methods on streaming data. Proc. 2023 62nd IEEE Conf. Decision Control (IEEE, Piscataway, NJ), 4314–4319.Google Scholar
  • Wang X, Li Z, Jin S, Zhang J (2025) Achieving linear speedup in asynchronous federated learning with heterogeneous clients. IEEE Trans. Mobile Comput. 24(1):435–448.CrossrefGoogle Scholar
  • Wang X, Sun Y, Wai H-T, Zhang J (2026) Tackling arbitrarily heterogeneous data in asynchronous stochastic gradient descent without worker scheduling. https://doi.org/10.1287/ijoc.2025.1443.cd, https://github.com/INFORMSJoC/2025.1443.Google Scholar
  • Wang Y, Cao Y, Wu J, Chen R, Chen J (2023b) Tackling the data heterogeneity in asynchronous federated learning with cached update calibration. Proc. 12th Internat. Conf. Learn. Representations (OpenReview.net).Google Scholar
  • Wang J, Liu Q, Liang H, Joshi G, Poor HV (2020) Tackling the objective inconsistency problem in heterogeneous federated optimization. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates, Red Hook, NY), 7611–7623.Google Scholar
  • Xiao H, Rasul K, Vollgraf R (2017) Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. Preprint, submitted August 25, https://arxiv.org/abs/1708.07747.Google Scholar
  • Xie C, Koyejo S, Gupta I (2020) Asynchronous federated optimization. 12th Annual Workshop Optim. Machine Learn. (OPT 2020).Google Scholar
  • Xu C, Qu Y, Xiang Y, Gao L (2023) Asynchronous federated learning on heterogeneous devices: A survey. Comput. Sci. Rev. 50(C):100595.CrossrefGoogle Scholar
  • Yan Y, Niu C, Ding Y, Zheng Z, Tang S, Li Q, Wu F, Lyu C, Feng Y, Chen G (2024) Federated optimization under intermittent client availability. INFORMS J. Comput. 36(1):185–202.LinkGoogle Scholar
  • Yang Q, Liu Y, Chen T, Tong Y (2019) Federated machine learning: Concept and applications. ACM Trans. Intelligent Systems Tech. 10(2):12.CrossrefGoogle Scholar
  • Yurochkin M, Agarwal M, Ghosh S, Greenewald K, Hoang N, Khazaeni Y (2019) Bayesian nonparametric federated learning of neural networks. Chaudhuri K, Salakhutdinov R, eds. Proc. 36th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 97 (PMLR, New York), 7252–7261.Google Scholar
  • Zhang F, Liu X, Lin S, Wu G, Zhou X, Jiang J, Ji X (2023) No one idles: Efficient heterogeneous federated learning with parallel edge and server computation. Krause A, Brunskill E, Cho K, Engelhardt B, Sabato S, Scarlett J, eds. Proc. 40th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 202 (PMLR, New York), 41399–41413.Google Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.