[Paper] Clipping Makes Distributed and Federated Asynchronous SGD Robust to Stragglers

Published: June 11, 2026 at 08:43 AM EDT

Source: arXiv - 2606.13287v1

Overview

In modern machine learning, parallelizing training is an important strategy for increasing scale. Asynchronous stochastic gradient descent (ASGD) maximizes the utilization of available hardware by avoiding waiting for slow workers. However, with constant step sizes, the convergence of ASGD is still negatively affected by slow workers because of large delays in updates. At the same time, it has been empirically observed in asynchronous training of deep learning models that gradient clipping “stabilizes” training. In this work, we provide a theoretical justification for this behavior: we show that clipping removes the dependence of the oracle complexity on the maximum delay. We employ a sub-Weibull model of gradient noise, which generalizes sub-Gaussian and sub-exponential distributions to more heavy-tailed distributions and is motivated by empirical observations in deep learning. We show convergence in expectation and, for the first time in asynchronous optimization, convergence with high probability.
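To make the setting concrete, here is a minimal sketch of asynchronous SGD with norm clipping on a toy problem. It is not the authors' algorithm: the objective f(x) = 0.5 * ||x||^2, the Gaussian noise (one special case of sub-Weibull noise), the fixed per-worker delays, and all names (`clip`, `noisy_grad`, `clipped_asgd`, `tau`) are assumptions made for illustration.

```python
import math
import random


def clip(g, tau):
    """Rescale g so that its Euclidean norm is at most tau."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm <= tau:
        return list(g)
    return [gi * (tau / norm) for gi in g]


def noisy_grad(x, rng, noise=0.1):
    """Toy oracle: gradient of 0.5 * ||x||^2 plus Gaussian noise."""
    return [xi + rng.gauss(0.0, noise) for xi in x]


def clipped_asgd(x0, ticks, step_size, tau, delays, seed=0):
    """Simulate a server that applies clipped, possibly stale gradients.

    delays[i] is a positive integer: the number of server ticks worker i
    needs to return a gradient after reading the current iterate.
    """
    rng = random.Random(seed)
    x = list(x0)
    # in_flight[i] = (arrival tick, gradient computed at the iterate
    # that worker i last read)
    in_flight = [(d, noisy_grad(x, rng)) for d in delays]
    for t in range(1, ticks + 1):
        for i, d in enumerate(delays):
            arrival, g = in_flight[i]
            if arrival != t:
                continue
            # Clipping bounds how far any single stale update can move x.
            g = clip(g, tau)
            x = [xi - step_size * gi for xi, gi in zip(x, g)]
            # Worker i reads the fresh iterate and starts a new computation.
            in_flight[i] = (t + d, noisy_grad(x, rng))
    return x


# Three fast workers and one straggler that is 50 times slower.
x_final = clipped_asgd([10.0, -10.0], ticks=2000, step_size=0.1,
                       tau=1.0, delays=[1, 1, 1, 50])
```

The server never waits for the straggler; its gradient is simply applied when it arrives, computed at an iterate that is many updates old. Clipping caps the size of that late update at `step_size * tau`, which gives one intuition for why the maximum delay might matter less once gradients are clipped.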

Key Contributions

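According to the abstract, the paper's main contributions are:

A theoretical justification for the empirical observation that gradient clipping stabilizes asynchronous training
A proof that clipping removes the dependence of the oracle complexity on the maximum delay
An analysis under a sub-Weibull model of gradient noise, which covers sub-Gaussian, sub-exponential, and heavier-tailed distributions
Convergence in expectation and, for the first time in asynchronous optimization, convergence with high probability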

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

This research contributes to machine learning (cs.LG), in particular to asynchronous distributed and federated training in the presence of slow workers.

Authors

Samuel Erickson
Mikael Johansson

Paper Information

arXiv ID: 2606.13287v1
Categories: cs.LG, cs.DC, math.OC
Published: June 11, 2026
