[Paper] Straggler Tolerant and Resilient DL Training on Homogeneous GPUs
Source: arXiv - 2512.09685v1
Overview
Deep‑learning practitioners have long relied on homogeneous GPU clusters to speed up model training, yet “stragglers” – slow workers that hold up the whole job – remain a hidden bottleneck. This paper uncovers why stragglers persist even in balanced GPU farms and introduces STAR (Straggler‑Tolerant And Resilient) – a system that dynamically picks the best synchronization strategy and reallocates CPU/bandwidth resources to keep training fast and accurate.
Key Contributions
- Empirical diagnosis of stragglers in homogeneous GPU setups, showing that CPU and network bandwidth imbalances are the primary culprits.
- Critical evaluation of the common mitigation of switching from synchronous SGD (SSGD) to asynchronous SGD (ASGD), revealing that ASGD can worsen time-to-accuracy (TTA) and even create more stragglers.
- STAR system design:
  - New group-based synchronization modes that let subsets of workers update parameters together.
  - A heuristic and a machine-learning selector that automatically choose the best mode for a given workload.
  - Resource-aware allocation that steers parameter-server (PS) placement and throttles gradient traffic to avoid overloading CPUs and network links.
- Trace‑driven evaluation on AWS demonstrating 48‑84 % (PS architecture) and 51‑70 % (all‑reduce architecture) reductions in TTA compared with state‑of‑the‑art baselines, while preserving final model accuracy.
- Open‑source release of the STAR codebase, enabling immediate experimentation.
Methodology
- Benchmark Suite & Instrumentation – Ran a battery of popular DL workloads (e.g., ResNet‑50, BERT) on homogeneous GPU clusters in AWS, instrumenting CPU, GPU, and network metrics to pinpoint where delays originated.
- Straggler Characterization – Correlated per-iteration runtimes with CPU utilization and NIC bandwidth to quantify how often and why workers lagged behind (a correlation sketch follows this list).
- Design of Synchronization Modes – Instead of classic "all-workers-sync" (SSGD) or fully asynchronous (ASGD), STAR defines group-sync modes in which workers are partitioned into logical groups that synchronize internally before a global update (see the group-sync sketch after this list).
- Mode Selection Engine – two selectors choose the sync mode automatically (see the selector sketch after this list):
  - Heuristic: simple rules (e.g., if CPU utilization > 80 %, shrink the group size).
  - ML Model: a lightweight regression trained on historical traces predicts TTA for each mode and picks the best.
- Resource-Aware Scheduler – When a job requests PS instances, STAR evaluates current CPU/bandwidth headroom and may relocate PSs or throttle gradient traffic to keep the overall system balanced (see the placement sketch after this list).
- Trace‑Driven Simulation – Real‑world traces from the AWS runs feed a simulator that evaluates STAR against baseline SSGD/ASGD under identical hardware conditions.
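The straggler characterization boils down to a correlation analysis over the collected traces. Below is a minimal Python sketch, assuming a pandas frame with one row per worker-iteration and columns for iteration time, CPU utilization, and NIC throughput; the column names and the 1.5× slowdown threshold are illustrative, not the paper's trace schema.

```python
# Sketch of the straggler-characterization step: correlate per-iteration
# runtimes with host-level metrics. Column names and the slowdown threshold
# are assumptions made for illustration.
import pandas as pd

def characterize_stragglers(trace: pd.DataFrame, slowdown_factor: float = 1.5):
    """`trace` has one row per (worker, iteration) with columns
    'iteration', 'iter_time_s', 'cpu_util', 'nic_mbps'."""
    # A worker is a straggler if its iteration ran much slower than the
    # median worker in the same iteration.
    median = trace.groupby("iteration")["iter_time_s"].transform("median")
    trace = trace.assign(is_straggler=trace["iter_time_s"] > slowdown_factor * median)
    # How strongly do CPU and NIC pressure track iteration time?
    corr = trace[["iter_time_s", "cpu_util", "nic_mbps"]].corr()["iter_time_s"]
    straggler_rate = trace["is_straggler"].mean()
    return straggler_rate, corr
```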
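The group-sync idea can be sketched with PyTorch distributed process groups. The contiguous rank grouping and the schedule below (intra-group gradient averaging every step, global parameter averaging every `global_every` steps) are illustrative assumptions, not STAR's exact update rule.

```python
# Minimal sketch of group-based synchronization with PyTorch process groups.
# Grouping policy and sync schedule are illustrative, not STAR's algorithm.
import torch.distributed as dist

def build_group(world_size: int, group_size: int):
    """Every rank must call this with identical arguments (new_group is a
    collective). Returns the process group containing the calling rank."""
    my_rank = dist.get_rank()
    my_group, my_size = None, None
    for start in range(0, world_size, group_size):
        ranks = list(range(start, min(start + group_size, world_size)))
        g = dist.new_group(ranks=ranks)
        if my_rank in ranks:
            my_group, my_size = g, len(ranks)
    return my_group, my_size

def group_sync_step(model, my_group, my_size, step: int, global_every: int = 8):
    # Intra-group gradient averaging: a straggler in another group no longer
    # blocks this group's parameter update.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=my_group)
            p.grad.div_(my_size)
    # Periodic global parameter averaging keeps the groups from drifting apart.
    if step % global_every == 0:
        world = dist.get_world_size()
        for p in model.parameters():
            dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
            p.data.div_(world)
```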
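The two selectors could look roughly like this. The 80 % CPU threshold comes from the summary above; the candidate group sizes, feature set, and use of a ridge regression are assumptions made for illustration.

```python
# Sketch of the heuristic and learned mode selectors. Thresholds besides the
# 80% CPU rule, the feature set, and the regressor choice are assumptions.
from dataclasses import dataclass
import numpy as np
from sklearn.linear_model import Ridge

CANDIDATE_GROUP_SIZES = [2, 4, 8, 16]  # hypothetical sync modes

@dataclass
class ClusterSnapshot:
    cpu_util: float      # fraction of CPU in use, 0..1
    nic_util: float      # fraction of NIC bandwidth in use, 0..1
    num_workers: int
    model_size_mb: float

def heuristic_mode(snap: ClusterSnapshot, current_group_size: int) -> int:
    """Shrink groups when the CPU is saturated, grow them back with headroom."""
    if snap.cpu_util > 0.80 and current_group_size > min(CANDIDATE_GROUP_SIZES):
        return current_group_size // 2
    if snap.cpu_util < 0.50 and current_group_size < max(CANDIDATE_GROUP_SIZES):
        return current_group_size * 2
    return current_group_size

class LearnedModeSelector:
    """Lightweight regressor that predicts TTA for each candidate mode from
    historical traces, then picks the mode with the lowest prediction."""
    def __init__(self):
        self.model = Ridge(alpha=1.0)

    def fit(self, snapshots, group_sizes, observed_tta):
        X = [self._features(s, g) for s, g in zip(snapshots, group_sizes)]
        self.model.fit(np.array(X), np.array(observed_tta))

    def select(self, snap: ClusterSnapshot) -> int:
        preds = {g: self.model.predict([self._features(snap, g)])[0]
                 for g in CANDIDATE_GROUP_SIZES}
        return min(preds, key=preds.get)

    @staticmethod
    def _features(snap: ClusterSnapshot, group_size: int):
        return [snap.cpu_util, snap.nic_util, snap.num_workers,
                snap.model_size_mb, group_size]
```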
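Finally, a minimal sketch of resource-aware PS placement, assuming per-node CPU and bandwidth headroom are observable. The greedy headroom score and the per-PS resource costs are invented for illustration; the summary only states that STAR places PSs and throttles gradient traffic based on headroom.

```python
# Sketch of resource-aware parameter-server placement. The headroom metric,
# greedy scoring, and per-PS costs are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu_free: float   # idle cores
    bw_free: float    # free NIC bandwidth, Gbit/s

def place_parameter_servers(nodes, num_ps, cpu_per_ps=4.0, bw_per_ps=5.0):
    """Greedily place PS shards on the nodes with the most combined
    CPU/bandwidth headroom, skipping nodes that would be overloaded."""
    placement = []
    candidates = sorted(nodes,
                        key=lambda n: min(n.cpu_free / cpu_per_ps,
                                          n.bw_free / bw_per_ps),
                        reverse=True)
    for node in candidates:
        if len(placement) == num_ps:
            break
        if node.cpu_free >= cpu_per_ps and node.bw_free >= bw_per_ps:
            node.cpu_free -= cpu_per_ps   # reserve the headroom
            node.bw_free -= bw_per_ps
            placement.append(node.name)
    return placement  # may be shorter than num_ps if the cluster is saturated
```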
Results & Findings
| Architecture | Baseline (SSGD) TTA | STAR TTA | TTA Reduction | Accuracy Impact |
|---|---|---|---|---|
| Parameter-Server (PS) | 100 % (reference) | 16-52 % of baseline | 48-84 % | No loss (within 0.1 % of SSGD) |
| All-Reduce | 100 % (reference) | 30-49 % of baseline | 51-70 % | No loss (within 0.1 % of SSGD) |
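To read the table: the reduction column is relative to the SSGD baseline,

$$\text{TTA reduction} = 1 - \frac{\text{TTA}_{\text{STAR}}}{\text{TTA}_{\text{SSGD}}},$$

so, for example, a STAR TTA of 30 % of baseline corresponds to a 70 % reduction.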
- Straggler frequency dropped from ~15 % of iterations to < 3 % after STAR’s resource rebalancing.
- ASGD performed worse than SSGD in 70 % of the tested scenarios, confirming the authors’ hypothesis that higher resource consumption offsets any latency gains.
- The ML selector outperformed the heuristic by ~5 % in TTA reduction, while still being fast enough to run online.
Practical Implications
- For Cloud‑Hosted Training – Companies running large‑scale DL jobs on services like AWS, Azure, or GCP can plug STAR into their existing pipelines to shave days off training cycles without buying extra GPUs.
- Cost Savings – Faster TTA translates directly into lower compute-hour bills; a TTA reduction of roughly 50 % could halve the cost of a typical BERT pre-training run.
- Co‑Location Friendly – STAR’s CPU/bandwidth‑aware allocation means you can safely share nodes with other workloads (e.g., data preprocessing) without causing interference.
- Simplified Ops – The automatic mode selector removes the need for engineers to manually tune sync/async settings for each model or cluster size.
- Open-Source Integration – Since the code is publicly available, it can be integrated with popular DL frameworks (TensorFlow, PyTorch) via a thin wrapper, lowering the barrier to adoption (a hypothetical wrapper sketch follows).
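To make that integration point concrete, here is a purely hypothetical sketch of what such a thin wrapper might look like around a standard PyTorch training loop; the `star` package and every name in it are invented for illustration and are not the released codebase's actual API.

```python
# Hypothetical integration sketch. The `star` module, `star.init`, and
# `star.GroupSyncOptimizer` are invented names for illustration only;
# consult the released STAR codebase for its real interface.
import torch
import star  # hypothetical package name

def train(model, loader, epochs: int = 1):
    star.init()  # hypothetical: join the cluster, start CPU/NIC monitoring
    opt = star.GroupSyncOptimizer(                    # hypothetical wrapper
        torch.optim.SGD(model.parameters(), lr=0.1),  # around a plain optimizer
        mode="auto",            # let the heuristic/ML selector pick the sync mode
    )
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()  # hypothetical: group-aware gradient synchronization
```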
Limitations & Future Work
- Homogeneous GPU focus – Assumes identical GPU models; heterogeneous clusters (mix of V100, A100, etc.) may exhibit different straggler patterns.
- Static Resource Profiles – Scheduler relies on relatively stable CPU/bandwidth baselines; highly bursty workloads could still cause unexpected stalls.
- Scalability to thousands of nodes – Experiments capped at a few hundred GPUs; extending the group‑sync logic and ML selector to massive clusters remains an open challenge.
- Future directions include extending STAR to heterogeneous hardware, incorporating more sophisticated network‑aware grouping (e.g., topology‑aware), and exploring reinforcement‑learning‑based mode selection for continuously evolving workloads.
Authors
- Zeyu Zhang
- Haiying Shen
Paper Information
- arXiv ID: 2512.09685v1
- Categories: cs.DC
- Published: December 10, 2025