[Paper] Straggler Tolerant and Resilient DL Training on Homogeneous GPUs
Source: arXiv - 2512.09685v1
Overview
Deep‑learning practitioners have long relied on homogeneous GPU clusters to speed up model training, yet “stragglers” – slow workers that hold up the whole job – remain a hidden bottleneck. This paper uncovers why stragglers persist even in balanced GPU farms and introduces STAR (Straggler‑Tolerant And Resilient) – a system that dynamically picks the best synchronization strategy and reallocates CPU/bandwidth resources to keep training fast and accurate.
Key Contributions
- Empirical diagnosis of stragglers in homogeneous GPU setups, showing that CPU and network bandwidth imbalances are the primary culprits.
- Critical evaluation of the common mitigation of switching from synchronous SGD (SSGD) to asynchronous SGD (ASGD), revealing that ASGD can worsen time-to-accuracy (TTA) and even create more stragglers.
- STAR system design:
  - New group-based synchronization modes that let subsets of workers update parameters together.
  - A heuristic and a machine-learning selector that automatically choose the best mode for a given workload.
  - Resource-aware allocation that steers parameter-server (PS) placement and throttles gradient traffic to avoid overloading CPUs and network links.
- Trace‑driven evaluation on AWS demonstrating 48‑84 % (PS architecture) and 51‑70 % (all‑reduce architecture) reductions in TTA compared with state‑of‑the‑art baselines, while preserving final model accuracy.
- Open‑source release of the STAR codebase, enabling immediate experimentation.
Methodology
- Benchmark Suite & Instrumentation – Ran a battery of popular DL workloads (e.g., ResNet‑50, BERT) on homogeneous GPU clusters in AWS, instrumenting CPU, GPU, and network metrics to pinpoint where delays originated.
- Straggler Characterization – Correlated per-iteration runtimes with CPU utilization and NIC bandwidth to quantify how often and why workers lagged behind (a correlation sketch follows this list).
- Design of Synchronization Modes – Instead of classic "all-workers-sync" (SSGD) or fully asynchronous (ASGD), STAR defines group-sync modes in which workers are partitioned into logical groups that synchronize internally before a global update (see the group-sync sketch after this list).
- Mode Selection Engine – two selectors choose the sync mode automatically (see the selector sketch after this list):
  - Heuristic: simple rules (e.g., if CPU utilization > 80 %, shrink the group size).
  - ML Model: a lightweight regression trained on historical traces predicts TTA for each mode and picks the best.
- Resource-Aware Scheduler – When a job requests PS instances, STAR evaluates current CPU/bandwidth headroom and may relocate PSs or throttle gradient traffic to keep the overall system balanced (see the placement sketch after this list).
- Trace‑Driven Simulation – Real‑world traces from the AWS runs feed a simulator that evaluates STAR against baseline SSGD/ASGD under identical hardware conditions.
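The straggler characterization boils down to a correlation analysis over the collected traces. Below is a minimal Python sketch, assuming a pandas frame with one row per worker-iteration and columns for iteration time, CPU utilization, and NIC throughput; the column names and the 1.5× slowdown threshold are illustrative, not the paper's trace schema.

```python
# Sketch of the straggler-characterization step: correlate per-iteration
# runtimes with host-level metrics. Column names and the slowdown threshold
# are assumptions made for illustration.
import pandas as pd

def characterize_stragglers(trace: pd.DataFrame, slowdown_factor: float = 1.5):
    """`trace` has one row per (worker, iteration) with columns
    'iteration', 'iter_time_s', 'cpu_util', 'nic_mbps'."""
    # A worker is a straggler if its iteration ran much slower than the
    # median worker in the same iteration.
    median = trace.groupby("iteration")["iter_time_s"].transform("median")
    trace = trace.assign(is_straggler=trace["iter_time_s"] > slowdown_factor * median)
    # How strongly do CPU and NIC pressure track iteration time?
    corr = trace[["iter_time_s", "cpu_util", "nic_mbps"]].corr()["iter_time_s"]
    straggler_rate = trace["is_straggler"].mean()
    return straggler_rate, corr
```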
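The group-sync idea can be sketched with PyTorch distributed process groups. The contiguous rank grouping and the schedule below (intra-group gradient averaging every step, global parameter averaging every `global_every` steps) are illustrative assumptions, not STAR's exact update rule.

```python
# Minimal sketch of group-based synchronization with PyTorch process groups.
# Grouping policy and sync schedule are illustrative, not STAR's algorithm.
import torch.distributed as dist

def build_group(world_size: int, group_size: int):
    """Every rank must call this with identical arguments (new_group is a
    collective). Returns the process group containing the calling rank."""
    my_rank = dist.get_rank()
    my_group, my_size = None, None
    for start in range(0, world_size, group_size):
        ranks = list(range(start, min(start + group_size, world_size)))
        g = dist.new_group(ranks=ranks)
        if my_rank in ranks:
            my_group, my_size = g, len(ranks)
    return my_group, my_size

def group_sync_step(model, my_group, my_size, step: int, global_every: int = 8):
    # Intra-group gradient averaging: a straggler in another group no longer
    # blocks this group's parameter update.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=my_group)
            p.grad.div_(my_size)
    # Periodic global parameter averaging keeps the groups from drifting apart.
    if step % global_every == 0:
        world = dist.get_world_size()
        for p in model.parameters():
            dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
            p.data.div_(world)
```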
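The two selectors could look roughly like this. The 80 % CPU threshold comes from the summary above; the candidate group sizes, feature set, and use of a ridge regression are assumptions made for illustration.

```python
# Sketch of the heuristic and learned mode selectors. Thresholds besides the
# 80% CPU rule, the feature set, and the regressor choice are assumptions.
from dataclasses import dataclass
import numpy as np
from sklearn.linear_model import Ridge

CANDIDATE_GROUP_SIZES = [2, 4, 8, 16]  # hypothetical sync modes

@dataclass
class ClusterSnapshot:
    cpu_util: float      # fraction of CPU in use, 0..1
    nic_util: float      # fraction of NIC bandwidth in use, 0..1
    num_workers: int
    model_size_mb: float

def heuristic_mode(snap: ClusterSnapshot, current_group_size: int) -> int:
    """Shrink groups when the CPU is saturated, grow them back with headroom."""
    if snap.cpu_util > 0.80 and current_group_size > min(CANDIDATE_GROUP_SIZES):
        return current_group_size // 2
    if snap.cpu_util < 0.50 and current_group_size < max(CANDIDATE_GROUP_SIZES):
        return current_group_size * 2
    return current_group_size

class LearnedModeSelector:
    """Lightweight regressor that predicts TTA for each candidate mode from
    historical traces, then picks the mode with the lowest prediction."""
    def __init__(self):
        self.model = Ridge(alpha=1.0)

    def fit(self, snapshots, group_sizes, observed_tta):
        X = [self._features(s, g) for s, g in zip(snapshots, group_sizes)]
        self.model.fit(np.array(X), np.array(observed_tta))

    def select(self, snap: ClusterSnapshot) -> int:
        preds = {g: self.model.predict([self._features(snap, g)])[0]
                 for g in CANDIDATE_GROUP_SIZES}
        return min(preds, key=preds.get)

    @staticmethod
    def _features(snap: ClusterSnapshot, group_size: int):
        return [snap.cpu_util, snap.nic_util, snap.num_workers,
                snap.model_size_mb, group_size]
```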
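Finally, a minimal sketch of resource-aware PS placement, assuming per-node CPU and bandwidth headroom are observable. The greedy headroom score and the per-PS resource costs are invented for illustration; the summary only states that STAR places PSs and throttles gradient traffic based on headroom.

```python
# Sketch of resource-aware parameter-server placement. The headroom metric,
# greedy scoring, and per-PS costs are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu_free: float   # idle cores
    bw_free: float    # free NIC bandwidth, Gbit/s

def place_parameter_servers(nodes, num_ps, cpu_per_ps=4.0, bw_per_ps=5.0):
    """Greedily place PS shards on the nodes with the most combined
    CPU/bandwidth headroom, skipping nodes that would be overloaded."""
    placement = []
    candidates = sorted(nodes,
                        key=lambda n: min(n.cpu_free / cpu_per_ps,
                                          n.bw_free / bw_per_ps),
                        reverse=True)
    for node in candidates:
        if len(placement) == num_ps:
            break
        if node.cpu_free >= cpu_per_ps and node.bw_free >= bw_per_ps:
            node.cpu_free -= cpu_per_ps   # reserve the headroom
            node.bw_free -= bw_per_ps
            placement.append(node.name)
    return placement  # may be shorter than num_ps if the cluster is saturated
```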
Results & Findings
| Architecture | Baseline (SSGD) TTA | STAR TTA | TTA Reduction | Accuracy Impact |
|---|---|---|---|---|
| Parameter-Server (PS) | 100 % (reference) | 16-52 % of baseline | 48-84 % | No loss (within 0.1 % of SSGD) |
| All-Reduce | 100 % (reference) | 30-49 % of baseline | 51-70 % | No loss (within 0.1 % of SSGD) |
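To read the table: the reduction column is relative to the SSGD baseline,

$$\text{TTA reduction} = 1 - \frac{\text{TTA}_{\text{STAR}}}{\text{TTA}_{\text{SSGD}}},$$

so, for example, a STAR TTA of 30 % of baseline corresponds to a 70 % reduction.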
- Straggler frequency dropped from ~15 % of iterations to < 3 % after STAR’s resource rebalancing.
- ASGD performed worse than SSGD in 70 % of the tested scenarios, confirming the authors’ hypothesis that higher resource consumption offsets any latency gains.
- The ML selector outperformed the heuristic by ~5 % in TTA reduction, while still being fast enough to run online.
Practical Implications
- For Cloud‑Hosted Training – Companies running large‑scale DL jobs on services like AWS, Azure, or GCP can plug STAR into their existing pipelines to shave days off training cycles without buying extra GPUs.
- Cost Savings – Faster TTA translates directly into lower compute-hour bills; a TTA reduction of roughly 50 % could halve the cost of a typical BERT pre-training run.
- Co‑Location Friendly – STAR’s CPU/bandwidth‑aware allocation means you can safely share nodes with other workloads (e.g., data preprocessing) without causing interference.
- Simplified Ops – The automatic mode selector removes the need for engineers to manually tune sync/async settings for each model or cluster size.
- Open-Source Integration – Since the code is publicly available, it can be integrated with popular DL frameworks (TensorFlow, PyTorch) via a thin wrapper, lowering the barrier to adoption (a hypothetical wrapper sketch follows).
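To make that integration point concrete, here is a purely hypothetical sketch of what such a thin wrapper might look like around a standard PyTorch training loop; the `star` package and every name in it are invented for illustration and are not the released codebase's actual API.

```python
# Hypothetical integration sketch. The `star` module, `star.init`, and
# `star.GroupSyncOptimizer` are invented names for illustration only;
# consult the released STAR codebase for its real interface.
import torch
import star  # hypothetical package name

def train(model, loader, epochs: int = 1):
    star.init()  # hypothetical: join the cluster, start CPU/NIC monitoring
    opt = star.GroupSyncOptimizer(                    # hypothetical wrapper
        torch.optim.SGD(model.parameters(), lr=0.1),  # around a plain optimizer
        mode="auto",            # let the heuristic/ML selector pick the sync mode
    )
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()  # hypothetical: group-aware gradient synchronization
```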
Limitations & Future Work
- Homogeneous GPU focus – Assumes identical GPU models; heterogeneous clusters (mix of V100, A100, etc.) may exhibit different straggler patterns.
- Static Resource Profiles – Scheduler relies on relatively stable CPU/bandwidth baselines; highly bursty workloads could still cause unexpected stalls.
- Scalability to thousands of nodes – Experiments capped at a few hundred GPUs; extending the group‑sync logic and ML selector to massive clusters remains an open challenge.
- Future directions include extending STAR to heterogeneous hardware, incorporating more sophisticated network‑aware grouping (e.g., topology‑aware), and exploring reinforcement‑learning‑based mode selection for continuously evolving workloads.
Authors
- Zeyu Zhang
- Haiying Shen
Paper Information
- arXiv ID: 2512.09685v1
- Categories: cs.DC
- Published: December 10, 2025