[Paper] Straggler Tolerant and Resilient DL Training on Homogeneous GPUs

Published: December 10, 2025 at 09:31 AM EST

Source: arXiv - 2512.09685v1

Overview

Deep‑learning practitioners have long relied on homogeneous GPU clusters to speed up model training, yet “stragglers” – slow workers that hold up the whole job – remain a hidden bottleneck. This paper uncovers why stragglers persist even in balanced GPU farms and introduces STAR (Straggler‑Tolerant And Resilient) – a system that dynamically picks the best synchronization strategy and reallocates CPU/bandwidth resources to keep training fast and accurate.

Key Contributions

  • Empirical diagnosis of stragglers in homogeneous GPU setups, showing that CPU and network bandwidth imbalances are the primary culprits.
  • Critical evaluation of the common mitigation of switching from synchronous SGD (SSGD) to asynchronous SGD (ASGD), revealing that ASGD can worsen time‑to‑accuracy (TTA) and even create more stragglers.
  • STAR system design:
    • New group‑based synchronization modes that let subsets of workers update parameters together.
    • A heuristic and a machine‑learning selector that automatically choose the optimal mode for any given workload.
    • Resource‑aware allocation that throttles parameter‑server (PS) placement and gradient traffic to avoid overloading CPUs and network links.
  • Trace‑driven evaluation on AWS demonstrating 48‑84 % (PS architecture) and 51‑70 % (all‑reduce architecture) reductions in TTA compared with state‑of‑the‑art baselines, while preserving final model accuracy.
  • Open‑source release of the STAR codebase, enabling immediate experimentation.

Methodology

  1. Benchmark Suite & Instrumentation – Ran a battery of popular DL workloads (e.g., ResNet‑50, BERT) on homogeneous GPU clusters in AWS, instrumenting CPU, GPU, and network metrics to pinpoint where delays originated.
  2. Straggler Characterization – Correlated per‑iteration runtimes with CPU utilization and NIC bandwidth to quantify how often and why workers lagged behind.
  3. Design of Synchronization Modes – Instead of classic “all‑workers‑sync” (SSGD) or fully asynchronous (ASGD), STAR defines group‑sync modes where workers are partitioned into logical groups that synchronize internally before a global update (a minimal sketch follows this list).
  4. Mode Selection Engine
    • Heuristic: uses simple rules (e.g., if CPU > 80 % → shrink group size).
    • ML Model: a lightweight regression trained on historical traces predicts TTA for each mode and picks the best (a toy heuristic‑plus‑regression selector is sketched after this list).
  5. Resource‑Aware Scheduler – When a job requests PS instances, STAR evaluates current CPU/bandwidth headroom and may relocate PSs or throttle gradient traffic to keep the overall system balanced (see the placement sketch after this list).
  6. Trace‑Driven Simulation – Real‑world traces from the AWS runs feed a simulator that evaluates STAR against baseline SSGD/ASGD under identical hardware conditions.
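
As a rough illustration of the group‑sync modes in step 3, the sketch below partitions workers into fixed‑size groups and averages gradients inside each group, with an occasional global all‑reduce to keep the groups aligned. It assumes PyTorch distributed is already initialized (e.g., launched via torchrun); the group size and global‑sync period are illustrative choices, not values from the paper.

```python
# Minimal group-sync sketch; assumes torch.distributed is initialized (e.g. via
# torchrun). Group size and the global-sync period are illustrative assumptions.
import torch.distributed as dist


def build_groups(world_size: int, group_size: int, my_rank: int):
    """Partition ranks into consecutive groups and return this rank's (ranks, group).

    Note: every rank must call dist.new_group for every group, even those it is not in.
    """
    mine = None
    for start in range(0, world_size, group_size):
        ranks = list(range(start, min(start + group_size, world_size)))
        group = dist.new_group(ranks=ranks)
        if my_rank in ranks:
            mine = (ranks, group)
    return mine


def sync_gradients(model, group_ranks, group, step: int, global_every: int = 10):
    """Average gradients inside the local group; periodically sync all workers."""
    for p in model.parameters():
        if p.grad is None:
            continue
        # Intra-group all-reduce: only workers in the same group wait on each
        # other, so a straggler stalls its own group rather than the whole job.
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=group)
        p.grad /= len(group_ranks)
        if step % global_every == 0:
            # Occasional global all-reduce keeps the groups from drifting apart
            # (with equal-size groups this recovers the global average).
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= dist.get_world_size()
```

Because each worker normally blocks only on the members of its own group, a straggler stalls a handful of peers rather than the entire job, which is the property the group‑sync modes trade on.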
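
The mode‑selection engine in step 4 can be pictured as two interchangeable policies behind one interface. The sketch below is a stand‑in rather than STAR's implementation: the 80 % threshold echoes the heuristic rule quoted above, while the feature set and the ridge‑regression model are assumptions chosen only to show the shape of the interface.

```python
# Hypothetical mode selector: the thresholds, feature set, and regression model
# are illustrative assumptions, not STAR's actual selector.
from dataclasses import dataclass

import numpy as np
from sklearn.linear_model import Ridge


@dataclass
class ClusterStats:
    cpu_util: float    # average CPU utilization, 0..1
    nic_util: float    # average NIC bandwidth utilization, 0..1
    iter_time: float   # mean per-iteration time in seconds


def heuristic_group_size(stats: ClusterStats, current: int) -> int:
    """Rule of thumb from the summary: heavy CPU (or NIC) load -> smaller sync groups."""
    if stats.cpu_util > 0.8 or stats.nic_util > 0.8:
        return max(2, current // 2)
    return current


class TTASelector:
    """Lightweight regressor over historical traces: (cluster stats, mode) -> predicted TTA."""

    def __init__(self):
        self.model = Ridge(alpha=1.0)

    def fit(self, features: np.ndarray, tta_seconds: np.ndarray) -> None:
        # Each feature row: [cpu_util, nic_util, iter_time, group_size].
        self.model.fit(features, tta_seconds)

    def pick_group_size(self, stats: ClusterStats, candidates) -> int:
        rows = np.array(
            [[stats.cpu_util, stats.nic_util, stats.iter_time, g] for g in candidates]
        )
        # Choose the candidate mode with the lowest predicted time-to-accuracy.
        return candidates[int(np.argmin(self.model.predict(rows)))]
```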
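
Step 5's resource‑aware scheduling can likewise be reduced to a headroom check: place a parameter server only where spare CPU and bandwidth can absorb its gradient traffic, and otherwise fall back to throttling. The node attributes and requirements below are invented for illustration.

```python
# Toy resource-aware PS placement: the node fields and headroom requirements
# are illustrative assumptions, not STAR's actual scheduler.
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_free_cores: float   # spare CPU, in cores
    bw_free_gbps: float     # spare network bandwidth, in Gb/s


def place_parameter_server(nodes, ps_cpu_cores: float, ps_traffic_gbps: float):
    """Pick the node with the most headroom that can still absorb the PS load."""
    feasible = [
        n for n in nodes
        if n.cpu_free_cores >= ps_cpu_cores and n.bw_free_gbps >= ps_traffic_gbps
    ]
    if not feasible:
        # No node has enough headroom: signal the caller to throttle gradient
        # traffic (e.g. push less often) instead of overloading a CPU or link.
        return None
    return max(feasible, key=lambda n: (n.cpu_free_cores, n.bw_free_gbps))
```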

Results & Findings

| Architecture | Baseline (SSGD) TTA | STAR TTA | Improvement | Accuracy Impact |
| --- | --- | --- | --- | --- |
| Parameter‑Server (PS) | 100 % (reference) | 16‑52 % of baseline | 48‑84 % faster | No loss (within 0.1 % of SSGD) |
| All‑Reduce | 100 % (reference) | 30‑49 % of baseline | 51‑70 % faster | No loss (within 0.1 % of SSGD) |
  • Straggler frequency dropped from ~15 % of iterations to < 3 % after STAR’s resource rebalancing.
  • ASGD performed worse than SSGD in 70 % of the tested scenarios, confirming the authors’ hypothesis that higher resource consumption offsets any latency gains.
  • The ML selector outperformed the heuristic by ~5 % in TTA reduction, while still being fast enough to run online.
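
To make the table's two percentage columns concrete, the quick check below assumes the usual definition of a TTA reduction, 1 − TTA_STAR / TTA_SSGD, so for example a 48 % reduction leaves 52 % of the baseline runtime.

```python
# Sanity check: a TTA reduction of r leaves (1 - r) of the baseline runtime.
baseline_tta_hours = 100.0  # illustrative baseline, not a number from the paper
for reduction in (0.48, 0.51, 0.70, 0.84):
    star_tta = baseline_tta_hours * (1 - reduction)
    print(f"{reduction:.0%} reduction -> {star_tta:.0f} h ({1 - reduction:.0%} of baseline)")
```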

Practical Implications

  • For Cloud‑Hosted Training – Companies running large‑scale DL jobs on services like AWS, Azure, or GCP can plug STAR into their existing pipelines to shave days off training cycles without buying extra GPUs.
  • Cost Savings – Faster TTA translates directly into lower compute‑hour bills; a reduction at the low end of the reported range (roughly 50 %) could halve the cost of a typical BERT pre‑training run.
  • Co‑Location Friendly – STAR’s CPU/bandwidth‑aware allocation means you can safely share nodes with other workloads (e.g., data preprocessing) without causing interference.
  • Simplified Ops – The automatic mode selector removes the need for engineers to manually tune sync/async settings for each model or cluster size.
  • Open‑Source Integration – Since the code is publicly available, it can be integrated with popular DL frameworks (TensorFlow, PyTorch) via a thin wrapper, making adoption painless (an illustrative wrapper shim follows below).
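
The released STAR code is the authoritative integration path; purely to show how thin such a wrapper can be, the sketch below routes a standard PyTorch training loop's gradient synchronization through a pluggable hook. The hook name and signature are hypothetical, not STAR's actual API.

```python
# Hypothetical integration shim: `sync_hook` stands in for whatever gradient
# synchronization primitive STAR exposes; its name and signature are assumptions.
def train_epoch(model, loader, optimizer, loss_fn, sync_hook, device="cuda"):
    model.train()
    for step, (inputs, targets) in enumerate(loader):
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        # The only STAR-specific line: delegate gradient averaging to the hook
        # (e.g. group-sync, global all-reduce, or an async push to a PS).
        sync_hook(model, step)
        optimizer.step()
```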

Limitations & Future Work

  • Homogeneous GPU focus – Assumes identical GPU models; heterogeneous clusters (mix of V100, A100, etc.) may exhibit different straggler patterns.
  • Static Resource Profiles – Scheduler relies on relatively stable CPU/bandwidth baselines; highly bursty workloads could still cause unexpected stalls.
  • Scalability to thousands of nodes – Experiments capped at a few hundred GPUs; extending the group‑sync logic and ML selector to massive clusters remains an open challenge.
  • Future directions include extending STAR to heterogeneous hardware, incorporating more sophisticated network‑aware grouping (e.g., topology‑aware), and exploring reinforcement‑learning‑based mode selection for continuously evolving workloads.

Authors

  • Zeyu Zhang
  • Haiying Shen

Paper Information

  • arXiv ID: 2512.09685v1
  • Categories: cs.DC
  • Published: December 10, 2025
  • PDF: Download PDF