[Paper] ResiHP: Taming LLM Training Failures with Dynamic Hybrid
Source: arXiv - 2605.06374v1
Overview
Training today’s large language models (LLMs) relies on hybrid parallelism, which splits work across thousands of GPUs. When a single GPU hiccups, the whole training job can stall, and natural variability in sequence lengths across the dataset makes such slowdowns hard to tell apart from normal iteration‑time jitter. ResiHP introduces a lightweight, workload‑aware failure detector and a dynamic scheduler that together keep training humming even when hardware glitches occur, delivering up to 4.39× higher throughput than existing resilient systems on a 256‑GPU cluster.
Key Contributions
- Accurate failure detection: A novel predictor separates genuine hardware failures from normal iteration‑time jitter caused by variable sequence lengths.
- Hybrid‑aware scheduling: Dynamically reshapes parallelism groups, model partitioning, and workload distribution on‑the‑fly to compensate for failed devices.
- Low‑overhead design: The detector runs online with negligible extra compute, making it practical for production‑scale training.
- Empirical validation: Experiments on a 256‑GPU cluster show 1.04–4.39× higher throughput than state‑of‑the‑art resilient training frameworks across a range of simulated failure patterns.
Methodology
Workload‑aware execution‑time predictor
- Models the expected iteration time as a function of the current batch’s sequence‑length distribution.
- Uses a lightweight regression model (e.g., linear regression or a shallow neural network) trained on a short warm‑up period.
- When the observed iteration time deviates beyond a statistically‑derived confidence interval, the system flags a potential failure.
Dynamic scheduler
- Parallelism group resizing: Shrinks or expands tensor‑model‑parallel and data‑parallel groups to bypass the faulty GPU(s).
- Model partition rebalancing: Re‑assigns model shards so that the remaining devices share the extra workload evenly.
- Workload‑aware batch slicing: Adjusts the mix of short and long sequences per device to keep iteration times balanced.
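As one way to picture workload‑aware batch slicing, the sketch below assigns sequences to devices with a greedy longest‑first heuristic, taking each device's relative speed into account. The cost model in `sequence_cost`, the `device_speeds` input, and the function names are assumptions made for illustration; the source does not specify which heuristic ResiHP uses.

```python
def sequence_cost(length):
    """Hypothetical per-sequence cost: a linear term plus a quadratic attention term."""
    return length + length * length / 4096


def slice_batch(seq_lens, device_speeds):
    """Greedy assignment: longest sequences first, each to the device
    that would finish it earliest given its relative speed."""
    shards = [[] for _ in device_speeds]
    finish = [0.0] * len(device_speeds)  # estimated finish time per device
    for length in sorted(seq_lens, reverse=True):
        cost = sequence_cost(length)
        dev = min(
            range(len(device_speeds)),
            key=lambda d: finish[d] + cost / device_speeds[d],
        )
        shards[dev].append(length)
        finish[dev] += cost / device_speeds[dev]
    return shards, finish


# Four devices; the last one is assumed to run at half speed.
seq_lens = [4096, 2048, 2048, 1024, 1024, 512, 512, 256]
device_speeds = [1.0, 1.0, 1.0, 0.5]
shards, finish = slice_batch(seq_lens, device_speeds)

for dev, (shard, t) in enumerate(zip(shards, finish)):
    print(f"device {dev}: sequences={shard} estimated_time={t:.0f}")
```

In this example the degraded device receives only short sequences, so its estimated finish time stays below that of the device holding the longest sequence, and the slow GPU no longer sets the iteration time.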
Integration loop
- The detector runs at every iteration, feeding its confidence score to the scheduler.
- The scheduler applies the minimal set of changes needed to restore target throughput, and training then continues without a global restart.
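The "minimal set of changes" idea can be illustrated with a small decision function. Everything here is hypothetical: the action names, their reconfiguration costs, the estimated recovered throughput fractions, and both thresholds are placeholders, not values from the paper. The sketch does nothing while the detector's confidence is low, and otherwise picks the cheapest action whose estimated recovery meets the target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    cost_s: float              # one-off reconfiguration cost (placeholder value)
    recovered_fraction: float  # estimated share of target throughput afterwards


# Placeholder candidates, roughly ordered from least to most disruptive.
ACTIONS = [
    Action("rebalance_batch", 0.1, 0.90),
    Action("repartition_model", 5.0, 0.96),
    Action("resize_groups", 30.0, 0.99),
]


def choose_action(confidence, confidence_threshold=0.9,
                  target_fraction=0.95, actions=ACTIONS):
    """Return the cheapest action expected to restore the target, or None."""
    if confidence < confidence_threshold:
        return None  # detector is not sure enough; keep training unchanged
    for action in sorted(actions, key=lambda a: a.cost_s):
        if action.recovered_fraction >= target_fraction:
            return action
    # Nothing reaches the target: fall back to the best available recovery.
    return max(actions, key=lambda a: a.recovered_fraction)


for confidence in (0.5, 0.95):
    chosen = choose_action(confidence)
    print(confidence, chosen.name if chosen else "no change")
```

In a real loop this function would be called once per iteration with the detector's latest score, and the chosen action would be applied in place rather than through a job restart.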
Results & Findings
| Scenario | Baseline (no resilience) | Prior resilient system | ResiHP (throughput) |
|---|---|---|---|
| Single‑GPU stall (5 % slowdown) | 0.78× | 0.92× | 1.73× |
| Multiple‑GPU stalls (2‑3 GPUs) | 0.55× | 0.68× | 2.31× |
| High sequence‑length variance | 0.62× | 0.81× | 1.04× |
| Mixed failures + variance | 0.48× | 0.66× | 4.39× |
- Detection accuracy: > 96 % true‑positive rate, < 2 % false‑positive rate, even when iteration times swing by ±30 % due to long sequences.
- Overhead: Predictor adds < 0.5 % extra runtime; scheduler reconfiguration costs are amortized over subsequent iterations.
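To make the amortization claim concrete, the following sketch computes how many iterations it takes for a reconfiguration to pay for itself. The function name and the example numbers (a 30 s reconfiguration, 2.0 s degraded iterations, 1.5 s recovered iterations) are hypothetical and are not measurements from the paper.

```python
import math


def break_even_iterations(reconfig_cost_s, degraded_iter_s, recovered_iter_s):
    """Iterations needed before time saved per iteration covers the one-off cost.

    Returns None when reconfiguring does not make iterations faster.
    """
    saving = degraded_iter_s - recovered_iter_s
    if saving <= 0:
        return None
    return math.ceil(reconfig_cost_s / saving)


# Hypothetical numbers: 30 s reconfiguration, 2.0 s -> 1.5 s per iteration.
print(break_even_iterations(30.0, 2.0, 1.5))  # 60
```

Under these assumed numbers the reconfiguration is repaid after 60 iterations, which is small next to the length of a typical training run.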
Practical Implications
- Higher GPU utilization: Data‑center operators can run LLM training jobs on larger clusters without fearing that a single‑node failure will cripple the whole job.
- Cost savings: Fewer job restarts and less need for over‑provisioning translate directly into lower cloud‑compute bills.
- Simplified ops: The system’s online detection means engineers don’t need to manually monitor logs or intervene when a GPU glitches.
- Portability: Because ResiHP works at the level of parallelism groups, it can be dropped into existing PyTorch/DeepSpeed or Megatron‑LM pipelines with minimal code changes.
Limitations & Future Work
- Scope of failures: ResiHP currently handles performance degradations (slow GPUs) and outright stalls; it does not yet address silent bit‑flips or corrupted model parameters.
- Scalability beyond 256 GPUs: Experiments stop at 256 GPUs; the authors note that predictor accuracy may degrade when the number of parallel groups grows very large, requiring hierarchical detection.
- Dataset‑specific tuning: The predictor must be retrained for datasets with dramatically different length distributions (e.g., code vs. prose).
- Future directions: Extending the framework to multi‑node, heterogeneous clusters (mix of GPUs/TPUs), integrating fault‑tolerant checkpointing, and exploring reinforcement‑learning‑based scheduling policies.
Authors
- Tenghui Ma
- Jihu Guo
- Wei Gao
- Sitian Lu
- Zhisheng Ye
- Hanjing Wang
- Dahua Lin
Paper Information
- arXiv ID: 2605.06374v1
- Categories: cs.DC
- Published: May 7, 2026