[Paper] ResiHP: Taming LLM Training Failures with Dynamic Hybrid

Published: May 7, 2026 at 10:52 AM EDT

Source: arXiv - 2605.06374v1

Overview

Training today’s large language models (LLMs) relies on hybrid parallelism, which splits the work across thousands of GPUs. When a single GPU hiccups, the whole training job can stall, and the problem is amplified by the natural variability in sequence lengths across the dataset. ResiHP introduces a lightweight, workload‑aware failure detector and a dynamic scheduler that together keep training running when hardware glitches occur, delivering up to 4.39× higher throughput than existing resilient systems on a 256‑GPU cluster.

Key Contributions

Accurate failure detection: A novel predictor separates genuine hardware failures from normal iteration‑time jitter caused by variable sequence lengths.
Hybrid‑aware scheduling: Dynamically reshapes parallelism groups, model partitioning, and workload distribution on‑the‑fly to compensate for failed devices.
Low‑overhead design: The detector runs online with negligible extra compute, making it practical for production‑scale training.
Empirical validation: Experiments on a 256‑GPU cluster show 1.04–4.39× higher throughput across a range of simulated failure patterns compared with the state‑of‑the‑art resilient training frameworks.

Methodology

Workload‑aware execution‑time predictor
- Models the expected iteration time as a function of the current batch’s sequence‑length distribution.
- Uses a lightweight regression model (e.g., a linear model or a shallow neural net) trained on a short warm‑up period.
- When the observed iteration time falls outside a statistically derived confidence interval around the prediction, the system flags a potential failure.
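
The detection idea above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a single hand-picked cost feature (`batch_cost`, with a hypothetical quadratic weight), an ordinary least-squares fit over warm-up iterations, and a `k`-sigma threshold on the residual. All names and constants are illustrative.

```python
from statistics import mean, stdev


def batch_cost(seq_lens, quad_weight=1e-3):
    """Hypothetical cost feature: token count plus an attention-like quadratic term."""
    return sum(seq_lens) + quad_weight * sum(n * n for n in seq_lens)


class IterationTimePredictor:
    """One-feature least-squares model of iteration time (illustrative only)."""

    def __init__(self, k=3.0):
        self.k = k  # assumed threshold width, in residual standard deviations
        self.slope = 0.0
        self.intercept = 0.0
        self.sigma = 0.0

    def fit(self, batches, times):
        """Fit on warm-up data: one sequence-length list and one time per iteration."""
        xs = [batch_cost(b) for b in batches]
        mx, my = mean(xs), mean(times)
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, times))
        self.slope = sxy / sxx
        self.intercept = my - self.slope * mx
        residuals = [y - (self.intercept + self.slope * x) for x, y in zip(xs, times)]
        self.sigma = stdev(residuals)

    def predict(self, seq_lens):
        return self.intercept + self.slope * batch_cost(seq_lens)

    def is_anomalous(self, seq_lens, observed_time):
        """Flag only iterations that are slower than the batch content explains."""
        return observed_time - self.predict(seq_lens) > self.k * self.sigma


# Synthetic warm-up: time grows with batch cost, plus small alternating jitter.
warmup = [[128] * 8, [256] * 8, [512] * 8, [1024] * 8, [128, 1024] * 4, [256, 512] * 4]
jitter = [0.001, -0.001, 0.001, -0.001, 0.001, -0.001]
times = [0.05 + 2e-5 * batch_cost(b) + j for b, j in zip(warmup, jitter)]

predictor = IterationTimePredictor()
predictor.fit(warmup, times)

long_batch, short_batch = [1024] * 8, [128] * 8
slow_time = predictor.predict(long_batch)
long_flag = predictor.is_anomalous(long_batch, slow_time)    # long batch, expected time
short_flag = predictor.is_anomalous(short_batch, slow_time)  # short batch, same time
```

The same wall-clock time is treated as normal for a batch of long sequences and as suspicious for a batch of short ones, which is the distinction a fixed timeout cannot make.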
Dynamic Scheduler
- Parallelism group resizing: Shrinks or expands tensor‑model‑parallel and data‑parallel groups to bypass the faulty GPU(s).
- Model partition rebalancing: Re‑assigns model shards so that the remaining devices share the extra workload evenly.
- Workload‑aware batch slicing: Adjusts the mix of short and long sequences per device to keep iteration times balanced.
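
As a rough illustration of workload-aware batch slicing, the sketch below uses a greedy longest-first heuristic, which is a stand-in technique and not necessarily the policy ResiHP uses. It assumes that per-sequence cost is proportional to length and that each device has a known relative speed; `slice_batch` and `speeds` are hypothetical names.

```python
def slice_batch(seq_lens, speeds):
    """Greedy longest-first assignment of sequences to devices.

    Each sequence goes to the device that would finish it earliest,
    where projected finish time = assigned tokens / relative speed.
    """
    loads = [0.0] * len(speeds)
    shards = [[] for _ in speeds]
    for n in sorted(seq_lens, reverse=True):
        cost = float(n)  # assumption: cost is proportional to sequence length
        target = min(range(len(speeds)), key=lambda i: (loads[i] + cost) / speeds[i])
        loads[target] += cost
        shards[target].append(n)
    return shards


# Two healthy devices and one degraded device running at half speed.
batch = [1024, 512, 512, 256, 256, 128, 128, 64]
shards = slice_batch(batch, speeds=[1.0, 1.0, 0.5])
tokens_per_device = [sum(s) for s in shards]  # [1152, 1152, 576]
```

In this example the degraded device receives half as many tokens as each healthy one, so all three are projected to finish at the same time.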
Integration loop
- The detector runs at every iteration, feeding its confidence score to the scheduler.
- The scheduler applies the minimal set of changes needed to restore target throughput, and training then continues without a global restart.
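
One way to picture this loop is as an escalation policy that prefers the least disruptive change. The sketch below is purely illustrative: it uses the ratio of observed to predicted iteration time as a stand-in for the detector's confidence score, and the thresholds, the `patience` debounce, and the ordering of actions are assumptions rather than details from the paper.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    REBALANCE_BATCH = "rebalance_batch"
    REPARTITION_MODEL = "repartition_model"
    RESIZE_GROUPS = "resize_groups"


def choose_action(slowdown):
    """Map a slowdown ratio (observed / predicted time) to the mildest action.

    The thresholds are hypothetical and would need tuning in practice.
    """
    if slowdown < 1.1:
        return Action.NONE
    if slowdown < 1.5:
        return Action.REBALANCE_BATCH
    if slowdown < 3.0:
        return Action.REPARTITION_MODEL
    return Action.RESIZE_GROUPS


def run_loop(slowdowns, patience=2):
    """Return the action taken at each iteration.

    An action fires only after `patience` consecutive flagged iterations,
    so a single slow iteration does not trigger a reconfiguration.
    """
    streak = 0
    actions = []
    for s in slowdowns:
        action = choose_action(s)
        streak = streak + 1 if action is not Action.NONE else 0
        actions.append(action if streak >= patience else Action.NONE)
    return actions


history = run_loop([1.0, 1.3, 1.0, 1.3, 1.3, 5.0])
```

Here the isolated slow iteration is ignored, the sustained mild slowdown leads to a batch rebalance, and the severe slowdown escalates to resizing the parallelism groups.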

Results & Findings

Scenario	Baseline (no resilience)	Prior resilient system	ResiHP (throughput)
Single‑GPU stall (5 % slowdown)	0.78×	0.92×	1.73×
Multiple‑GPU stalls (2‑3 GPUs)	0.55×	0.68×	2.31×
High sequence‑length variance	0.62×	0.81×	1.04×
Mixed failures + variance	0.48×	0.66×	4.39×

Detection accuracy: > 96 % true‑positive rate, < 2 % false‑positive rate, even when iteration times swing by ±30 % due to long sequences.
Overhead: Predictor adds < 0.5 % extra runtime; scheduler reconfiguration costs are amortized over subsequent iterations.
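
For readers who want to reproduce this kind of measurement, the two detection rates follow directly from confusion counts. The helper below is a small sketch; the counts are hypothetical and are not taken from the paper.

```python
def detection_rates(tp, fn, fp, tn):
    """Return (true-positive rate, false-positive rate) from confusion counts."""
    tpr = tp / (tp + fn)  # share of real failures that were flagged
    fpr = fp / (fp + tn)  # share of healthy iterations flagged by mistake
    return tpr, fpr


# Hypothetical run: 50 injected failures and 1000 healthy iterations.
tpr, fpr = detection_rates(tp=49, fn=1, fp=15, tn=985)
```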

Practical Implications

Higher GPU utilization: Data‑center operators can run LLM training jobs on larger clusters without fearing that a single‑node failure will cripple the whole job.
Cost savings: Fewer job restarts and less need for over‑provisioning translate directly into lower cloud‑compute bills.
Simplified ops: The system’s online detection means engineers don’t need to manually monitor logs or intervene when a GPU glitches.
Portability: Because ResiHP works at the level of parallelism groups, it can be dropped into existing PyTorch/DeepSpeed or Megatron‑LM pipelines with minimal code changes.

Limitations & Future Work

Scope of failures: ResiHP currently handles performance degradations (slow GPUs) and outright stalls; it does not yet address silent bit‑flips or corrupted model parameters.
Scalability beyond 256 GPUs: Experiments stop at 256 GPUs; the authors note that predictor accuracy may degrade when the number of parallel groups grows very large, requiring hierarchical detection.
Dataset‑specific tuning: The predictor must be retrained for datasets with dramatically different length distributions (e.g., code vs. prose).
Future directions: Extending the framework to multi‑node, heterogeneous clusters (mix of GPUs/TPUs), integrating fault‑tolerant checkpointing, and exploring reinforcement‑learning‑based scheduling policies.

Authors

Tenghui Ma
Jihu Guo
Wei Gao
Sitian Lu
Zhisheng Ye
Hanjing Wang
Dahua Lin

Paper Information

arXiv ID: 2605.06374v1
Categories: cs.DC
Published: May 7, 2026