[Paper] ResiHP: Taming LLM Training Failures with Dynamic Hybrid

Published: (May 7, 2026 at 10:52 AM EDT)

Source: arXiv - 2605.06374v1

Overview

Training today’s large language models (LLMs) relies on hybrid parallelism, which splits the work across thousands of GPUs. When a single GPU slows down or stalls, the whole training job can stall with it, and natural variability in sequence lengths across the dataset makes such failures harder to tell apart from ordinary slow iterations. ResiHP introduces a lightweight, workload‑aware failure detector and a dynamic scheduler that together keep training running through hardware glitches, delivering up to 4.39× higher throughput than existing resilient systems on a 256‑GPU cluster.

Key Contributions

  • Accurate failure detection: A novel predictor separates genuine hardware failures from normal iteration‑time jitter caused by variable sequence lengths.
  • Hybrid‑aware scheduling: Dynamically reshapes parallelism groups, model partitioning, and workload distribution on the fly to compensate for failed devices.
  • Low‑overhead design: The detector runs online with negligible extra compute, making it practical for production‑scale training.
  • Empirical validation: Experiments on a 256‑GPU cluster show 1.04–4.39× higher throughput across a range of simulated failure patterns compared with the state‑of‑the‑art resilient training frameworks.

Methodology

  1. Workload‑aware execution‑time predictor

    • Models the expected iteration time as a function of the current batch’s sequence‑length distribution.
    • Uses a lightweight regression (e.g., linear or shallow neural net) trained on a short warm‑up period.
    • When the observed iteration time falls outside a statistically derived confidence interval around the prediction, the system flags a potential failure.
  2. Dynamic scheduler

    • Parallelism group resizing: Shrinks or expands tensor‑model‑parallel and data‑parallel groups to bypass the faulty GPU(s).
    • Model partition rebalancing: Re‑assigns model shards so that the remaining devices share the extra workload evenly.
    • Workload‑aware batch slicing: Adjusts the mix of short and long sequences per device to keep iteration times balanced.
  3. Integration loop

    • The detector runs at every iteration, feeding its confidence score to the scheduler.
    • The scheduler applies the minimal set of changes needed to restore target throughput, and training then continues without a global restart.
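
The detection idea in step 1 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the single feature (total tokens per batch), the ordinary least‑squares fit, the one‑sided 3‑sigma threshold, and the names `batch_work` and `IterationTimeDetector` are all hypothetical, and the warm‑up timings are synthetic.

```python
from statistics import mean, stdev


def batch_work(seq_lens):
    """Hypothetical workload feature: total number of tokens in the batch."""
    return float(sum(seq_lens))


class IterationTimeDetector:
    """Fits time ~ slope * work + intercept on warm-up data, then flags outliers."""

    def __init__(self, z=3.0):
        self.z = z  # assumed band width, in residual standard deviations
        self.slope = 0.0
        self.intercept = 0.0
        self.sigma = 0.0

    def fit(self, warmup_batches, warmup_times):
        # Ordinary least squares with a single feature.
        xs = [batch_work(b) for b in warmup_batches]
        x_bar, y_bar = mean(xs), mean(warmup_times)
        sxx = sum((x - x_bar) ** 2 for x in xs)
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, warmup_times))
        self.slope = sxy / sxx
        self.intercept = y_bar - self.slope * x_bar
        # Spread of the warm-up residuals defines the tolerance band.
        residuals = [y - (self.slope * x + self.intercept)
                     for x, y in zip(xs, warmup_times)]
        self.sigma = stdev(residuals)

    def predict(self, seq_lens):
        return self.slope * batch_work(seq_lens) + self.intercept

    def check(self, seq_lens, observed_time):
        """Return (flagged, expected_time); only slowdowns are flagged."""
        expected = self.predict(seq_lens)
        return observed_time > expected + self.z * self.sigma, expected


# Synthetic warm-up data, made up for illustration.
warmup_batches = [[128] * 8, [256] * 8, [512] * 8, [1024] * 8,
                  [128, 1024] * 4, [64] * 8]
noise = [0.002, -0.002, 0.002, -0.002, 0.002, -0.002]
warmup_times = [0.05 + 1e-4 * sum(b) + n for b, n in zip(warmup_batches, noise)]

detector = IterationTimeDetector()
detector.fit(warmup_batches, warmup_times)

long_ok, _ = detector.check([2048] * 8, 1.69)  # long batch, slow in absolute terms
short_bad, _ = detector.check([128] * 8, 0.5)  # short batch that took far too long
print(long_ok, short_bad)
```

With this synthetic data the long batch is not flagged even though it takes about 1.7 s, because that time matches its predicted cost, while the short batch is flagged at 0.5 s. A fixed wall‑clock timeout would get both cases wrong, which is the motivation for conditioning the threshold on the batch's sequence lengths.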

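The workload‑aware batch slicing in step 2 can be pictured as a load‑balancing problem. The sketch below is not ResiHP's algorithm: it swaps in a simple greedy longest‑first heuristic, and the cost model `seq_cost`, the per‑device `speeds` factors, and the function name `slice_batch` are hypothetical choices made for illustration.

```python
def seq_cost(length):
    """Hypothetical cost model: a linear term plus a quadratic attention term."""
    return length + (length * length) / 4096.0


def slice_batch(seq_lens, speeds):
    """Greedily assign sequences to devices, longest first.

    speeds[i] is the assumed relative throughput of device i
    (1.0 = healthy, 0.5 = running at half speed).
    Returns (assignments, est_times), one entry per device.
    """
    assignments = [[] for _ in speeds]
    loads = [0.0] * len(speeds)
    for length in sorted(seq_lens, reverse=True):
        cost = seq_cost(length)
        # Pick the device that would finish earliest after taking this sequence.
        best = min(range(len(speeds)),
                   key=lambda d: (loads[d] + cost) / speeds[d])
        assignments[best].append(length)
        loads[best] += cost
    est_times = [load / speed for load, speed in zip(loads, speeds)]
    return assignments, est_times


seq_lens = [4096, 2048, 2048, 1024, 1024, 512, 512, 256]
speeds = [1.0, 1.0, 0.5]  # third device is assumed to be degraded
assignments, est_times = slice_batch(seq_lens, speeds)
for device, (seqs, t) in enumerate(zip(assignments, est_times)):
    print(device, seqs, t)
```

In this example the degraded device receives only short sequences, so its estimated finish time stays close to that of the healthy devices instead of dictating the iteration time for the whole group.
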
Results & Findings

Scenario                          | Baseline (no resilience) | Prior resilient system | ResiHP (throughput)
Single‑GPU stall (5 % slowdown)   | 0.78×                    | 0.92×                  | 1.73×
Multiple‑GPU stalls (2‑3 GPUs)    | 0.55×                    | 0.68×                  | 2.31×
High sequence‑length variance     | 0.62×                    | 0.81×                  | 1.04×
Mixed failures + variance         | 0.48×                    | 0.66×                  | 4.39×

  • Detection accuracy: > 96 % true‑positive rate, < 2 % false‑positive rate, even when iteration times swing by ±30 % due to long sequences.
  • Overhead: Predictor adds < 0.5 % extra runtime; scheduler reconfiguration costs are amortized over subsequent iterations.

Practical Implications

  • Higher GPU utilization: Data‑center operators can run LLM training jobs on larger clusters without fearing that a single‑node failure will cripple the whole job.
  • Cost savings: Fewer job restarts and less need for over‑provisioning translate directly into lower cloud‑compute bills.
  • Simplified ops: The system’s online detection means engineers don’t need to manually monitor logs or intervene when a GPU glitches.
  • Portability: Because ResiHP works at the level of parallelism groups, it can be dropped into existing PyTorch/DeepSpeed or Megatron‑LM pipelines with minimal code changes.

Limitations & Future Work

  • Scope of failures: ResiHP currently handles performance degradations (slow GPUs) and outright stalls; it does not yet address silent bit‑flips or corrupted model parameters.
  • Scalability beyond 256 GPUs: Experiments stop at 256 GPUs; the authors note that predictor accuracy may degrade when the number of parallel groups grows very large, requiring hierarchical detection.
  • Dataset‑specific tuning: The predictor must be retrained for datasets with dramatically different length distributions (e.g., code vs. prose).
  • Future directions: Extending the framework to multi‑node, heterogeneous clusters (mix of GPUs/TPUs), integrating fault‑tolerant checkpointing, and exploring reinforcement‑learning‑based scheduling policies.

Authors

  • Tenghui Ma
  • Jihu Guo
  • Wei Gao
  • Sitian Lu
  • Zhisheng Ye
  • Hanjing Wang
  • Dahua Lin

Paper Information

  • arXiv ID: 2605.06374v1
  • Categories: cs.DC
  • Published: May 7, 2026