[Paper] ReviveMoE: Fast Recovery for Hardware Failures in Large-Scale MoE LLM Inference Deployments
Source: arXiv - 2602.21140v1
Overview
Large language model (LLM) services are increasingly deployed across dozens or hundreds of machines, making hardware failures inevitable. The ReviveMoE paper introduces a lightweight recovery mechanism that restores service in under a second, avoiding the costly “cold restart” of reloading the whole model, and targets the Mixture‑of‑Experts (MoE) LLMs that dominate today’s high‑throughput inference workloads.
Key Contributions
- Fast, in‑place failure recovery for MoE‑based LLM inference that avoids reloading model weights or recompiling graphs.
- Unified support for both collocated (MoE and attention on the same node) and disaggregated (MoE separated from attention) serving architectures.
- Integration with production stack: built on Huawei Cloud’s xDeepServe serving platform and the XCCL communication library, demonstrating real‑world viability.
- Quantitative speedup: recovery latency reduced from tens of seconds (full restart) to the sub‑second range in large‑scale deployments.
- Minimal impact on request latency: the recovery path runs concurrently with normal inference, keeping tail‑latency guarantees intact.
Methodology
- State checkpointing – Critical runtime metadata (e.g., routing tables, expert load statistics, and communication contexts) is periodically snapshotted in a lock‑free manner.
- Hot‑swap expert replicas – When a node hosting a subset of experts fails, a standby replica on another machine is activated. The routing logic is updated on‑the‑fly using the latest checkpoint.
- Graceful request draining – In‑flight requests targeting the failed node are rerouted to healthy replicas; new requests are automatically directed to the standby set via an updated hash‑based router.
- Communication layer adaptation – XCCL’s fault‑tolerant collective primitives are leveraged to re‑establish all‑reduce and broadcast channels without tearing down the entire graph.
- Compatibility layer – For collocated deployments, the same mechanism simply bypasses the attention sub‑graph, while for disaggregated setups it re‑links the attention workers to the revived expert workers.
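The snapshotting and hot‑swap steps above can be sketched as copy‑on‑write publication of an immutable routing table: the writer builds a whole new snapshot and publishes it with a single reference swap, so readers on the inference path never take a lock. This is a minimal Python sketch of the idea only; the class, function, and node names (`RoutingSnapshot`, `hot_swap`, `node-a`) are illustrative assumptions, not APIs from the paper.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass(frozen=True)
class RoutingSnapshot:
    """Immutable view of routing state; readers never observe a partial update."""
    version: int
    expert_to_node: Dict[int, str]  # expert id -> node currently hosting it

class CheckpointStore:
    """Copy-on-write checkpointing: the writer builds a new snapshot and
    publishes it with one reference swap, so readers on the hot inference
    path proceed without locks."""

    def __init__(self, expert_to_node: Dict[int, str]):
        self._snap = RoutingSnapshot(0, dict(expert_to_node))

    def latest(self) -> RoutingSnapshot:
        return self._snap  # plain reference read, no lock

    def publish(self, expert_to_node: Dict[int, str]) -> RoutingSnapshot:
        new = RoutingSnapshot(self._snap.version + 1, dict(expert_to_node))
        self._snap = new  # single reference swap publishes the new state
        return new

def hot_swap(store: CheckpointStore, failed_node: str, standby_node: str) -> RoutingSnapshot:
    """On a node failure, remap every expert hosted on the failed node to
    its standby replica and publish the updated routing snapshot."""
    snap = store.latest()
    remapped = {expert: (standby_node if node == failed_node else node)
                for expert, node in snap.expert_to_node.items()}
    return store.publish(remapped)
```

Because each snapshot is immutable, in‑flight requests that captured an older version remain internally consistent while new requests pick up the revived routing.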
The approach is implemented as a thin middleware on top of xDeepServe, requiring no changes to the underlying model code or the training pipeline.
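The rerouting property of the hash‑based router can be illustrated with rendezvous (highest‑random‑weight) hashing, one common scheme with the behavior the paper describes: when a failed node is removed from the healthy set, only the requests that were mapped to it move. The paper does not specify its exact hash scheme, so this is a hedged sketch under that assumption.

```python
import hashlib

def _score(node: str, request_id: str) -> int:
    # Deterministic per (node, request) weight derived from a hash.
    return int(hashlib.sha256(f"{node}:{request_id}".encode()).hexdigest(), 16)

def route(request_id: str, healthy_nodes: list) -> str:
    """Rendezvous hashing: send the request to the healthy node with the
    highest score. Dropping a failed node from the set reroutes only the
    requests that were assigned to that node."""
    return max(healthy_nodes, key=lambda n: _score(n, request_id))
```

This minimal-disruption property is what lets draining reroute traffic from the failed node without reshuffling requests already pinned to healthy replicas.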
Results & Findings
| Metric | Traditional Restart | ReviveMoE (Hot‑Swap) |
|---|---|---|
| Mean recovery time | 12–45 s (depends on model size) | 0.8 s (≈ 2–7 % of restart time) |
| 99‑th‑percentile request latency during failure | Spike up to 5× normal latency | < 1.2× normal latency |
| Throughput loss | 30–60 % drop while reloading | < 5 % drop (mostly due to rerouting) |
| Memory overhead | None (weights fully reloaded on restart) | ~8 % extra for standby replicas |
The authors evaluated ReviveMoE on a 128‑GPU MoE LLM deployment (≈ 300 B parameters) serving tens of thousands of requests per second. Across simulated node failures, the system maintained SLA‑grade latency and recovered in under a second, confirming that the hot‑swap path scales linearly with the number of experts.
Practical Implications
- SLA‑level reliability: Cloud providers can guarantee sub‑second recovery for LLM inference services, a key differentiator for enterprise customers.
- Cost savings: Eliminating full model reloads cuts compute waste and reduces the need for over‑provisioned standby clusters.
- Simplified ops: Operators no longer need complex orchestration scripts to “drain and restart” MoE workers; the middleware handles it automatically.
- Developer ergonomics: Existing MoE models can be deployed unchanged—ReviveMoE works as a plug‑in to the serving stack, lowering the barrier to adopt fault‑tolerant inference.
- Edge & hybrid clouds: The same technique can be applied to disaggregated setups where MoE experts run on specialized accelerators (e.g., TPUs) while attention runs on CPUs/GPUs, enabling robust multi‑cloud deployments.
Limitations & Future Work
- Hardware dependency: The current prototype relies on Huawei’s XCCL library and xDeepServe; porting to other ecosystems (e.g., NVIDIA NCCL, Ray Serve) will require additional engineering.
- Standby replica cost: Maintaining hot‑standby expert copies incurs a modest memory overhead; future work could explore dynamic replica scaling based on failure probability.
- Scope limited to MoE: While MoE dominates large LLMs today, the approach does not directly address dense transformer deployments; extending the hot‑swap concept to generic attention layers is an open direction.
- Failure modes: The paper focuses on single‑node failures; handling correlated failures (e.g., rack‑level power loss) or network partitions remains future work.
Overall, ReviveMoE offers a pragmatic, production‑ready path to make massive MoE LLM inference services resilient, paving the way for more reliable AI‑as‑a‑service offerings.
Authors
- Haley Li
- Xinglu Wang
- Cong Feng
- Chunxu Zuo
- Yanan Wang
- Hei Lo
- Yufei Cui
- Bingji Wang
- Duo Cui
- Shuming Jing
- Yizhou Shan
- Ying Xiong
- Jiannan Wang
- Yong Zhang
- Zhenan Fan
Paper Information
- arXiv ID: 2602.21140v1
- Categories: cs.DC
- Published: February 24, 2026