[Paper] ReviveMoE: Fast Recovery for Hardware Failures in Large-Scale MoE LLM Inference Deployments

Published: February 24, 2026 at 12:39 PM EST
4 min read
Source: arXiv - 2602.21140v1

Overview

Large language model (LLM) services are increasingly deployed across dozens or hundreds of machines, making hardware failures inevitable. The paper ReviveMoE introduces a lightweight recovery mechanism that restores service instantly—without the costly “cold‑restart” of the whole model—targeting Mixture‑of‑Experts (MoE) LLMs that dominate today’s high‑throughput inference workloads.

Key Contributions

  • Fast, in‑place failure recovery for MoE‑based LLM inference that avoids reloading model weights or recompiling graphs.
  • Unified support for both collocated (MoE and attention on the same node) and disaggregated (MoE separated from attention) serving architectures.
  • Integration with production stack: built on Huawei Cloud’s xDeepServe serving platform and the XCCL communication library, demonstrating real‑world viability.
  • Quantitative speedup: recovery latency reduced from tens of seconds for a full restart to under a second in large‑scale deployments.
  • Minimal impact on request latency: the recovery path runs concurrently with normal inference, keeping tail‑latency guarantees intact.

Methodology

  1. State checkpointing – Critical runtime metadata (e.g., routing tables, expert load statistics, and communication contexts) is periodically snapshotted in a lock‑free manner.
  2. Hot‑swap expert replicas – When a node hosting a subset of experts fails, a standby replica on another machine is activated. The routing logic is updated on‑the‑fly using the latest checkpoint.
  3. Graceful request draining – In‑flight requests targeting the failed node are rerouted to healthy replicas; new requests are automatically directed to the standby set via an updated hash‑based router.
  4. Communication layer adaptation – XCCL’s fault‑tolerant collective primitives are leveraged to re‑establish all‑reduce and broadcast channels without tearing down the entire graph.
  5. Compatibility layer – For collocated deployments, the same mechanism simply bypasses the attention sub‑graph, while for disaggregated setups it re‑links the attention workers to the revived expert workers.
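Steps 2 and 3 above boil down to repointing the expert router at standby replicas the moment a failure is detected. A minimal sketch of that hot‑swap logic is below; the class and node names are hypothetical, not the paper's actual API:

```python
class ExpertRouter:
    """Toy hash-based expert router with hot-swap failover (sketch).

    Each expert id maps to a primary node; when a node fails, routing
    follows a standby link instead of reloading any model weights.
    """

    def __init__(self, primaries, standbys):
        self.primaries = dict(primaries)   # expert_id -> primary node
        self.standbys = dict(standbys)     # failed node -> standby node
        self.failed = set()

    def route(self, expert_id):
        node = self.primaries[expert_id]
        if node in self.failed:            # hot-swap: divert to standby
            node = self.standbys[node]
        return node

    def mark_failed(self, node):
        """Called by the failure detector; in-flight requests to `node`
        are re-issued through route(), and new requests never see it."""
        self.failed.add(node)


router = ExpertRouter(
    primaries={0: "node-a", 1: "node-b"},
    standbys={"node-a": "node-c", "node-b": "node-d"},
)
router.mark_failed("node-a")
print(router.route(0))  # expert 0 now served by the standby, node-c
print(router.route(1))  # healthy experts are unaffected
```

In the real system the routing update is driven by the latest lock‑free checkpoint rather than a static standby map, but the key property is the same: failover is a table update, not a model reload.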

The approach is implemented as a thin middleware on top of xDeepServe, requiring no changes to the underlying model code or the training pipeline.
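One way to realize the lock‑free snapshotting of step 1 is double buffering: writers build a fresh copy of the metadata off to the side and publish it with a single reference swap, so readers never take a lock and always observe a complete snapshot. A sketch under that assumption (names are hypothetical, not from the paper):

```python
import copy

class SnapshotStore:
    """Double-buffered metadata snapshot (hypothetical sketch).

    Writers stage changes on a private copy and publish it with one
    reference assignment; readers see either the old or the new
    snapshot in full, never a half-updated one.
    """

    def __init__(self, state):
        self._current = state                  # published snapshot

    def read(self):
        return self._current                   # one reference load, no lock

    def checkpoint(self, mutate):
        staged = copy.deepcopy(self._current)  # private working copy
        mutate(staged)                         # apply metadata updates
        self._current = staged                 # publish via reference swap


store = SnapshotStore({"routing": {0: "node-a"}, "expert_load": {0: 0}})
store.checkpoint(lambda s: s["routing"].update({0: "node-c"}))
print(store.read()["routing"][0])  # readers now see the updated route
```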

Results & Findings

| Metric | Traditional Restart | ReviveMoE (Hot‑Swap) |
| --- | --- | --- |
| Mean recovery time | 12–45 s (depends on model size) | 0.8 s (≈ 1 % of restart time) |
| 99th‑percentile request latency during failure | Spike up to 5× normal latency | < 1.2× normal latency |
| Throughput loss | 30–60 % drop while reloading | < 5 % drop (mostly due to rerouting) |
| Memory overhead | None (but full reload) | ~8 % extra for standby replicas |

The authors evaluated ReviveMoE on a 128‑GPU MoE LLM deployment (≈ 300 B parameters) serving tens of thousands of requests per second. Across simulated node failures, the system maintained SLA‑grade latency and recovered in under a second, confirming that the hot‑swap path scales linearly with the number of experts.

Practical Implications

  • SLA‑level reliability: Cloud providers can guarantee sub‑second recovery for LLM inference services, a key differentiator for enterprise customers.
  • Cost savings: Eliminating full model reloads cuts compute waste and reduces the need for over‑provisioned standby clusters.
  • Simplified ops: Operators no longer need complex orchestration scripts to “drain and restart” MoE workers; the middleware handles it automatically.
  • Developer ergonomics: Existing MoE models can be deployed unchanged—ReviveMoE works as a plug‑in to the serving stack, lowering the barrier to adopt fault‑tolerant inference.
  • Edge & hybrid clouds: The same technique can be applied to disaggregated setups where MoE experts run on specialized accelerators (e.g., TPUs) while attention runs on CPUs/GPUs, enabling robust multi‑cloud deployments.

Limitations & Future Work

  • Hardware dependency: The current prototype relies on Huawei’s XCCL library and xDeepServe; porting to other ecosystems (e.g., NVIDIA NCCL, Ray Serve) will require additional engineering.
  • Standby replica cost: Maintaining hot‑standby expert copies incurs a modest memory overhead; future work could explore dynamic replica scaling based on failure probability.
  • Scope limited to MoE: While MoE dominates large LLMs today, the approach does not directly address dense transformer deployments; extending the hot‑swap concept to generic attention layers is an open direction.
  • Failure modes: The paper focuses on single‑node failures; handling correlated failures (e.g., rack‑level power loss) or network partitions remains future work.

Overall, ReviveMoE offers a pragmatic, production‑ready path to make massive MoE LLM inference services resilient, paving the way for more reliable AI‑as‑a‑service offerings.

Authors

  • Haley Li
  • Xinglu Wang
  • Cong Feng
  • Chunxu Zuo
  • Yanan Wang
  • Hei Lo
  • Yufei Cui
  • Bingji Wang
  • Duo Cui
  • Shuming Jing
  • Yizhou Shan
  • Ying Xiong
  • Jiannan Wang
  • Yong Zhang
  • Zhenan Fan

Paper Information

  • arXiv ID: 2602.21140v1
  • Categories: cs.DC
  • Published: February 24, 2026