[Paper] ReviveMoE: Fast Recovery for Hardware Failures in Large-Scale MoE LLM Inference Deployments
Source: arXiv - 2602.21140v1
Overview
Large language model (LLM) services are increasingly deployed across dozens or hundreds of machines, making hardware failures inevitable. The ReviveMoE paper introduces a lightweight recovery mechanism that restores service in under a second, avoiding the costly “cold restart” of reloading the whole model, and targets the Mixture‑of‑Experts (MoE) LLMs that dominate today’s high‑throughput inference workloads.
Key Contributions
- Fast, in‑place failure recovery for MoE‑based LLM inference that avoids reloading model weights or recompiling graphs.
- Unified support for both collocated (MoE and attention on the same node) and disaggregated (MoE separated from attention) serving architectures.
- Integration with production stack: built on Huawei Cloud’s xDeepServe serving platform and the XCCL communication library, demonstrating real‑world viability.
- Quantitative speedup: recovery latency reduced from tens of seconds (full restart) to the sub‑second range in large‑scale deployments.
- Minimal impact on request latency: the recovery path runs concurrently with normal inference, keeping tail‑latency guarantees intact.
Methodology
- State checkpointing – Critical runtime metadata (e.g., routing tables, expert load statistics, and communication contexts) is periodically snapshotted in a lock‑free manner.
- Hot‑swap expert replicas – When a node hosting a subset of experts fails, a standby replica on another machine is activated. The routing logic is updated on‑the‑fly using the latest checkpoint.
- Graceful request draining – In‑flight requests targeting the failed node are rerouted to healthy replicas; new requests are automatically directed to the standby set via an updated hash‑based router.
- Communication layer adaptation – XCCL’s fault‑tolerant collective primitives are leveraged to re‑establish all‑reduce and broadcast channels without tearing down the entire graph.
- Compatibility layer – For collocated deployments, the same mechanism simply bypasses the attention sub‑graph, while for disaggregated setups it re‑links the attention workers to the revived expert workers.
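The snapshotting and hot‑swap steps above can be sketched as copy‑on‑write publication of an immutable routing table: the writer builds a whole new snapshot and publishes it with a single reference swap, so readers on the inference path never take a lock. This is a minimal Python sketch of the idea only; the class, function, and node names (`RoutingSnapshot`, `hot_swap`, `node-a`) are illustrative assumptions, not APIs from the paper.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass(frozen=True)
class RoutingSnapshot:
    """Immutable view of routing state; readers never observe a partial update."""
    version: int
    expert_to_node: Dict[int, str]  # expert id -> node currently hosting it

class CheckpointStore:
    """Copy-on-write checkpointing: the writer builds a new snapshot and
    publishes it with one reference swap, so readers on the hot inference
    path proceed without locks."""

    def __init__(self, expert_to_node: Dict[int, str]):
        self._snap = RoutingSnapshot(0, dict(expert_to_node))

    def latest(self) -> RoutingSnapshot:
        return self._snap  # plain reference read, no lock

    def publish(self, expert_to_node: Dict[int, str]) -> RoutingSnapshot:
        new = RoutingSnapshot(self._snap.version + 1, dict(expert_to_node))
        self._snap = new  # single reference swap publishes the new state
        return new

def hot_swap(store: CheckpointStore, failed_node: str, standby_node: str) -> RoutingSnapshot:
    """On a node failure, remap every expert hosted on the failed node to
    its standby replica and publish the updated routing snapshot."""
    snap = store.latest()
    remapped = {expert: (standby_node if node == failed_node else node)
                for expert, node in snap.expert_to_node.items()}
    return store.publish(remapped)
```

Because each snapshot is immutable, in‑flight requests that captured an older version remain internally consistent while new requests pick up the revived routing.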
The approach is implemented as a thin middleware on top of xDeepServe, requiring no changes to the underlying model code or the training pipeline.
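The rerouting property of the hash‑based router can be illustrated with rendezvous (highest‑random‑weight) hashing, one common scheme with the behavior the paper describes: when a failed node is removed from the healthy set, only the requests that were mapped to it move. The paper does not specify its exact hash scheme, so this is a hedged sketch under that assumption.

```python
import hashlib

def _score(node: str, request_id: str) -> int:
    # Deterministic per (node, request) weight derived from a hash.
    return int(hashlib.sha256(f"{node}:{request_id}".encode()).hexdigest(), 16)

def route(request_id: str, healthy_nodes: list) -> str:
    """Rendezvous hashing: send the request to the healthy node with the
    highest score. Dropping a failed node from the set reroutes only the
    requests that were assigned to that node."""
    return max(healthy_nodes, key=lambda n: _score(n, request_id))
```

This minimal-disruption property is what lets draining reroute traffic from the failed node without reshuffling requests already pinned to healthy replicas.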
Results & Findings
| Metric | Traditional Restart | ReviveMoE (Hot‑Swap) |
|---|---|---|
| Mean recovery time | 12–45 s (depends on model size) | 0.8 s (≈ 2–7 % of restart time) |
| 99‑th‑percentile request latency during failure | Spike up to 5× normal latency | < 1.2× normal latency |
| Throughput loss | 30–60 % drop while reloading | < 5 % drop (mostly due to rerouting) |
| Memory overhead | None (weights fully reloaded on restart) | ~8 % extra for standby replicas |
The authors evaluated ReviveMoE on a 128‑GPU MoE LLM deployment (≈ 300 B parameters) serving tens of thousands of requests per second. Across simulated node failures, the system maintained SLA‑grade latency and recovered in under a second, confirming that the hot‑swap path scales linearly with the number of experts.
Practical Implications
- SLA‑level reliability: Cloud providers can guarantee sub‑second recovery for LLM inference services, a key differentiator for enterprise customers.
- Cost savings: Eliminating full model reloads cuts compute waste and reduces the need for over‑provisioned standby clusters.
- Simplified ops: Operators no longer need complex orchestration scripts to “drain and restart” MoE workers; the middleware handles it automatically.
- Developer ergonomics: Existing MoE models can be deployed unchanged—ReviveMoE works as a plug‑in to the serving stack, lowering the barrier to adopt fault‑tolerant inference.
- Edge & hybrid clouds: The same technique can be applied to disaggregated setups where MoE experts run on specialized accelerators (e.g., TPUs) while attention runs on CPUs/GPUs, enabling robust multi‑cloud deployments.
Limitations & Future Work
- Hardware dependency: The current prototype relies on Huawei’s XCCL library and xDeepServe; porting to other ecosystems (e.g., NVIDIA NCCL, Ray Serve) will require additional engineering.
- Standby replica cost: Maintaining hot‑standby expert copies incurs a modest memory overhead; future work could explore dynamic replica scaling based on failure probability.
- Scope limited to MoE: While MoE dominates large LLMs today, the approach does not directly address dense transformer deployments; extending the hot‑swap concept to generic attention layers is an open direction.
- Failure modes: The paper focuses on single‑node failures; handling correlated failures (e.g., rack‑level power loss) or network partitions remains future work.
Overall, ReviveMoE offers a pragmatic, production‑ready path to make massive MoE LLM inference services resilient, paving the way for more reliable AI‑as‑a‑service offerings.
Authors
- Haley Li
- Xinglu Wang
- Cong Feng
- Chunxu Zuo
- Yanan Wang
- Hei Lo
- Yufei Cui
- Bingji Wang
- Duo Cui
- Shuming Jing
- Yizhou Shan
- Ying Xiong
- Jiannan Wang
- Yong Zhang
- Zhenan Fan
Paper Information
- arXiv ID: 2602.21140v1
- Categories: cs.DC
- Published: February 24, 2026