[Paper] Reliable and Resilient Collective Communication Library for LLM Training and Serving

Published: December 31, 2025 at 01:53 PM EST
4 min read
Source: arXiv - 2512.25059v1

Overview

Training and serving large language models (LLMs) now routinely involve dozens or even thousands of GPUs connected over high‑speed networks. A single network hiccup—such as a NIC failure or a transient link slowdown—can stall the whole job, costing 10–15 % of precious GPU time. The paper introduces R²CCL, a fault‑tolerant collective‑communication library that automatically reroutes traffic across multiple NICs, keeping training and inference pipelines alive with virtually no performance penalty.

Key Contributions

  • Lossless, low‑overhead failover: R²CCL leverages multi‑NIC hardware to migrate connections instantly when a NIC or link fails, avoiding costly job restarts.
  • Bandwidth‑aware load redistribution: The library continuously monitors link capacities and rebalances traffic to make the best use of the remaining healthy paths.
  • Resilient collective algorithms: Classic collective primitives (e.g., all‑reduce, broadcast) are re‑implemented to tolerate partial network partitions without sacrificing correctness.
  • Comprehensive evaluation: Experiments on two 8‑GPU H100 servers and large‑scale simulations (hundreds of GPUs) show < 1 % training overhead and < 3 % inference overhead under realistic failure patterns.
  • Significant speedup over prior art: R²CCL outperforms the closest open‑source solutions (AdapCC and DejaVu) by 12× and 47×, respectively, in fault‑recovery latency.
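The bandwidth‑aware load redistribution above can be sketched in a few lines: split a collective payload across NICs in proportion to each link's measured bandwidth, so a degraded link carries correspondingly less traffic. This is an illustrative Python sketch under assumed semantics; the function name and inputs are not part of R²CCL's actual API.

```python
def split_by_bandwidth(payload_bytes, link_gbps):
    """Split a collective payload across NICs in proportion to each
    link's currently measured bandwidth (illustrative sketch only;
    not the R2CCL API)."""
    total = sum(link_gbps)
    if total == 0:
        raise RuntimeError("no healthy links available")
    # Proportional share per link, rounded down; the remainder goes
    # to the fastest link so every byte is assigned exactly once.
    shares = [payload_bytes * g // total for g in link_gbps]
    shares[link_gbps.index(max(link_gbps))] += payload_bytes - sum(shares)
    return shares

# Two healthy 400 Gb/s links and one degraded 100 Gb/s link:
print(split_by_bandwidth(900, [400, 400, 100]))  # → [400, 400, 100]
```

The same proportional rule covers the failure case: a dead NIC simply reports zero bandwidth and its share moves to the survivors.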

Methodology

  1. Multi‑NIC exploitation: Modern GPU servers often ship with several network interfaces (e.g., dual‑port InfiniBand). R²CCL registers all NICs with the MPI‑style runtime and treats them as interchangeable endpoints.
  2. Rapid connection migration: When a NIC reports an error, the library instantly tears down the affected sockets and re‑establishes them on a spare NIC, preserving in‑flight messages via a small per‑connection buffer.
  3. Dynamic bandwidth profiling: A lightweight background thread measures throughput on each link. If a link degrades, R²CCL redistributes collective traffic (e.g., splits an all‑reduce tree) to avoid the bottleneck.
  4. Resilient collectives: The authors redesign collective algorithms to be partition‑tolerant: if a subset of participants becomes temporarily unreachable, the algorithm proceeds with the remaining nodes and later merges the missing contributions once the failed path recovers.
  5. Simulation framework: To test scalability, the authors built a fault‑injection simulator that mimics GPU‑cluster topologies, varying failure rates, and network jitter, enabling reproducible stress tests beyond the two‑node hardware setup.
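The rapid connection migration in step 2 hinges on one invariant: every message stays in a small per‑connection buffer until the receiver acknowledges it, so unacknowledged messages can be replayed on a spare NIC after failover. A minimal sketch of that invariant, with hypothetical names and an in‑memory stand‑in for the real transport:

```python
from collections import deque

class ResilientConnection:
    """Illustrative sketch of NIC failover with an in-flight replay
    buffer (hypothetical names; not the R2CCL implementation)."""

    def __init__(self, nics):
        self.nics = deque(nics)       # healthy NICs, active one first
        self.inflight = deque()       # sent but not yet acknowledged
        self.delivered = []           # stand-in for the real transport

    def send(self, msg):
        self.inflight.append(msg)     # buffer before putting on the wire
        self.delivered.append((self.nics[0], msg))

    def ack(self, count):
        for _ in range(count):        # receiver confirmed delivery
            self.inflight.popleft()

    def on_nic_failure(self):
        self.nics.popleft()           # drop the failed NIC...
        if not self.nics:
            raise RuntimeError("all NICs failed; job must abort")
        # ...then re-establish on a spare and replay every message that
        # was never acknowledged, so no in-flight data is lost.
        for msg in list(self.inflight):
            self.delivered.append((self.nics[0], msg))

conn = ResilientConnection(["mlx5_0", "mlx5_1"])
conn.send("grad-chunk-0")
conn.send("grad-chunk-1")
conn.ack(1)                           # chunk 0 confirmed
conn.on_nic_failure()                 # chunk 1 replayed on mlx5_1
```

Because only unacknowledged messages are replayed, the buffer stays small, which is what keeps the reported recovery latency in the low milliseconds.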

Results & Findings

| Scenario | Training overhead | Inference overhead | Recovery latency (ms) |
|---|---|---|---|
| No fault (baseline) | 0 % | 0 % | n/a |
| Single NIC failure (R²CCL) | 0.8 % | 2.4 % | ≈ 12 |
| Single NIC failure (AdapCC) | 9.6 % | 15.2 % | 145 |
| Single NIC failure (DejaVu) | 38 % | 51 % | 560 |
  • Robustness: R²CCL kept training progress uninterrupted in > 99 % of simulated fault injections.
  • Scalability: In simulations of 256‑GPU clusters, the library’s overhead grew sub‑linearly, confirming that the extra bookkeeping does not become a bottleneck.
  • Resource efficiency: Because R²CCL reuses existing NICs rather than spawning extra processes or checkpointing the entire model, GPU memory and storage footprints remain unchanged.
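The robustness figure rests on the partition‑tolerant collectives described in the methodology: reduce over the ranks that are currently reachable, and fold in the stragglers' contributions once their path recovers. A toy Python sketch of that merge semantics (assumed from the paper's description, not its actual algorithm):

```python
def partition_tolerant_allreduce(contributions, reachable):
    """Sketch of the partition-tolerant idea (assumed semantics):
    reduce over currently reachable ranks now, and report which
    ranks still owe a contribution."""
    partial = sum(v for rank, v in contributions.items() if rank in reachable)
    missing = set(contributions) - set(reachable)
    return partial, missing

def merge_recovered(partial, contributions, recovered):
    """Fold late contributions into the earlier partial result."""
    return partial + sum(contributions[rank] for rank in recovered)

# Rank 2 is temporarily unreachable during the reduction:
grads = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
partial, missing = partition_tolerant_allreduce(grads, reachable={0, 1, 3})
total = merge_recovered(partial, grads, missing)  # 10.0, as if no fault
```

This only works for reductions whose operator is associative and commutative (sum, max, etc.), which is exactly the class of operators the standard collectives use.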

Practical Implications

  • Reduced cloud costs: Cloud providers charge per GPU‑hour; recovering the 10–15 % of GPU time otherwise lost to network faults translates directly into lower bills for LLM developers.
  • Higher SLA compliance: For inference services (e.g., chatbots), the ability to survive a NIC glitch without dropping requests improves latency guarantees and user experience.
  • Simplified ops: Engineers no longer need elaborate checkpoint‑and‑restart scripts for network failures; R²CCL handles recovery transparently, lowering operational complexity.
  • Hardware‑agnostic resilience: The approach works with any multi‑NIC server (InfiniBand, RoCE, Ethernet), making it a drop‑in upgrade for existing PyTorch/DeepSpeed pipelines.
  • Enables larger clusters: As clusters scale to thousands of GPUs, the probability of at least one network fault skyrockets; a library that mitigates that risk unlocks more aggressive scaling strategies.

Limitations & Future Work

  • Dependency on multiple NICs: Systems with a single network interface cannot benefit from R²CCL’s failover; the authors suggest exploring software‑based virtual NICs as a fallback.
  • Partial fault coverage: The current design assumes that at least one NIC per node remains functional; simultaneous multi‑NIC failures would still abort the job.
  • Integration depth: R²CCL is presented as a standalone library; tighter integration with popular frameworks (e.g., NCCL, Horovod) could reduce the learning curve.
  • Security considerations: Automatic reconnection across NICs may expose new attack surfaces; future work will harden the handshake protocol.

Authors

  • Wei Wang
  • Nengneng Yu
  • Sixian Xiong
  • Zaoxing Liu

Paper Information

  • arXiv ID: 2512.25059v1
  • Categories: cs.DC, cs.LG, cs.NI
  • Published: December 31, 2025