[Paper] Reliable and Resilient Collective Communication Library for LLM Training and Serving
Source: arXiv - 2512.25059v1
Overview
Training and serving large language models (LLMs) now routinely involve dozens or even thousands of GPUs connected over high‑speed networks. A single network hiccup, such as a NIC failure or a transient link slowdown, can stall the whole job, wasting an estimated 10–15 % of GPU time. The paper introduces R²CCL, a fault‑tolerant collective‑communication library that automatically reroutes traffic across multiple NICs, keeping training and inference pipelines alive with virtually no performance penalty.
Key Contributions
- Lossless, low‑overhead failover: R²CCL leverages multi‑NIC hardware to migrate connections instantly when a NIC or link fails, avoiding costly job restarts.
- Bandwidth‑aware load redistribution: The library continuously monitors link capacities and rebalances traffic to make the best use of the remaining healthy paths.
- Resilient collective algorithms: Classic collective primitives (e.g., all‑reduce, broadcast) are re‑implemented to tolerate partial network partitions without sacrificing correctness.
- Comprehensive evaluation: Experiments on two 8‑GPU H100 servers and large‑scale simulations (hundreds of GPUs) show < 1 % training overhead and < 3 % inference overhead under realistic failure patterns.
- Significant speedup over prior art: R²CCL outperforms the closest open‑source solutions (AdapCC and DejaVu) by 12× and 47×, respectively, in fault‑recovery latency.
Methodology
- Multi‑NIC exploitation: Modern GPU servers often ship with several network interfaces (e.g., dual‑port InfiniBand). R²CCL registers all NICs with the MPI‑style runtime and treats them as interchangeable endpoints.
- Rapid connection migration: When a NIC reports an error, the library instantly tears down the affected sockets and re‑establishes them on a spare NIC, preserving in‑flight messages via a small per‑connection buffer.
- Dynamic bandwidth profiling: A lightweight background thread measures throughput on each link. If a link degrades, R²CCL redistributes collective traffic (e.g., splits an all‑reduce tree) to avoid the bottleneck.
- Resilient collectives: The authors redesign collective algorithms to be partition‑tolerant: if a subset of participants becomes temporarily unreachable, the algorithm proceeds with the remaining nodes and later merges the missing contributions once the failed path recovers.
- Simulation framework: To test scalability, the authors built a fault‑injection simulator that mimics GPU‑cluster topologies, varying failure rates, and network jitter, enabling reproducible stress tests beyond the two‑node hardware setup.
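The connection-migration step above can be sketched in a few lines. This is a toy model of the idea only, not R²CCL's actual implementation: the class and NIC names (`ResilientConnection`, `mlx5_0`) are hypothetical, and real message delivery is replaced by an in-memory list standing in for the remote peer.

```python
from collections import deque

class ResilientConnection:
    """Toy model of NIC failover with a per-connection replay buffer
    (illustrative sketch of the paper's idea, not the R2CCL API)."""

    def __init__(self, nics):
        self.nics = deque(nics)    # healthy NICs; the head is active
        self.inflight = deque()    # sent but not yet acknowledged
        self.delivered = []        # stand-in for the remote peer

    def send(self, msg):
        self.inflight.append(msg)  # buffer before putting on the wire
        self._transmit(msg)

    def _transmit(self, msg):
        self.delivered.append((self.nics[0], msg))

    def ack(self, n=1):
        for _ in range(n):         # peer acknowledged the oldest n messages
            self.inflight.popleft()

    def fail_active_nic(self):
        self.nics.popleft()        # tear down sockets on the failed NIC
        if not self.nics:
            raise RuntimeError("no spare NIC: job would abort")
        # re-establish on a spare NIC and replay unacknowledged traffic
        for msg in list(self.inflight):
            self._transmit(msg)

conn = ResilientConnection(["mlx5_0", "mlx5_1"])
conn.send("grad-chunk-0")
conn.send("grad-chunk-1")
conn.ack(1)              # chunk 0 acknowledged; chunk 1 still in flight
conn.fail_active_nic()   # chunk 1 is replayed on the spare NIC mlx5_1
```

The key design point mirrored here is that only *unacknowledged* messages need buffering, which keeps the per-connection memory cost small.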
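The bandwidth-aware redistribution step can likewise be illustrated with a minimal helper that splits a collective's payload across NICs in proportion to measured link bandwidth. The function name and link labels are assumptions for illustration; R²CCL's real profiler and scheduler are more involved.

```python
def split_by_bandwidth(total_bytes, link_gbps):
    """Divide a payload across NICs proportionally to measured per-link
    bandwidth (hypothetical helper sketching the idea, not R2CCL code)."""
    cap = sum(link_gbps.values())
    shares = {nic: int(total_bytes * g / cap) for nic, g in link_gbps.items()}
    # assign any rounding remainder to the fastest link
    fastest = max(link_gbps, key=link_gbps.get)
    shares[fastest] += total_bytes - sum(shares.values())
    return shares

# equal links split evenly; a degraded link receives a smaller share
split_by_bandwidth(1000, {"nic0": 100, "nic1": 100})  # → {'nic0': 500, 'nic1': 500}
split_by_bandwidth(1000, {"nic0": 100, "nic1": 25})   # → {'nic0': 800, 'nic1': 200}
```

Re-running the split whenever the background profiler reports new throughput numbers is what lets traffic drain away from a slow link without tearing anything down.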
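Finally, the partition-tolerant collective idea, proceeding with reachable ranks and merging stragglers later, can be sketched for a sum all-reduce. The two hypothetical functions below show only the bookkeeping; a real implementation must also version the merged state so late contributions are applied exactly once.

```python
def partial_allreduce(contribs, reachable):
    """Sum contributions from currently reachable ranks; report which
    ranks still owe their data (sketch of the idea, not R2CCL code)."""
    partial = sum(v for rank, v in contribs.items() if rank in reachable)
    pending = set(contribs) - set(reachable)
    return partial, pending

def merge_late(partial, contribs, recovered):
    """Fold in contributions from ranks that became reachable again."""
    return partial + sum(contribs[r] for r in recovered)

contribs = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
partial, pending = partial_allreduce(contribs, reachable={0, 1, 3})
# partial == 7.0, pending == {2}; once rank 2's path recovers:
total = merge_late(partial, contribs, pending)  # → 10.0
```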
Results & Findings
| Scenario | Training overhead | Inference overhead | Recovery latency (ms) |
|---|---|---|---|
| No fault (baseline) | 0 % | 0 % | – |
| Single NIC failure (R²CCL) | 0.8 % | 2.4 % | ≈ 12 |
| Single NIC failure (AdapCC) | 9.6 % | 15.2 % | 145 |
| Single NIC failure (DejaVu) | 38 % | 51 % | 560 |
- Robustness: R²CCL kept training progress uninterrupted in > 99 % of simulated fault injections.
- Scalability: In simulations of 256‑GPU clusters, the library’s overhead grew sub‑linearly, confirming that the extra bookkeeping does not become a bottleneck.
- Resource efficiency: Because R²CCL reuses existing NICs rather than spawning extra processes or checkpointing the entire model, GPU memory and storage footprints remain unchanged.
Practical Implications
- Reduced cloud costs: Cloud providers charge per GPU‑hour; recovering the 10–15 % of GPU time otherwise lost to network faults translates directly into lower bills for LLM developers.
- Higher SLA compliance: For inference services (e.g., chatbots), the ability to survive a NIC glitch without dropping requests improves latency guarantees and user experience.
- Simplified ops: Engineers no longer need elaborate checkpoint‑and‑restart scripts for network failures; R²CCL handles recovery transparently, lowering operational complexity.
- Hardware‑agnostic resilience: The approach works with any multi‑NIC server (InfiniBand, RoCE, Ethernet), making it a drop‑in upgrade for existing PyTorch/DeepSpeed pipelines.
- Enables larger clusters: As clusters scale to thousands of GPUs, the probability of at least one network fault skyrockets; a library that mitigates that risk unlocks more aggressive scaling strategies.
Limitations & Future Work
- Dependency on multiple NICs: Systems with a single network interface cannot benefit from R²CCL’s failover; the authors suggest exploring software‑based virtual NICs as a fallback.
- Partial fault coverage: The current design assumes that at least one NIC per node remains functional; simultaneous multi‑NIC failures would still abort the job.
- Integration depth: R²CCL is presented as a standalone library; tighter integration with popular frameworks (e.g., NCCL, Horovod) could reduce the learning curve.
- Security considerations: Automatic reconnection across NICs may expose new attack surfaces; future work will harden the handshake protocol.
Authors
- Wei Wang
- Nengneng Yu
- Sixian Xiong
- Zaoxing Liu
Paper Information
- arXiv ID: 2512.25059v1
- Categories: cs.DC, cs.LG, cs.NI
- Published: December 31, 2025