[Paper] Reliable and Resilient Collective Communication Library for LLM Training and Serving
Source: arXiv - 2512.25059v1
Overview
Training and serving large language models (LLMs) now routinely involve dozens or even thousands of GPUs connected over high‑speed networks. A single network hiccup, such as a NIC failure or a transient link slowdown, can stall the whole job, wasting an estimated 10–15 % of GPU time. The paper introduces R²CCL, a fault‑tolerant collective‑communication library that automatically reroutes traffic across multiple NICs, keeping training and inference pipelines alive with virtually no performance penalty.
Key Contributions
- Lossless, low‑overhead failover: R²CCL leverages multi‑NIC hardware to migrate connections instantly when a NIC or link fails, avoiding costly job restarts.
- Bandwidth‑aware load redistribution: The library continuously monitors link capacities and rebalances traffic to make the best use of the remaining healthy paths.
- Resilient collective algorithms: Classic collective primitives (e.g., all‑reduce, broadcast) are re‑implemented to tolerate partial network partitions without sacrificing correctness.
- Comprehensive evaluation: Experiments on two 8‑GPU H100 servers and large‑scale simulations (hundreds of GPUs) show < 1 % training overhead and < 3 % inference overhead under realistic failure patterns.
- Significant speedup over prior art: R²CCL outperforms the closest open‑source solutions (AdapCC and DejaVu) by 12× and 47×, respectively, in fault‑recovery latency.
Methodology
- Multi‑NIC exploitation: Modern GPU servers often ship with several network interfaces (e.g., dual‑port InfiniBand). R²CCL registers all NICs with the MPI‑style runtime and treats them as interchangeable endpoints.
- Rapid connection migration: When a NIC reports an error, the library instantly tears down the affected sockets and re‑establishes them on a spare NIC, preserving in‑flight messages via a small per‑connection buffer.
- Dynamic bandwidth profiling: A lightweight background thread measures throughput on each link. If a link degrades, R²CCL redistributes collective traffic (e.g., splits an all‑reduce tree) to avoid the bottleneck.
- Resilient collectives: The authors redesign collective algorithms to be partition‑tolerant: if a subset of participants becomes temporarily unreachable, the algorithm proceeds with the remaining nodes and later merges the missing contributions once the failed path recovers.
- Simulation framework: To test scalability, the authors built a fault‑injection simulator that mimics GPU‑cluster topologies, varying failure rates, and network jitter, enabling reproducible stress tests beyond the two‑node hardware setup.
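The connection-migration step above can be sketched in a few lines. This is a toy model of the idea only, not R²CCL's actual implementation: the class and NIC names (`ResilientConnection`, `mlx5_0`) are hypothetical, and real message delivery is replaced by an in-memory list standing in for the remote peer.

```python
from collections import deque

class ResilientConnection:
    """Toy model of NIC failover with a per-connection replay buffer
    (illustrative sketch of the paper's idea, not the R2CCL API)."""

    def __init__(self, nics):
        self.nics = deque(nics)    # healthy NICs; the head is active
        self.inflight = deque()    # sent but not yet acknowledged
        self.delivered = []        # stand-in for the remote peer

    def send(self, msg):
        self.inflight.append(msg)  # buffer before putting on the wire
        self._transmit(msg)

    def _transmit(self, msg):
        self.delivered.append((self.nics[0], msg))

    def ack(self, n=1):
        for _ in range(n):         # peer acknowledged the oldest n messages
            self.inflight.popleft()

    def fail_active_nic(self):
        self.nics.popleft()        # tear down sockets on the failed NIC
        if not self.nics:
            raise RuntimeError("no spare NIC: job would abort")
        # re-establish on a spare NIC and replay unacknowledged traffic
        for msg in list(self.inflight):
            self._transmit(msg)

conn = ResilientConnection(["mlx5_0", "mlx5_1"])
conn.send("grad-chunk-0")
conn.send("grad-chunk-1")
conn.ack(1)              # chunk 0 acknowledged; chunk 1 still in flight
conn.fail_active_nic()   # chunk 1 is replayed on the spare NIC mlx5_1
```

The key design point mirrored here is that only *unacknowledged* messages need buffering, which keeps the per-connection memory cost small.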
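The bandwidth-aware redistribution step can likewise be illustrated with a minimal helper that splits a collective's payload across NICs in proportion to measured link bandwidth. The function name and link labels are assumptions for illustration; R²CCL's real profiler and scheduler are more involved.

```python
def split_by_bandwidth(total_bytes, link_gbps):
    """Divide a payload across NICs proportionally to measured per-link
    bandwidth (hypothetical helper sketching the idea, not R2CCL code)."""
    cap = sum(link_gbps.values())
    shares = {nic: int(total_bytes * g / cap) for nic, g in link_gbps.items()}
    # assign any rounding remainder to the fastest link
    fastest = max(link_gbps, key=link_gbps.get)
    shares[fastest] += total_bytes - sum(shares.values())
    return shares

# equal links split evenly; a degraded link receives a smaller share
split_by_bandwidth(1000, {"nic0": 100, "nic1": 100})  # → {'nic0': 500, 'nic1': 500}
split_by_bandwidth(1000, {"nic0": 100, "nic1": 25})   # → {'nic0': 800, 'nic1': 200}
```

Re-running the split whenever the background profiler reports new throughput numbers is what lets traffic drain away from a slow link without tearing anything down.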
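Finally, the partition-tolerant collective idea, proceeding with reachable ranks and merging stragglers later, can be sketched for a sum all-reduce. The two hypothetical functions below show only the bookkeeping; a real implementation must also version the merged state so late contributions are applied exactly once.

```python
def partial_allreduce(contribs, reachable):
    """Sum contributions from currently reachable ranks; report which
    ranks still owe their data (sketch of the idea, not R2CCL code)."""
    partial = sum(v for rank, v in contribs.items() if rank in reachable)
    pending = set(contribs) - set(reachable)
    return partial, pending

def merge_late(partial, contribs, recovered):
    """Fold in contributions from ranks that became reachable again."""
    return partial + sum(contribs[r] for r in recovered)

contribs = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
partial, pending = partial_allreduce(contribs, reachable={0, 1, 3})
# partial == 7.0, pending == {2}; once rank 2's path recovers:
total = merge_late(partial, contribs, pending)  # → 10.0
```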
Results & Findings
| Scenario | Training overhead | Inference overhead | Recovery latency (ms) |
|---|---|---|---|
| No fault (baseline) | 0 % | 0 % | – |
| Single NIC failure (R²CCL) | 0.8 % | 2.4 % | ≈ 12 |
| Single NIC failure (AdapCC) | 9.6 % | 15.2 % | 145 |
| Single NIC failure (DejaVu) | 38 % | 51 % | 560 |
- Robustness: R²CCL kept training progress uninterrupted in > 99 % of simulated fault injections.
- Scalability: In simulations of 256‑GPU clusters, the library’s overhead grew sub‑linearly, confirming that the extra bookkeeping does not become a bottleneck.
- Resource efficiency: Because R²CCL reuses existing NICs rather than spawning extra processes or checkpointing the entire model, GPU memory and storage footprints remain unchanged.
Practical Implications
- Reduced cloud costs: Cloud providers charge per GPU‑hour; recovering the 10–15 % of GPU time otherwise lost to network faults translates directly into lower bills for LLM developers.
- Higher SLA compliance: For inference services (e.g., chatbots), the ability to survive a NIC glitch without dropping requests improves latency guarantees and user experience.
- Simplified ops: Engineers no longer need elaborate checkpoint‑and‑restart scripts for network failures; R²CCL handles recovery transparently, lowering operational complexity.
- Hardware‑agnostic resilience: The approach works with any multi‑NIC server (InfiniBand, RoCE, Ethernet), making it a drop‑in upgrade for existing PyTorch/DeepSpeed pipelines.
- Enables larger clusters: As clusters scale to thousands of GPUs, the probability of at least one network fault skyrockets; a library that mitigates that risk unlocks more aggressive scaling strategies.
Limitations & Future Work
- Dependency on multiple NICs: Systems with a single network interface cannot benefit from R²CCL’s failover; the authors suggest exploring software‑based virtual NICs as a fallback.
- Partial fault coverage: The current design assumes that at least one NIC per node remains functional; simultaneous multi‑NIC failures would still abort the job.
- Integration depth: R²CCL is presented as a standalone library; tighter integration with popular frameworks (e.g., NCCL, Horovod) could reduce the learning curve.
- Security considerations: Automatic reconnection across NICs may expose new attack surfaces; future work will harden the handshake protocol.
Authors
- Wei Wang
- Nengneng Yu
- Sixian Xiong
- Zaoxing Liu
Paper Information
- arXiv ID: 2512.25059v1
- Categories: cs.DC, cs.LG, cs.NI
- Published: December 31, 2025