[Paper] Resilient AI Supercomputer Networking using MRC and SRv6

Published: May 5, 2026 at 06:40 PM EDT

Source: arXiv - 2605.04333v1

Overview

The paper presents a new networking stack designed to keep massive AI‑training clusters running smoothly even when the underlying fabric experiences congestion or failures. By combining a novel RDMA transport (MRC), a high‑radix multi‑plane Clos topology, and static SRv6 source‑routing, the authors show how to cut tail latency and avoid costly job restarts in clusters that span 100 K+ GPUs.

Key Contributions

MRC (Multipath RDMA Congestion‑aware) transport – an RDMA‑based protocol that spreads traffic over many parallel paths and dynamically balances load to eliminate flow collisions.
Multi‑plane Clos topology – a two‑tier network design that leverages high‑radix switches for both bandwidth and built‑in redundancy, enabling ultra‑large clusters without a single point of failure.
Static SRv6 source‑routing – pre‑computed IPv6 segment routing tables that let MRC automatically detour around failed links or switches without controller intervention.
Production validation – deployment and long‑term operation of the full stack in OpenAI’s and Microsoft’s largest training clusters, powering frontier language‑model pre‑training runs.
Quantitative evidence – measurements showing that the combined stack reduces tail latency and lets jobs survive network incidents that would previously have caused training to abort.
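To make the "two-tier, high-radix" point concrete, the following is a minimal sizing sketch using the textbook formulas for idealized, non-blocking folded-Clos fabrics. The function names are hypothetical and the numbers are generic arithmetic, not figures from the paper; how the deployed fabric reaches its effective radix is not stated in this summary.

```python
# Idealized, non-blocking folded-Clos sizing.
# Textbook formulas only; not a model of the paper's actual deployment.

def two_tier_clos(radix: int) -> dict:
    """Leaf-spine fabric: each leaf uses half its ports down, half up."""
    down_ports = radix // 2   # endpoint-facing ports per leaf
    leaves = radix            # each spine port connects to one leaf
    spines = radix // 2       # one spine per leaf uplink
    return {"endpoints": leaves * down_ports, "switches": leaves + spines}

def three_tier_fat_tree(radix: int) -> dict:
    """Classic k-ary fat-tree: k^3/4 endpoints, 5k^2/4 switches."""
    return {"endpoints": radix ** 3 // 4, "switches": 5 * radix ** 2 // 4}

def min_radix_two_tier(endpoints: int) -> int:
    """Smallest even radix whose two-tier fabric holds `endpoints`."""
    radix = 2
    while radix * radix // 2 < endpoints:
        radix += 2
    return radix

if __name__ == "__main__":
    for k in (64, 128, 512):
        print(k, two_tier_clos(k), three_tier_fat_tree(k))
    # Radix needed for a single two-tier fabric of 100,000 endpoints.
    print(min_radix_two_tier(100_000))
```

Under these assumptions a radix-64 switch yields 2,048 endpoints in two tiers, while 100,000 endpoints need a radix of at least 448, which is why the two-tier design depends on high-radix switches.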

Methodology

Design of MRC – The authors extended the standard RDMA verbs interface with a lightweight path‑selection engine. Each message is split into “sprays” that are sent simultaneously on a set of disjoint paths; acknowledgments feed back congestion signals, prompting the engine to shift traffic away from hot links.
Network topology construction – Using commercially available 64‑port (or higher) switches, they built a multi‑plane Clos fabric: multiple independent spine layers interconnect leaf switches, giving each leaf several physically disjoint routes to any other leaf.
Static SRv6 routing – Prior to deployment, the team computed a full set of segment‑routing headers that encode alternative detours for every possible single‑link or single‑switch failure. These headers are cached on the NICs, so when MRC detects a failure it simply swaps to the pre‑computed segment list.
Experimental evaluation – Real‑world workloads (BERT‑scale and GPT‑scale pre‑training jobs) were run on clusters of up to 120 K GPUs. The authors injected synthetic failures (link drops, switch reboots) and measured tail latency, job completion time, and the frequency of job restarts.
Comparison baseline – Results were compared against a conventional single‑path RDMA transport running over a traditional three‑tier fat‑tree network that relies on reactive routing (e.g., ECMP) and manual operator intervention.
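The path-selection behavior described in the MRC design step can be illustrated with a toy model. This is a sketch under stated assumptions, not the MRC wire protocol: the class `PathSprayer`, its parameters (`decrease`, `increase`, `floor`), and the multiplicative-decrease weighting rule are all hypothetical choices made for illustration.

```python
import random

class PathSprayer:
    """Toy congestion-aware packet sprayer (hypothetical, for illustration)."""

    def __init__(self, path_ids, decrease=0.5, increase=0.05, floor=0.01, seed=None):
        # Every path starts with equal weight.
        self.weights = {p: 1.0 for p in path_ids}
        self.decrease = decrease   # multiplicative cut on a congestion signal
        self.increase = increase   # additive recovery on a clean ACK
        self.floor = floor         # keep probing congested paths occasionally
        self.rng = random.Random(seed)

    def pick_path(self):
        """Choose one path at random, proportionally to its current weight."""
        paths = list(self.weights)
        w = [self.weights[p] for p in paths]
        return self.rng.choices(paths, weights=w, k=1)[0]

    def spray(self, num_packets):
        """Assign each packet of a message to a path: [(seq, path), ...]."""
        return [(seq, self.pick_path()) for seq in range(num_packets)]

    def on_ack(self, path, congested):
        """Update a path's weight from the congestion bit carried by an ACK."""
        w = self.weights[path]
        if congested:
            w = max(self.floor, w * self.decrease)
        else:
            w = min(1.0, w + self.increase)
        self.weights[path] = w

if __name__ == "__main__":
    sprayer = PathSprayer(range(8), seed=0)
    for _ in range(10):                # path 3 keeps reporting congestion
        sprayer.on_ack(3, congested=True)
    counts = {}
    for _, path in sprayer.spray(1000):
        counts[path] = counts.get(path, 0) + 1
    print(counts)                      # path 3 should carry far fewer packets
```

The floor keeps a trickle of traffic on a congested path so that recovery can be observed; a real transport would also need reordering-tolerant delivery at the receiver, which this sketch omits.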

Results & Findings

Metric	Baseline (fat‑tree)	MRC + SRv6 on multi‑plane Clos
99th‑percentile latency (per‑step)	2.8 ms	0.9 ms
Job‑level interruption rate (per 100 h)	4.3 %	0.2 %
Average training throughput (relative to baseline)	1.0×	1.35×
Time to recover from a single‑link failure	~30 s (manual)	< 2 s (automatic)

99th‑percentile per‑step latency dropped by about 68 % (2.8 ms to 0.9 ms), which the authors attribute to path spraying and dynamic load‑balancing.
Job interruptions fell dramatically; most injected failures were absorbed without any checkpoint rollback.
The static SRv6 tables added negligible overhead (≈ 5 µs per packet) while enabling automatic fail‑over in under 2 s.
The multi‑plane Clos design allowed the same number of GPUs to be connected with ≈ 30 % fewer switches compared to a traditional fat‑tree, reducing both capital cost and power consumption.
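As a quick check on the table, the relative improvements can be recomputed from the reported values. The helper name `pct_reduction` is just for illustration, and the recovery speedup is a lower bound because the table gives approximate times (~30 s versus < 2 s).

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before

p99_latency = pct_reduction(2.8, 0.9)      # ms, per-step 99th percentile
interruptions = pct_reduction(4.3, 0.2)    # % per 100 h
recovery_speedup = 30 / 2                  # ~30 s manual vs. < 2 s automatic

print(f"p99 latency reduction:   {p99_latency:.1f} %")      # 67.9 %
print(f"interruption reduction:  {interruptions:.1f} %")    # 95.3 %
print(f"recovery speedup:        >= {recovery_speedup:.0f}x")  # >= 15x
```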

Practical Implications

For AI infrastructure teams – adopting MRC and SRv6 can dramatically improve the reliability of large‑scale training pipelines, reducing the need for frequent checkpointing and the associated storage I/O load.
For cloud providers – the two‑tier multi‑plane Clos can be built with off‑the‑shelf high‑radix switches, offering a cost‑effective path to petabyte‑scale interconnects without the complexity of a full three‑tier fabric.
For developers of distributed training frameworks (e.g., PyTorch Distributed, DeepSpeed) – the transport is exposed via standard RDMA verbs, meaning existing NCCL‑based code can benefit with minimal changes.
For network operators – static SRv6 routing eliminates the need for fast‑reactive control‑plane updates during failures, simplifying operations and reducing the risk of routing bugs.
For performance‑sensitive services (e.g., real‑time inference clusters) – the low‑tail‑latency properties of MRC can help meet strict SLA requirements.

Limitations & Future Work

Static routing granularity – While SRv6 tables cover single‑link/switch failures, simultaneous multi‑failure scenarios may still require dynamic recomputation.
Scalability of path‑selection state – Maintaining per‑flow congestion metrics on NICs could become a bottleneck at extreme connection counts; the authors suggest hierarchical aggregation as a next step.
Hardware dependence – Full benefits require NICs that support custom RDMA verbs and SRv6 offload; older devices would fall back to the baseline behavior.
Evaluation on heterogeneous workloads – The study focused on synchronous data‑parallel training; extending the approach to model‑parallel or pipeline‑parallel schemes remains open.
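The static-routing limitation can be sketched as a lookup table: one precomputed detour per single failed element, and no answer when several failures overlap. The types `Route` and `StaticFailoverTable`, the element names, and the SIDs (taken from the 2001:db8::/32 documentation prefix) are hypothetical; the summary does not describe the actual NIC data structures.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set

@dataclass(frozen=True)
class Route:
    segments: tuple      # ordered SRv6 SIDs for the segment routing header
    elements: frozenset  # links and switches this route traverses

class StaticFailoverTable:
    """Primary route plus one precomputed detour per single failed element."""

    def __init__(self, primary: Route, detours: Dict[str, Route]):
        self.primary = primary
        self.detours = detours  # failed element -> precomputed detour

    def select(self, failed: Set[str]) -> Optional[Route]:
        hit = self.primary.elements & failed
        if not hit:
            return self.primary            # primary path is unaffected
        for element in sorted(hit):
            detour = self.detours.get(element)
            if detour is not None and not (detour.elements & failed):
                return detour              # swap to the cached segment list
        return None                        # not covered: needs recomputation

via_a = Route(("2001:db8:a::1", "2001:db8:ff::2"),
              frozenset({"leaf1-spineA", "spineA", "spineA-leaf2"}))
via_b = Route(("2001:db8:b::1", "2001:db8:ff::2"),
              frozenset({"leaf1-spineB", "spineB", "spineB-leaf2"}))
table = StaticFailoverTable(via_a, {e: via_b for e in via_a.elements})

print(table.select(set()).segments)                 # primary via spine A
print(table.select({"spineA"}).segments)            # detour via spine B
print(table.select({"spineA", "spineB"}))           # None: multi-failure
```

The last call shows the gap named above: when both the primary and its cached detour are hit, the static table has no entry and some form of dynamic recomputation would be needed.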

The authors plan to explore adaptive SRv6 updates driven by machine‑learning‑based failure prediction, and to open‑source a lightweight MRC library for broader community adoption.

Authors

Joao Araujo
Alex Chow
Mark Handley
Ryder Lewis
Christoph Paasch
Jitendra Padhye
Michael Papamichael
Greg Steinbrecher
Amin Tootoonchian
Lihua Yuan
S. Anantharamu
Abhishek Dosi
Mohit Garg
Mahdieh Ghazi
Torsten Hoefler
Deepal Jayasinghe
Jithin Jose
Abdul Kabbani
Guohan Lu
Yang Wang
K. Doddapaneni
Murali Garimella
Vipin Jain
Yanfang Le
H. Nagulapalli
S. Narayanan
Rong Pan
Rathina Sabesan
Raghava Sivaramu
Rip Sohan
Eric Davis
Dragos Dumitrescu
Mohan Kalkunte
Bhaswar Mitra
Guglielmo Morandin
Adrian Popa
Costin Raiciu
Eric Spada
John Spillane
Niranjan Vaidya
Aviv Barnea
Idan Burstein
Elazar Cohen
Yamin Friedman
Noam Katz
Masoud Moshref
Yuval Shpigelman
Shahaf Shuler
Shy Shyman
Sayantan Sur

Paper Information

arXiv ID: 2605.04333v1
Categories: cs.NI, cs.AI, cs.DC
Published: May 5, 2026
