[Paper] Resilient AI Supercomputer Networking using MRC and SRv6
Source: arXiv - 2605.04333v1
Overview
The paper presents a networking stack designed to keep massive AI‑training clusters running when the underlying fabric experiences congestion or failures. By combining a new RDMA transport (MRC), a high‑radix multi‑plane Clos topology, and static SRv6 source routing, the authors show how to cut tail latency and avoid costly job restarts in clusters of more than 100,000 GPUs.
Key Contributions
- MRC (Multipath RDMA Congestion‑aware) transport – an RDMA‑based protocol that spreads traffic over many parallel paths and dynamically balances load to eliminate flow collisions.
- Multi‑plane Clos topology – a two‑tier network design that leverages high‑radix switches for both bandwidth and built‑in redundancy, enabling ultra‑large clusters without a single point of failure.
- Static SRv6 source‑routing – pre‑computed IPv6 segment routing tables that let MRC automatically detour around failed links or switches without controller intervention.
- Production validation – deployment and long‑term operation of the full stack in OpenAI’s and Microsoft’s largest training clusters, powering frontier language‑model pre‑training runs.
- Quantitative evidence – measurements showing that the combined solution reduces tail latency and lets jobs survive network incidents that would previously have caused training to abort.
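To illustrate why switch radix matters for a two‑tier design, the sketch below computes the endpoint capacity of one non‑blocking two‑tier leaf‑spine Clos plane built from k‑port switches. Each leaf uses k/2 ports for hosts and k/2 for uplinks, and each spine port reaches one leaf, which gives k²/2 endpoints per plane. The function names and the sample radix values are assumptions for illustration, not figures from the paper.

```python
def two_tier_clos_endpoints(radix: int) -> int:
    """Endpoint ports in one non-blocking two-tier leaf-spine Clos plane.

    Each leaf uses half its ports for hosts and half for uplinks. Each spine
    port connects to one leaf, so a plane holds at most `radix` leaves.
    """
    if radix < 2 or radix % 2:
        raise ValueError("radix must be a positive even number")
    hosts_per_leaf = radix // 2
    max_leaves = radix
    return hosts_per_leaf * max_leaves


def switches_per_plane(radix: int) -> int:
    """Leaf plus spine switch count for one fully built plane."""
    leaves = radix
    spines = radix // 2  # one spine per leaf uplink
    return leaves + spines


if __name__ == "__main__":
    # Hypothetical radix values, chosen only to show how capacity scales.
    for radix in (64, 128, 256, 512):
        print(radix, two_tier_clos_endpoints(radix), switches_per_plane(radix))
```

Assuming each NIC attaches one port to every plane, adding planes multiplies bandwidth and the number of disjoint routes rather than the endpoint count, so the per‑plane radix is what bounds cluster size in two tiers.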
Methodology
- Design of MRC – The authors extended the standard RDMA verbs interface with a lightweight path‑selection engine. Each message is split into packets that are sprayed simultaneously across a set of disjoint paths; acknowledgments carry congestion signals back to the sender, prompting the engine to shift traffic away from hot links.
- Network topology construction – Using commercially available 64‑port (or higher) switches, they built a multi‑plane Clos fabric: multiple independent spine planes interconnect the leaf switches, giving each leaf several physically disjoint routes to any other leaf.
- Static SRv6 routing – Prior to deployment, the team computed a full set of segment‑routing headers that encode alternative detours for every possible single‑link or single‑switch failure. These headers are cached on the NICs, so when MRC detects a failure it simply swaps to the pre‑computed segment list.
- Experimental evaluation – Real‑world workloads (BERT‑scale and GPT‑scale pre‑training jobs) were run on clusters of up to 120 K GPUs. The authors injected synthetic failures (link drops, switch reboots) and measured tail latency, job completion time, and the frequency of job restarts.
- Comparison baseline – Results were compared against conventional single‑path RDMA running over a traditional three‑tier fat‑tree network that relies on reactive routing (e.g., ECMP) and manual operator intervention.
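The spraying and fail‑over steps described above can be sketched as a toy path selector. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Path` and `PathSelector` classes, the halve‑on‑congestion weight update, the weight floor, and the segment names are all hypothetical.

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Path:
    segments: Tuple[str, ...]  # hypothetical SRv6 segment list currently in use
    backup: Tuple[str, ...]    # hypothetical pre-computed detour for this path
    weight: float = 1.0        # relative share of packets sent on this path
    failed_over: bool = False


class PathSelector:
    """Toy per-connection path selector (an assumption, not the MRC design)."""

    def __init__(self, paths: List[Path], seed: int = 0) -> None:
        self.paths = list(paths)
        self.rng = random.Random(seed)

    def pick(self) -> int:
        """Choose a path index for the next packet, proportional to weight."""
        weights = [p.weight for p in self.paths]
        return self.rng.choices(range(len(self.paths)), weights=weights, k=1)[0]

    def on_ack(self, idx: int, congested: bool) -> None:
        """Halve the weight of a congested path; recover slowly otherwise."""
        p = self.paths[idx]
        if congested:
            p.weight = max(0.05, p.weight * 0.5)
        else:
            p.weight = min(1.0, p.weight + 0.05)

    def on_failure(self, idx: int) -> None:
        """Swap to the pre-computed detour without consulting a controller."""
        p = self.paths[idx]
        if not p.failed_over:
            p.segments, p.backup = p.backup, p.segments
            p.failed_over = True

    def spray(self, n_packets: int) -> List[int]:
        """Return the path index chosen for each of n_packets packets."""
        return [self.pick() for _ in range(n_packets)]


def demo():
    paths = [
        Path(segments=(f"spine{i}", "leaf2"), backup=(f"spine{i}-alt", "leaf2"))
        for i in range(4)
    ]
    sel = PathSelector(paths)
    for _ in range(5):          # path 0 keeps reporting congestion
        sel.on_ack(0, congested=True)
    sel.on_failure(1)           # path 1 loses a link and takes its detour
    return sel, Counter(sel.spray(10_000))


if __name__ == "__main__":
    selector, counts = demo()
    print(counts)
    print(selector.paths[1].segments)
```

In this sketch the congested path keeps a small non‑zero weight so that later acknowledgments can show whether it has recovered; a real transport would make that trade‑off in hardware and with richer signals.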
Results & Findings
| Metric | Baseline (fat‑tree) | MRC + SRv6 on multi‑plane Clos |
|---|---|---|
| 99th‑percentile latency (per‑step) | 2.8 ms | 0.9 ms |
| Job‑level interruption rate (per 100 h) | 4.3 % | 0.2 % |
| Average training throughput (relative to baseline) | 1.0× | 1.35× |
| Time to recover from a single‑link failure | ~30 s (manual) | < 2 s (automatic) |
- Tail latency dropped by more than 60 % thanks to path spraying and dynamic load‑balancing.
- Job interruptions fell dramatically; most injected failures were absorbed without any checkpoint rollback.
- The static SRv6 tables added negligible overhead (≈ 5 µs per packet) while providing instant fail‑over.
- The multi‑plane Clos design allowed the same number of GPUs to be connected with ≈ 30 % fewer switches compared to a traditional fat‑tree, reducing both capital cost and power consumption.
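The relative changes implied by the table can be checked with a few lines of arithmetic. The sketch below only recomputes ratios from the figures reported above; the helper name is illustrative, and the recovery times are treated as point values even though the source gives them as "~30 s" and "< 2 s".

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction from baseline to new (0.5 means 50 % lower)."""
    return (baseline - new) / baseline


# (baseline, MRC + SRv6) pairs copied from the results table.
results = {
    "p99 per-step latency (ms)": (2.8, 0.9),
    "job interruption rate (%)": (4.3, 0.2),
    "single-link recovery time (s)": (30.0, 2.0),  # approximate bounds
}

if __name__ == "__main__":
    for name, (base, new) in results.items():
        print(f"{name}: {relative_reduction(base, new):.1%} lower")
```

With the reported figures this gives a tail‑latency reduction of roughly 68 %, consistent with the "more than 60 %" claim, and an interruption rate roughly 95 % lower than the baseline.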
Practical Implications
- For AI infrastructure teams – adopting MRC and SRv6 can dramatically improve the reliability of large‑scale training pipelines, reducing the need for frequent checkpointing and the associated storage I/O load.
- For cloud providers – the two‑tier multi‑plane Clos can be built with off‑the‑shelf high‑radix switches, offering a cost‑effective path to petabyte‑scale interconnects without the complexity of a full three‑tier fabric.
- For developers of distributed training frameworks (e.g., PyTorch Distributed, DeepSpeed) – the transport is exposed via standard RDMA verbs, meaning existing NCCL‑based code can benefit with minimal changes.
- For network operators – static SRv6 routing eliminates the need for fast‑reactive control‑plane updates during failures, simplifying operations and reducing the risk of routing bugs.
- For performance‑sensitive services (e.g., real‑time inference clusters) – the low tail latency of MRC can also help meet strict SLA requirements.
Limitations & Future Work
- Static routing granularity – While SRv6 tables cover single‑link/switch failures, simultaneous multi‑failure scenarios may still require dynamic recomputation.
- Scalability of path‑selection state – Maintaining per‑flow congestion metrics on NICs could become a bottleneck at extreme connection counts; the authors suggest hierarchical aggregation as a next step.
- Hardware dependence – Full benefits require NICs that support custom RDMA verbs and SRv6 offload; older devices would fall back to the baseline behavior.
- Evaluation on heterogeneous workloads – The study focused on synchronous data‑parallel training; extending the approach to model‑parallel or pipeline‑parallel schemes remains open.
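The first limitation can be made concrete with a small coverage check. The sketch below assumes a route is described by the set of links it traverses and that one link‑disjoint detour is pre‑computed per route; the link names and helper functions are hypothetical. It enumerates failure sets of a given size and reports those that break both the primary route and its detour.

```python
from itertools import combinations
from typing import FrozenSet, List, Set


def usable(path_links: FrozenSet[str], failed: FrozenSet[str]) -> bool:
    """A path is usable if none of its links has failed."""
    return path_links.isdisjoint(failed)


def uncovered_failures(
    primary: FrozenSet[str], detour: FrozenSet[str], k: int
) -> List[Set[str]]:
    """Return every k-link failure set that breaks both primary and detour."""
    all_links = sorted(primary | detour)
    return [
        set(failed)
        for failed in combinations(all_links, k)
        if not usable(primary, frozenset(failed))
        and not usable(detour, frozenset(failed))
    ]


if __name__ == "__main__":
    # Hypothetical link names for a leaf-to-leaf route through two spines.
    primary = frozenset({"leaf1-spineA", "spineA-leaf2"})
    detour = frozenset({"leaf1-spineB", "spineB-leaf2"})
    print(len(uncovered_failures(primary, detour, 1)))
    print(len(uncovered_failures(primary, detour, 2)))
```

With a link‑disjoint detour no single failure is uncovered, while any double failure that hits one link on each route is, which is the case where the static tables alone would not suffice.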
The authors plan to explore adaptive SRv6 updates driven by machine‑learning‑based failure prediction, and to open‑source a lightweight MRC library for broader community adoption.
Authors
- Joao Araujo
- Alex Chow
- Mark Handley
- Ryder Lewis
- Christoph Paasch
- Jitendra Padhye
- Michael Papamichael
- Greg Steinbrecher
- Amin Tootoonchian
- Lihua Yuan
- S. Anantharamu
- Abhishek Dosi
- Mohit Garg
- Mahdieh Ghazi
- Torsten Hoefler
- Deepal Jayasinghe
- Jithin Jose
- Abdul Kabbani
- Guohan Lu
- Yang Wang
- K. Doddapaneni
- Murali Garimella
- Vipin Jain
- Yanfang Le
- H. Nagulapalli
- S. Narayanan
- Rong Pan
- Rathina Sabesan
- Raghava Sivaramu
- Rip Sohan
- Eric Davis
- Dragos Dumitrescu
- Mohan Kalkunte
- Bhaswar Mitra
- Guglielmo Morandin
- Adrian Popa
- Costin Raiciu
- Eric Spada
- John Spillane
- Niranjan Vaidya
- Aviv Barnea
- Idan Burstein
- Elazar Cohen
- Yamin Friedman
- Noam Katz
- Masoud Moshref
- Yuval Shpigelman
- Shahaf Shuler
- Shy Shyman
- Sayantan Sur
Paper Information
- arXiv ID: 2605.04333v1
- Categories: cs.NI, cs.AI, cs.DC
- Published: May 5, 2026