[Paper] Scaling MPI Applications on Aurora
Source: arXiv - 2512.04291v1
Overview
The paper details how the Aurora exascale supercomputer, Argonne National Laboratory's newest flagship, was engineered to extract maximum performance from its Intel Xeon Max CPUs, Intel Data Center GPU Max accelerators, and the HPE Slingshot interconnect. By dissecting Aurora's network design and MPI scaling results, the authors show that the machine can run real-world scientific codes at unprecedented node counts, opening the door to breakthroughs in AI and high-performance simulation.
Key Contributions
- Comprehensive description of Aurora's hardware stack – six Intel Data Center GPU Max accelerators and two Xeon Max CPUs (with on-package HBM) per node.
- In‑depth analysis of the Slingshot dragonfly fabric – 85 k Cassini NICs and 5.6 k Rosetta switches, the largest Slingshot deployment to date.
- Systematic validation methodology – MPI micro-benchmark suites (OSU, Intel MPI Benchmarks) combined with end-to-end application runs.
- Performance results on flagship benchmarks – HPL, HPL-MxP, Graph500, and HPCG, demonstrating a leading Top500 ranking and record HPL-MxP throughput.
- Scalability case studies – large-scale runs of HACC (cosmology), AMR-Wind (computational fluid dynamics), LAMMPS (molecular dynamics), and the Fast Multipole Method (FMM) on up to roughly 10 k nodes.
- Insights into latency‑bandwidth trade‑offs that enable exascale‑level MPI communication on a dragonfly topology.
Methodology
The authors adopted a two‑pronged approach:
- Micro‑benchmarking – standard MPI latency and bandwidth tests (ping‑pong, all‑to‑all, gather/scatter) were executed at increasing node counts to map the fabric's raw characteristics (a minimal ping‑pong sketch appears below).
- Application‑level scaling – Real scientific codes were compiled with Intel MPI and run on up to ~10 k nodes, measuring time‑to‑solution, strong/weak scaling efficiency, and network‑traffic patterns.
All experiments were performed on a production Aurora partition, using the same software stack (Intel oneAPI, Slingshot drivers) to ensure results reflect real deployment conditions.
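To make the micro-benchmarking step concrete, here is a minimal, OSU-style ping-pong sketch that measures one-way latency and derived bandwidth between two ranks. It is an illustrative reconstruction, not the paper's benchmark code; the message-size range and iteration count are arbitrary assumptions.

```c
/* Minimal OSU-style ping-pong sketch (illustrative, not the paper's code).
 * Rank 0 and rank 1 bounce a message back and forth; rank 0 reports the
 * average one-way latency and the corresponding bandwidth per message size. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    if (nranks < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    const int iters = 1000;                      /* assumed iteration count */
    for (int bytes = 8; bytes <= (1 << 22); bytes *= 2) {
        char *buf = malloc((size_t)bytes);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - t0;

        if (rank == 0) {
            double one_way_s = elapsed / (2.0 * iters);   /* half the round trip */
            printf("%8d bytes  %10.2f us  %8.2f GB/s\n",
                   bytes, one_way_s * 1e6, bytes / one_way_s / 1e9);
        }
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
```

Launched with one rank per node (for example, `mpiexec -n 2 -ppn 1 ./pingpong` under Intel MPI), the same pattern underlies the inter-node latency and bandwidth figures reported below.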
Results & Findings
| Metric | Observation |
|---|---|
| MPI latency | Sub‑microsecond intra‑node, ~1.2 µs inter‑node on average; scales linearly up to 8 k nodes. |
| Bandwidth | Near‑line‑rate (≈ 200 GB/s) for large messages; sustained > 150 GB/s across the dragonfly fabric. |
| HPL‑MxP | Achieved 1.8 EFLOPS, making Aurora the fastest system on this benchmark (June 2024). |
| Graph500 | 1.2 × 10⁹ TEPS, confirming strong network‑driven graph traversal performance. |
| Application scaling | HACC weak‑scaled to 10 k nodes with > 80 % efficiency; AMR‑Wind and LAMMPS showed > 70 % strong‑scaling up to 4 k nodes; FMM maintained > 75 % efficiency at 6 k nodes. |
| Overall | The Slingshot fabric’s low latency and high bisection bandwidth eliminated typical MPI bottlenecks, enabling exascale‑class throughput for both dense linear algebra and irregular workloads. |
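For reference, the scaling-efficiency figures in the table follow the conventional definitions below (stated here to aid interpretation; the paper's exact baselines, such as the smallest node count N₀ used, are assumptions on our part):

```latex
% Strong scaling: fixed total problem size, baseline N_0 nodes vs. N nodes
E_{\mathrm{strong}}(N) = \frac{N_0 \, T(N_0)}{N \, T(N)}

% Weak scaling: per-node problem size held constant as the node count grows
E_{\mathrm{weak}}(N) = \frac{T(N_0)}{T(N)}
```

Here T(N) is the measured time-to-solution on N nodes; for example, HACC's > 80 % weak-scaling efficiency at 10 k nodes means its runtime grew by less than 25 % relative to the baseline run.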
Practical Implications
- For HPC developers: Aurora's demonstrated MPI scaling means applications can assume near-linear communication performance up to roughly ten thousand nodes, reducing the need for custom communication optimizations (see the sketch after this list).
- AI workloads: The combination of high‑bandwidth HBM‑enabled CPUs and six GPUs per node, linked by a low‑latency fabric, offers a compelling platform for distributed training of massive models.
- System architects: The dragonfly topology with Slingshot demonstrates a viable alternative to traditional fat‑tree networks, delivering comparable or better performance with fewer switches and lower power consumption.
- Software stack alignment: The success of Intel oneAPI + MPI on Aurora suggests that staying within the Intel ecosystem can simplify porting and tuning for exascale systems.
- Benchmarking standards: Aurora’s HPL‑MxP record sets a new baseline for future exascale machines, encouraging vendors to prioritize both compute density and network efficiency.
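As one hedged illustration of the coding style this implies, the sketch below uses a standard MPI-3 nonblocking collective to overlap a global reduction with independent local work; on a low-latency fabric such as Slingshot, this generic pattern is often sufficient in place of heavily customized communication schedules. It is not code from the paper, and the vector length and placeholder compute kernel are arbitrary.

```c
/* Generic MPI-3 overlap pattern (illustrative sketch, not from the paper):
 * start a nonblocking all-reduce, do independent local work, then wait. */
#include <mpi.h>
#include <stdio.h>

#define N 1048576   /* assumed local vector length */

/* Placeholder for application compute that does not depend on the reduction. */
static double local_work(const double *x, int n) {
    double acc = 0.0;
    for (int i = 0; i < n; i++) acc += x[i] * 1.0000001;
    return acc;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double sendbuf[N], recvbuf[N], other[N];
    for (int i = 0; i < N; i++) { sendbuf[i] = (double)rank; other[i] = (double)i; }

    /* Start the global reduction, overlap it with independent local work,
       then wait for completion before the result is needed. */
    MPI_Request req;
    MPI_Iallreduce(sendbuf, recvbuf, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
    double acc = local_work(other, N);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("recvbuf[0] = %.1f, local acc = %.3e\n", recvbuf[0], acc);

    MPI_Finalize();
    return 0;
}
```

The same overlap idea extends to other nonblocking collectives such as `MPI_Ialltoall`, which stress the fabric in the ways the micro-benchmarks above are designed to expose.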
Limitations & Future Work
- Network contention under mixed workloads – Occasional degradation when latency‑sensitive and bandwidth‑heavy jobs co‑run, indicating room for smarter traffic shaping.
- Scalability beyond 10 k nodes – Extrapolating to larger systems will require deeper analysis of routing algorithms and fault tolerance.
- Energy efficiency metrics – Power consumption of the Slingshot fabric was not quantified; future studies could explore performance‑per‑watt trade‑offs.
- Software portability – Heavy reliance on Intel‑specific tooling may limit immediate adoption on heterogeneous clusters; extending results to other MPI implementations is a planned next step.
Overall, the paper provides a concrete roadmap for developers aiming to harness exascale resources, showing that with the right hardware‑software co‑design, MPI applications can truly scale to the limits of today’s most powerful supercomputers.
Authors
- Huda Ibeid
- Anthony‑Trung Nguyen
- Aditya Nishtala
- Premanand Sakarda
- Larry Kaplan
- Nilakantan Mahadevan
- Michael Woodacre
- Victor Anisimov
- Kalyan Kumaran
- JaeHyuk Kwack
- Vitali Morozov
- Servesh Muralidharan
- Scott Parker
Paper Information
- arXiv ID: 2512.04291v1
- Categories: cs.DC
- Published: December 3, 2025