[Paper] Scaling MPI Applications on Aurora
Source: arXiv - 2512.04291v1
Overview
The paper details how the Aurora exascale supercomputer, Argonne National Laboratory's newest flagship, was engineered to extract maximum performance from its Intel Xeon Max CPUs, Intel Data Center GPU Max accelerators, and the HPE Slingshot interconnect. By dissecting Aurora's network design and MPI scaling results, the authors show that the machine can run real-world scientific codes at unprecedented node counts, opening the door to breakthroughs in AI and high-performance simulation.
Key Contributions
- Comprehensive description of Aurora's hardware stack – six Intel Data Center GPU Max accelerators and two Xeon Max CPUs (with on-package HBM) per node.
- In‑depth analysis of the Slingshot dragonfly fabric – 85 k Cassini NICs and 5.6 k Rosetta switches, the largest Slingshot deployment to date.
- Systematic validation methodology – MPI micro-benchmark suites (OSU, Intel MPI Benchmarks) combined with end-to-end application runs.
- Performance results on flagship benchmarks – HPL, HPL-MxP, Graph500, and HPCG, demonstrating a leading Top500 ranking and record HPL-MxP throughput.
- Scalability case studies – large-scale runs of HACC (cosmology), AMR-Wind (computational fluid dynamics), LAMMPS (molecular dynamics), and the Fast Multipole Method (FMM) on up to roughly 10 k nodes.
- Insights into latency‑bandwidth trade‑offs that enable exascale‑level MPI communication on a dragonfly topology.
Methodology
The authors adopted a two‑pronged approach:
- Micro‑benchmarking – standard MPI latency and bandwidth tests (ping‑pong, all‑to‑all, gather/scatter) were executed at increasing node counts to map the fabric's raw characteristics (a minimal ping‑pong sketch appears below).
- Application‑level scaling – Real scientific codes were compiled with Intel MPI and run on up to ~10 k nodes, measuring time‑to‑solution, strong/weak scaling efficiency, and network‑traffic patterns.
All experiments were performed on a production Aurora partition, using the same software stack (Intel oneAPI, Slingshot drivers) to ensure results reflect real deployment conditions.
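To make the micro-benchmarking step concrete, here is a minimal, OSU-style ping-pong sketch that measures one-way latency and derived bandwidth between two ranks. It is an illustrative reconstruction, not the paper's benchmark code; the message-size range and iteration count are arbitrary assumptions.

```c
/* Minimal OSU-style ping-pong sketch (illustrative, not the paper's code).
 * Rank 0 and rank 1 bounce a message back and forth; rank 0 reports the
 * average one-way latency and the corresponding bandwidth per message size. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    if (nranks < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    const int iters = 1000;                      /* assumed iteration count */
    for (int bytes = 8; bytes <= (1 << 22); bytes *= 2) {
        char *buf = malloc((size_t)bytes);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - t0;

        if (rank == 0) {
            double one_way_s = elapsed / (2.0 * iters);   /* half the round trip */
            printf("%8d bytes  %10.2f us  %8.2f GB/s\n",
                   bytes, one_way_s * 1e6, bytes / one_way_s / 1e9);
        }
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
```

Launched with one rank per node (for example, `mpiexec -n 2 -ppn 1 ./pingpong` under Intel MPI), the same pattern underlies the inter-node latency and bandwidth figures reported below.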
Results & Findings
| Metric | Observation |
|---|---|
| MPI latency | Sub‑microsecond intra‑node, ~1.2 µs inter‑node on average; scales linearly up to 8 k nodes. |
| Bandwidth | Near‑line‑rate (≈ 200 GB/s) for large messages; sustained > 150 GB/s across the dragonfly fabric. |
| HPL‑MxP | Achieved 1.8 EFLOPS, making Aurora the fastest system on this benchmark (June 2024). |
| Graph500 | 1.2 × 10⁹ TEPS, confirming strong network‑driven graph traversal performance. |
| Application scaling | HACC weak‑scaled to 10 k nodes with > 80 % efficiency; AMR‑Wind and LAMMPS showed > 70 % strong‑scaling up to 4 k nodes; FMM maintained > 75 % efficiency at 6 k nodes. |
| Overall | The Slingshot fabric’s low latency and high bisection bandwidth eliminated typical MPI bottlenecks, enabling exascale‑class throughput for both dense linear algebra and irregular workloads. |
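For reference, the scaling-efficiency figures in the table follow the conventional definitions below (stated here to aid interpretation; the paper's exact baselines, such as the smallest node count N₀ used, are assumptions on our part):

```latex
% Strong scaling: fixed total problem size, baseline N_0 nodes vs. N nodes
E_{\mathrm{strong}}(N) = \frac{N_0 \, T(N_0)}{N \, T(N)}

% Weak scaling: per-node problem size held constant as the node count grows
E_{\mathrm{weak}}(N) = \frac{T(N_0)}{T(N)}
```

Here T(N) is the measured time-to-solution on N nodes; for example, HACC's > 80 % weak-scaling efficiency at 10 k nodes means its runtime grew by less than 25 % relative to the baseline run.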
Practical Implications
- For HPC developers: Aurora's demonstrated MPI scaling means applications can assume near-linear communication performance up to roughly ten thousand nodes, reducing the need for custom communication optimizations (see the sketch after this list).
- AI workloads: The combination of high‑bandwidth HBM‑enabled CPUs and six GPUs per node, linked by a low‑latency fabric, offers a compelling platform for distributed training of massive models.
- System architects: The dragonfly topology with Slingshot demonstrates a viable alternative to traditional fat‑tree networks, delivering comparable or better performance with fewer switches and lower power consumption.
- Software stack alignment: The success of Intel oneAPI + MPI on Aurora suggests that staying within the Intel ecosystem can simplify porting and tuning for exascale systems.
- Benchmarking standards: Aurora’s HPL‑MxP record sets a new baseline for future exascale machines, encouraging vendors to prioritize both compute density and network efficiency.
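As one hedged illustration of the coding style this implies, the sketch below uses a standard MPI-3 nonblocking collective to overlap a global reduction with independent local work; on a low-latency fabric such as Slingshot, this generic pattern is often sufficient in place of heavily customized communication schedules. It is not code from the paper, and the vector length and placeholder compute kernel are arbitrary.

```c
/* Generic MPI-3 overlap pattern (illustrative sketch, not from the paper):
 * start a nonblocking all-reduce, do independent local work, then wait. */
#include <mpi.h>
#include <stdio.h>

#define N 1048576   /* assumed local vector length */

/* Placeholder for application compute that does not depend on the reduction. */
static double local_work(const double *x, int n) {
    double acc = 0.0;
    for (int i = 0; i < n; i++) acc += x[i] * 1.0000001;
    return acc;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double sendbuf[N], recvbuf[N], other[N];
    for (int i = 0; i < N; i++) { sendbuf[i] = (double)rank; other[i] = (double)i; }

    /* Start the global reduction, overlap it with independent local work,
       then wait for completion before the result is needed. */
    MPI_Request req;
    MPI_Iallreduce(sendbuf, recvbuf, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
    double acc = local_work(other, N);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("recvbuf[0] = %.1f, local acc = %.3e\n", recvbuf[0], acc);

    MPI_Finalize();
    return 0;
}
```

The same overlap idea extends to other nonblocking collectives such as `MPI_Ialltoall`, which stress the fabric in the ways the micro-benchmarks above are designed to expose.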
Limitations & Future Work
- Network contention under mixed workloads – Occasional degradation when latency‑sensitive and bandwidth‑heavy jobs co‑run, indicating room for smarter traffic shaping.
- Scalability beyond 10 k nodes – Extrapolating to larger systems will require deeper analysis of routing algorithms and fault tolerance.
- Energy efficiency metrics – Power consumption of the Slingshot fabric was not quantified; future studies could explore performance‑per‑watt trade‑offs.
- Software portability – Heavy reliance on Intel‑specific tooling may limit immediate adoption on heterogeneous clusters; extending results to other MPI implementations is a planned next step.
Overall, the paper provides a concrete roadmap for developers aiming to harness exascale resources, showing that with the right hardware‑software co‑design, MPI applications can truly scale to the limits of today’s most powerful supercomputers.
Authors
- Huda Ibeid
- Anthony‑Trung Nguyen
- Aditya Nishtala
- Premanand Sakarda
- Larry Kaplan
- Nilakantan Mahadevan
- Michael Woodacre
- Victor Anisimov
- Kalyan Kumaran
- JaeHyuk Kwack
- Vitali Morozov
- Servesh Muralidharan
- Scott Parker
Paper Information
- arXiv ID: 2512.04291v1
- Categories: cs.DC
- Published: December 3, 2025