[Paper] Co-Design and Evaluation of a CPU-Free MPI GPU Communication Abstraction and Implementation
Source: arXiv - 2602.15356v1
Overview
The paper presents a new MPI‑based programming interface that lets GPUs talk to each other without ever involving the CPU on the fast path. By co‑designing the API, the underlying network hardware (HPE Slingshot 11), and the software stack, the authors achieve noticeably lower latency and higher scalability for common GPU‑centric workloads such as halo‑exchange in stencil codes. This work is especially relevant for developers building large‑scale machine‑learning or HPC applications that already run on GPU‑only nodes.
Key Contributions
- CPU‑free MPI GPU communication abstraction – an easy‑to‑use API that extends standard MPI semantics while eliminating host‑side involvement during data movement.
- Integration with HPE Slingshot 11 NIC – leverages hardware offload capabilities (RDMA, GPUDirect) to achieve true zero‑CPU communication.
- Co‑design methodology – simultaneous development of the API, its implementation, and the supporting network driver/firmware to avoid mismatches that often plague “add‑on” solutions.
- Demonstration in Cabana/Kokkos – shows how the API naturally expresses gather/scatter halo‑exchange patterns used in many scientific codes.
- Performance evaluation on Frontier and Tuolumne – up to 50 % lower latency for medium‑size ping‑pong messages and 28 % speedup for a strong‑scaled halo‑exchange benchmark on 8,192 GPUs.
Methodology
- API Design – The authors extend MPI with a few new calls (e.g., MPIX_GpuSend, MPIX_GpuRecv) that accept device pointers directly. The semantics mirror familiar MPI point‑to‑point and collective operations, keeping the learning curve low.
- Hardware Leveraging – They exploit the Slingshot 11 NIC's GPUDirect RDMA and "GPU‑initiated" offload queues, allowing the GPU to issue network packets without CPU mediation.
- Software Stack Integration – Modifications were made to the MPICH library to route the new calls through the NIC’s offload path, and to the driver/firmware to expose the necessary primitives.
- Benchmark Integration – The API was plugged into the Cabana/Kokkos performance‑portability framework, replacing the traditional CPU‑mediated halo‑exchange with a GPU‑only version.
- Experimental Evaluation – Tests were run on two leadership‑class systems, Frontier (AMD MI250X GPUs) and Tuolumne (NVIDIA GPUs), using both synthetic ping‑pong micro‑benchmarks and a realistic halo‑exchange kernel.
Results & Findings
| Benchmark | System | Metric | Improvement |
|---|---|---|---|
| GPU ping‑pong (≈1 MiB) | Frontier | Latency | 50 % lower vs. Cray MPICH |
| Halo‑exchange (strong scaling) | Frontier (8,192 GPUs) | Time‑to‑solution | 28 % speedup |
| Ping‑pong & halo‑exchange | Tuolumne (NVIDIA GPUs) | Latency & scaling | Similar latency reductions; modest scaling gains |
Key takeaways:
- Removing the CPU from the communication path cuts the critical latency component that often dominates fine‑grained GPU kernels.
- The API’s “MPI‑like” interface means existing codebases can adopt it with minimal refactoring.
- Scaling to thousands of GPUs shows the approach remains robust under extreme concurrency, a crucial property for exascale workloads.
Practical Implications
- Simplified GPU‑centric code – Developers can write pure‑GPU communication code without juggling low‑level RDMA verbs or custom CUDA kernels.
- Performance‑critical ML pipelines – Distributed training that relies on all‑reduce or halo‑exchange (e.g., large convolutional nets) can benefit from lower latency and higher bandwidth.
- Legacy MPI applications – By swapping a few MPI calls for the new GPU‑aware variants, existing scientific codes can achieve immediate speedups on GPU‑only nodes.
- Future hardware compatibility – The co‑design model provides a blueprint for other NIC vendors (e.g., Mellanox, Intel) to expose similar offload paths, potentially standardizing CPU‑free GPU communication across platforms.
Limitations & Future Work
- Hardware dependence – The current implementation is tightly coupled to HPE Slingshot 11; porting to other NICs will require additional driver/firmware work.
- Message size sweet spot – Gains are most pronounced for medium‑sized messages; very small or extremely large transfers see diminishing returns.
- Collective operations – Only point‑to‑point and simple gather/scatter primitives were demonstrated; extending the API to full‑featured collectives (e.g., all‑reduce) remains an open task.
- Tooling & debugging – The new offload path introduces new failure modes that are not yet covered by standard MPI debuggers; richer diagnostics are needed.
Overall, the paper shows that a tightly integrated software‑hardware approach can finally deliver the long‑promised “CPU‑free” GPU communication, opening the door to faster, more scalable GPU‑only HPC and AI workloads.
Authors
- Patrick G. Bridges
- Derek Schafer
- Jack Lange
- James B. White
- Anthony Skjellum
- Evan Suggs
- Thomas Hines
- Purushotham Bangalore
- Matthew G. F. Dosanjh
- Whit Schonbein
Paper Information
- arXiv ID: 2602.15356v1
- Categories: cs.DC
- Published: February 17, 2026