[Paper] Co-Design and Evaluation of a CPU-Free MPI GPU Communication Abstraction and Implementation
Source: arXiv - 2602.15356v1
Overview
The paper presents a new MPI‑based programming interface that lets GPUs talk to each other without ever involving the CPU on the fast path. By co‑designing the API, the underlying network hardware (HPE Slingshot 11), and the software stack, the authors achieve noticeably lower latency and higher scalability for common GPU‑centric workloads such as halo‑exchange in stencil codes. This work is especially relevant for developers building large‑scale machine‑learning or HPC applications that already run on GPU‑only nodes.
Key Contributions
- CPU‑free MPI GPU communication abstraction – an easy‑to‑use API that extends standard MPI semantics while eliminating host‑side involvement during data movement.
- Integration with HPE Slingshot 11 NIC – leverages hardware offload capabilities (RDMA, GPUDirect) to achieve true zero‑CPU communication.
- Co‑design methodology – simultaneous development of the API, its implementation, and the supporting network driver/firmware to avoid mismatches that often plague “add‑on” solutions.
- Demonstration in Cabana/Kokkos – shows how the API naturally expresses gather/scatter halo‑exchange patterns used in many scientific codes.
- Performance evaluation on Frontier and Tuolumne – up to 50 % lower latency for medium‑size ping‑pong messages and 28 % speedup for a strong‑scaled halo‑exchange benchmark on 8,192 GPUs.
Methodology
- API Design – The authors extend MPI with a few new calls (e.g., MPIX_GpuSend, MPIX_GpuRecv) that accept device pointers directly. The semantics mirror familiar MPI point‑to‑point and collective operations, keeping the learning curve low.
- Hardware Leveraging – They exploit the Slingshot 11 NIC's GPUDirect RDMA and "GPU‑initiated" offload queues, allowing the GPU to issue network packets without CPU mediation.
- Software Stack Integration – Modifications were made to the MPICH library to route the new calls through the NIC’s offload path, and to the driver/firmware to expose the necessary primitives.
- Benchmark Integration – The API was plugged into the Cabana/Kokkos performance‑portability framework, replacing the traditional CPU‑mediated halo‑exchange with a GPU‑only version.
- Experimental Evaluation – Tests were run on two leadership‑class systems, Frontier (AMD MI250X GPUs) and Tuolumne (NVIDIA GPUs), using both synthetic ping‑pong micro‑benchmarks and a realistic halo‑exchange kernel.
Results & Findings
| Benchmark | System | Metric | Improvement |
|---|---|---|---|
| GPU ping‑pong (≈1 MiB) | Frontier | Latency | 50 % lower vs. Cray MPICH |
| Halo‑exchange (strong scaling) | Frontier (8,192 GPUs) | Time‑to‑solution | 28 % speedup |
| Ping‑pong & halo‑exchange | Tuolumne (NVIDIA GPUs) | Latency & scaling | Similar latency reductions; modest scaling gains |
Key takeaways:
- Removing the CPU from the communication path cuts the critical latency component that often dominates fine‑grained GPU kernels.
- The API’s “MPI‑like” interface means existing codebases can adopt it with minimal refactoring.
- Scaling to thousands of GPUs shows the approach remains robust under extreme concurrency, a crucial property for exascale workloads.
Practical Implications
- Simplified GPU‑centric code – Developers can write pure‑GPU communication code without juggling low‑level RDMA verbs or custom CUDA kernels.
- Performance‑critical ML pipelines – Distributed training that relies on all‑reduce or halo‑exchange (e.g., large convolutional nets) can benefit from lower latency and higher bandwidth.
- Legacy MPI applications – By swapping a few MPI calls for the new GPU‑aware variants, existing scientific codes can achieve immediate speedups on GPU‑only nodes.
- Future hardware compatibility – The co‑design model provides a blueprint for other NIC vendors (e.g., Mellanox, Intel) to expose similar offload paths, potentially standardizing CPU‑free GPU communication across platforms.
Limitations & Future Work
- Hardware dependence – The current implementation is tightly coupled to HPE Slingshot 11; porting to other NICs will require additional driver/firmware work.
- Message size sweet spot – Gains are most pronounced for medium‑sized messages; very small or extremely large transfers see diminishing returns.
- Collective operations – Only point‑to‑point and simple gather/scatter primitives were demonstrated; extending the API to full‑featured collectives (e.g., all‑reduce) remains an open task.
- Tooling & debugging – The new offload path introduces new failure modes that are not yet covered by standard MPI debuggers; richer diagnostics are needed.
Overall, the paper shows that a tightly integrated software‑hardware approach can finally deliver the long‑promised “CPU‑free” GPU communication, opening the door to faster, more scalable GPU‑only HPC and AI workloads.
Authors
- Patrick G. Bridges
- Derek Schafer
- Jack Lange
- James B. White
- Anthony Skjellum
- Evan Suggs
- Thomas Hines
- Purushotham Bangalore
- Matthew G. F. Dosanjh
- Whit Schonbein
Paper Information
- arXiv ID: 2602.15356v1
- Categories: cs.DC
- Published: February 17, 2026