[Paper] Solutions for Distributed Memory Access Mechanism on HPC Clusters

Published: December 2, 2025 at 04:15 AM EST
3 min read

Source: arXiv - 2512.02546v1

Overview

The paper by Jan Meizner and Maciej Malawski investigates how different remote‑memory access (RMA) mechanisms perform on modern high‑performance computing (HPC) clusters. By benchmarking shared‑storage approaches against MPI‑based RMA over both InfiniBand and the newer Slingshot interconnect, the authors show that, surprisingly, MPI‑backed remote accesses can be almost as fast as local memory reads—opening the door for more flexible, memory‑centric designs in scientific and medical workloads.

Key Contributions

  • Comprehensive evaluation of three remote‑memory access strategies (shared storage, MPI over InfiniBand, MPI over Slingshot) on two production‑grade HPC clusters.
  • Performance comparison against baseline local memory access, revealing that MPI‑based RMA can achieve near‑local latency and bandwidth.
  • Use‑case analysis focusing on medical imaging and data‑intensive simulations that benefit from low‑overhead remote memory.
  • Practical guidelines for selecting the appropriate RMA mechanism based on interconnect, workload characteristics, and system topology.

Methodology

  1. Testbeds – Two distinct HPC clusters were used: one equipped with a traditional InfiniBand fabric and another with the newer Slingshot network. Both run a standard Linux stack and support MPI‑3 RMA operations.
  2. Remote‑memory scenarios – Three access patterns were implemented:
    • Shared storage (e.g., NFS/GPFS) where remote data is read/written through a file system.
    • MPI RMA over InfiniBand using one‑sided MPI_Get/MPI_Put.
    • MPI RMA over Slingshot exploiting its low‑latency, high‑throughput capabilities.
  3. Benchmarks – Micro‑benchmarks measured latency, bandwidth, and throughput for varying message sizes (from a few bytes up to several megabytes); a minimal sketch of such a one‑sided latency probe follows this list. Real‑world kernels from medical imaging pipelines (e.g., 3‑D reconstruction) were also executed to validate the findings.
  4. Analysis – Results were normalized against a local‑memory baseline (direct DRAM access) to quantify the overhead introduced by each remote‑access method.
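
The summary above describes the micro‑benchmarks only in prose; the fragment below is a minimal sketch, in C with MPI‑3 one‑sided operations, of how such a latency probe could look. The message size, iteration count, and passive‑target synchronization (MPI_Win_lock / MPI_Win_flush) are illustrative assumptions, not the authors' exact harness.

```c
/* Illustrative one-sided latency probe (a sketch, not the paper's code).
 * Rank 0 repeatedly MPI_Get()s a small buffer exposed by rank 1 and
 * reports the mean per-operation latency. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    if (nranks < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    const int msg_bytes = 64;     /* assumed message size (the 64 B case) */
    const int iters     = 10000;  /* assumed repetition count */

    char *win_buf = NULL;
    MPI_Win win;
    /* Every rank exposes msg_bytes of memory; only rank 1's window is read. */
    MPI_Win_allocate(msg_bytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD,
                     &win_buf, &win);

    char *local = malloc(msg_bytes);
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0) {
        /* Passive-target epoch: lock rank 1's window once, flush after
         * every MPI_Get so each iteration completes at the target. */
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            MPI_Get(local, msg_bytes, MPI_BYTE, 1, 0, msg_bytes, MPI_BYTE, win);
            MPI_Win_flush(1, win);
        }
        double t1 = MPI_Wtime();
        MPI_Win_unlock(1, win);
        printf("avg one-sided get latency: %.3f us\n", (t1 - t0) / iters * 1e6);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    free(local);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Compiled with mpicc and launched with the two ranks placed on different nodes (so that the network fabric, not shared memory, is exercised), a probe of this shape yields per‑operation latencies of the kind reported in the results below.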

Results & Findings

  • Latency: MPI RMA over Slingshot achieved an average one‑sided latency of ~0.8 µs for 64 B messages, only ~15 % higher than local DRAM access. InfiniBand was slightly slower (~1.1 µs). Shared‑storage latency was an order of magnitude higher (>10 µs).
  • Bandwidth: For large transfers (≥1 MiB), both MPI approaches saturated the network links, delivering ~90 % of the theoretical peak bandwidth (≈100 GB/s on Slingshot, ≈80 GB/s on InfiniBand). Shared storage peaked at ~30 GB/s due to file‑system overhead.
  • Application impact: In the medical imaging case study, end‑to‑end runtime improved by 22 % when switching from shared storage to MPI RMA on Slingshot, matching the performance of a fully in‑memory implementation.
  • Scalability: Performance held steady up to 256 nodes, indicating that the mechanisms scale well with cluster size.

Practical Implications

  • Simplified data placement: Developers can design algorithms that treat remote memory almost like local memory, reducing the need for explicit data staging or replication.
  • Cost‑effective scaling: By leveraging existing MPI runtimes, organizations can avoid investing in specialized remote‑memory hardware while still achieving near‑local performance.
  • Medical and AI workloads: High‑resolution imaging, genomics, and deep‑learning pipelines that require rapid access to massive datasets can benefit from MPI RMA, especially on clusters equipped with Slingshot or comparable low‑latency fabrics.
  • Hybrid programming models: The findings encourage mixing traditional message‑passing with one‑sided RMA calls, as sketched below, enabling more expressive and potentially more performant code without a steep learning curve.
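
As a rough illustration of that hybrid style (an assumed fragment, not code from the paper), the sketch below moves the bulk payload with a one‑sided MPI_Put while an ordinary two‑sided message carries a small piece of metadata; the payload size and fence‑based synchronization are arbitrary choices for the example.

```c
/* Hybrid sketch (assumption, not the paper's code): one-sided MPI_Put for
 * bulk data, two-sided MPI_Send/MPI_Recv for a small control message.
 * Run with at least two ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;            /* assumed payload: 1 Mi doubles (8 MiB) */

    double *win_buf;
    MPI_Win win;
    MPI_Win_allocate((MPI_Aint)n * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win_buf, &win);

    MPI_Win_fence(0, win);            /* open an access epoch on all ranks */

    if (rank == 0) {
        double *src = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) src[i] = (double)i;
        /* One-sided: push the payload straight into rank 1's window. */
        MPI_Put(src, n, MPI_DOUBLE, 1, 0, n, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);        /* complete the epoch; payload visible on rank 1 */
        free(src);

        int count = n;                /* two-sided: small metadata message */
        MPI_Send(&count, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else {
        MPI_Win_fence(0, win);        /* matching fence on every other rank */
        if (rank == 1) {
            int count;
            MPI_Recv(&count, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1: received %d doubles, last = %.1f\n",
                   count, win_buf[count - 1]);
        }
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

The point of the mix is that the cheap notification stays in the familiar two‑sided model, while the large transfer uses the one‑sided path whose near‑local performance the paper measures.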

Limitations & Future Work

  • Hardware dependency: The near‑local performance hinges on high‑end interconnects (InfiniBand/Slingshot). Results may differ on Ethernet‑based clusters.
  • File‑system variability: Only a single shared‑storage configuration was tested; different parallel file systems (e.g., Lustre, BeeGFS) could yield different outcomes.
  • Security & isolation: The paper does not address access control or memory protection mechanisms required for multi‑tenant environments.
  • Future directions: The authors suggest exploring RDMA‑direct storage, integrating RMA with emerging programming models (e.g., PGAS languages), and extending the study to heterogeneous nodes (CPU + GPU) where remote memory could span across device memories.

Authors

  • Jan Meizner
  • Maciej Malawski

Paper Information

  • arXiv ID: 2512.02546v1
  • Categories: cs.DC
  • Published: December 2, 2025