[Paper] Radiation Hydrodynamics at Scale: Comparing MPI and Asynchronous Many-Task Runtimes with FleCSI
Source: arXiv - 2603.05366v1
Overview
The paper evaluates how the FleCSI framework (an abstraction layer that lets scientists write high‑level, task‑based code) performs when backed by three different parallel runtimes: classic MPI and the newer asynchronous many‑task runtimes (AMTRs) Legion and HPX. By benchmarking a simple Poisson solver and a full‑featured radiation‑hydrodynamics application (HARD) on up to 1024 nodes (≈ 131 k cores), the authors quantify the trade‑offs between ease of use and raw performance at extreme scale.
Key Contributions
- Unified benchmark suite for a communication‑heavy (Poisson) and a compute‑heavy (radiation hydrodynamics) application, both built on FleCSI.
- Head‑to‑head comparison of three backends (MPI, Legion, HPX) on up to 1024 nodes, exposing scaling behavior and overheads.
- Demonstration that FleCSI’s MPI backend incurs < 3 % overhead, achieving > 97 % parallel efficiency on weak‑scaled Poisson runs.
- Identification of Legion’s scaling bottlenecks (significant overhead, limited weak‑scale growth).
- Evidence that HPX can match or exceed MPI+Kokkos performance for compute‑intensive workloads, provided collective operations are tuned.
- Insight into how asynchronous tasking can hide communication latency and improve performance on modest node counts (< 64).
Methodology
- Framework Setup – The authors built both applications using FleCSI’s high‑level API, which automatically maps user tasks onto the chosen runtime (MPI, Legion, or HPX).
- Hardware & Scaling – Experiments ran on a Cray‑type cluster (dual‑socket CPUs, high‑speed interconnect) from 1 node up to 1024 nodes (≈ 131 k cores). Two scaling modes were used:
- Weak scaling – problem size per node stays constant, testing how well the runtime handles increasing communication volume.
- Strong scaling – total problem size fixed, measuring how efficiently the runtime reduces time‑to‑solution as more resources are added.
- Metrics – Parallel efficiency, total runtime, and speed‑up relative to the MPI+Kokkos baseline were recorded. The Poisson solver stresses the communication layer, while HARD stresses computation and memory bandwidth.
- Backend Configuration –
- MPI: standard MPI‑aware FleCSI + Kokkos for on‑node parallelism.
- Legion: FleCSI’s Legion backend with default task graph generation.
- HPX: FleCSI’s HPX backend, using HPX’s lightweight threads and futures for async execution; collective ops left at their current (non‑optimized) implementation.
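The efficiency and speed‑up metrics used throughout the results can be made concrete with a minimal sketch. The function names and the sample timings below are illustrative, not values or code from the paper:

```python
def weak_scaling_efficiency(t_1, t_n):
    """Weak scaling: problem size per node is constant, so the ideal
    runtime stays flat as nodes are added. Efficiency = T(1) / T(N)."""
    return t_1 / t_n

def strong_scaling_efficiency(t_1, t_n, n):
    """Strong scaling: total problem size is fixed, so the ideal
    runtime shrinks as 1/N. Efficiency = T(1) / (N * T(N))."""
    return t_1 / (n * t_n)

def speedup_vs_baseline(t_baseline, t_backend):
    """Speed-up of a backend relative to a baseline (here MPI+Kokkos);
    values > 1.0 mean the backend is faster."""
    return t_baseline / t_backend

# Illustrative timings only (seconds), not measurements from the paper:
eff_weak = weak_scaling_efficiency(10.0, 10.3)      # ~0.97, i.e. ~97 %
eff_strong = strong_scaling_efficiency(10.0, 0.011, 1024)
speedup = speedup_vs_baseline(12.0, 9.2)            # ~1.30
```

Under this definition, the paper's "> 97 % weak‑scale efficiency" means the per‑node runtime grew by less than 3 % going from 1 to 1024 nodes.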
Results & Findings
| Benchmark | Backend | Weak‑scale parallel efficiency | Speed‑up vs. MPI+Kokkos |
|---|---|---|---|
| Poisson (comm‑bound) | MPI | > 97 % (up to 131 072 cores) | Baseline |
| Poisson (comm‑bound) | Legion | Noticeable overhead; efficiency drops sharply beyond ~ 256 nodes | < 1.0 (slower) |
| Poisson (comm‑bound) | HPX | Marginal overhead; efficiency comparable to MPI | ≈ 0.98–1.02 (near parity) |
| HARD (compute‑bound) | MPI | Good scaling, but slower than HPX at small node counts | Baseline |
| HARD (compute‑bound) | HPX | Outperforms MPI+Kokkos on < 64 nodes (+31 % weak, +27 % strong) | 1.31 (weak), 1.27 (strong) |
| HARD (hydro‑only) | HPX | Up to +20 % vs. MPI, +64 % vs. MPI+Kokkos on < 32 nodes | 1.20 (vs. MPI), 1.64 (vs. MPI+Kokkos) |
| HARD (compute‑bound) | Legion | Similar to MPI for small runs, but scaling stalls beyond ~ 128 nodes | ≈ 1.0 (small), degrades at scale |
Key takeaways
- MPI remains the most robust choice for pure communication‑heavy workloads; FleCSI’s abstraction adds virtually no penalty.
- HPX shines when the workload is compute‑intensive and the node count is modest, leveraging asynchronous tasks to overlap work and communication.
- Legion’s current implementation in FleCSI is not yet ready for large‑scale weak scaling, likely due to task‑graph overhead and sub‑optimal data movement.
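The latency‑hiding effect behind HPX's advantage can be illustrated with a generic futures‑based sketch. This uses plain Python threads as stand‑ins for HPX futures; the function names and timings are hypothetical, not FleCSI or HPX API:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def exchange_halo(region):
    """Stand-in for a boundary (halo) exchange; the sleep mimics
    network latency on the interconnect."""
    time.sleep(0.05)
    return f"halo for {region}"

def compute_interior(region):
    """Stand-in for compute work on cells that need no neighbor data,
    so it can proceed while the exchange is in flight."""
    time.sleep(0.05)
    return f"interior of {region}"

with ThreadPoolExecutor(max_workers=1) as pool:
    start = time.perf_counter()
    # Launch the halo exchange asynchronously (like an HPX future)...
    halo = pool.submit(exchange_halo, "block-0")
    # ...and compute the interior concurrently on the main thread.
    interior = compute_interior("block-0")
    boundary_input = halo.result()  # block only when the data is needed
    elapsed = time.perf_counter() - start
```

Because the two 0.05 s phases overlap, the total is roughly max(0.05, 0.05) rather than their 0.10 s sum; a bulk‑synchronous schedule would pay both in sequence.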
Practical Implications
- For developers of large‑scale scientific codes (e.g., astrophysics, climate, CFD), FleCSI offers a single source that can be compiled against MPI, Legion, or HPX, letting teams experiment with different runtimes without rewriting core algorithms.
- HPX as a drop‑in replacement could be a win for applications that are compute‑bound and run on clusters with ≤ 64 nodes—common in early‑stage research or when budget constraints limit node count.
- MPI remains the safe default for production runs on supercomputers where communication dominates (e.g., multigrid solvers, global reductions).
- Performance‑critical sections (collective operations, reductions) may need custom tuning when using HPX; the paper’s results suggest that once HPX’s collectives are optimized, its advantage could extend to larger scales.
- Portability: By abstracting the runtime, FleCSI reduces the engineering effort required to target emerging exascale architectures that may favor task‑based runtimes over traditional MPI.
Limitations & Future Work
- Legion backend: The study reports significant scaling bottlenecks, indicating that FleCSI’s integration with Legion needs further optimization (e.g., better task‑graph partitioning, reduced runtime overhead).
- HPX collectives: Current non‑optimized collective operations limit scalability beyond ~ 64 nodes; future work should benchmark HPX with its upcoming collective improvements.
- Hardware diversity: Experiments were confined to a single CPU‑based cluster; extending the study to GPUs, ARM‑based nodes, or heterogeneous systems would clarify how FleCSI’s backends perform on emerging architectures.
- Application breadth: Only two codes were evaluated; adding more diverse workloads (e.g., irregular graph analytics, machine‑learning pipelines) would strengthen the generality of the conclusions.
- Energy efficiency: The paper focuses on runtime; measuring power consumption across backends could be valuable for exascale sustainability considerations.
Bottom line: FleCSI’s high‑level, task‑based programming model can deliver near‑native MPI performance while offering a pathway to experiment with modern asynchronous runtimes. For compute‑heavy scientific codes on modest node counts, HPX already shows a measurable speed‑up, whereas Legion still needs work before it can compete at scale. Developers can leverage this flexibility to future‑proof their codes as the HPC ecosystem evolves.
Authors
- Alexander Strack
- Hartmut Kaiser
- Dirk Pflüger
Paper Information
- arXiv ID: 2603.05366v1
- Categories: cs.DC
- Published: March 5, 2026