[Paper] Radiation Hydrodynamics at Scale: Comparing MPI and Asynchronous Many-Task Runtimes with FleCSI
Source: arXiv - 2603.05366v1
Overview
The paper evaluates how the FleCSI framework (an abstraction layer that lets scientists write high‑level, task‑based code) performs when backed by three different parallel runtimes: classic MPI and the newer asynchronous many‑task runtimes (AMTRs) Legion and HPX. By benchmarking a simple Poisson solver and a full‑featured radiation‑hydrodynamics application (HARD) on up to 1024 nodes (≈ 131 k cores), the authors quantify the trade‑offs between ease of use and raw performance at extreme scale.
Key Contributions
- Unified benchmark suite for a communication‑heavy (Poisson) and a compute‑heavy (radiation hydrodynamics) application, both built on FleCSI.
- Head‑to‑head comparison of three backends (MPI, Legion, HPX) on up to 1024 nodes, exposing scaling behavior and overheads.
- Demonstration that FleCSI’s MPI backend incurs < 3 % overhead, achieving > 97 % parallel efficiency on weak‑scaled Poisson runs.
- Identification of Legion’s scaling bottlenecks (significant overhead, limited weak‑scale growth).
- Evidence that HPX can match or exceed MPI+Kokkos performance for compute‑intensive workloads, provided collective operations are tuned.
- Insight into how asynchronous tasking can hide communication latency and improve performance on modest node counts (< 64).
Methodology
- Framework Setup – The authors built both applications using FleCSI’s high‑level API, which automatically maps user tasks onto the chosen runtime (MPI, Legion, or HPX).
- Hardware & Scaling – Experiments ran on a Cray‑type cluster (dual‑socket CPUs, high‑speed interconnect) from 1 node up to 1024 nodes (≈ 131 k cores). Two scaling modes were used:
- Weak scaling – problem size per node stays constant, testing how well the runtime handles increasing communication volume.
- Strong scaling – total problem size fixed, measuring how efficiently the runtime reduces time‑to‑solution as more resources are added.
- Metrics – Parallel efficiency, total runtime, and speed‑up relative to the MPI+Kokkos baseline were recorded. The Poisson solver stresses the communication layer, while HARD stresses computation and memory bandwidth.
- Backend Configuration –
- MPI: standard MPI‑aware FleCSI + Kokkos for on‑node parallelism.
- Legion: FleCSI’s Legion backend with default task graph generation.
- HPX: FleCSI’s HPX backend, using HPX’s lightweight threads and futures for async execution; collective ops left at their current (non‑optimized) implementation.
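The efficiency and speed‑up metrics used throughout the results can be made concrete with a minimal sketch. The function names and the sample timings below are illustrative, not values or code from the paper:

```python
def weak_scaling_efficiency(t_1, t_n):
    """Weak scaling: problem size per node is constant, so the ideal
    runtime stays flat as nodes are added. Efficiency = T(1) / T(N)."""
    return t_1 / t_n

def strong_scaling_efficiency(t_1, t_n, n):
    """Strong scaling: total problem size is fixed, so the ideal
    runtime shrinks as 1/N. Efficiency = T(1) / (N * T(N))."""
    return t_1 / (n * t_n)

def speedup_vs_baseline(t_baseline, t_backend):
    """Speed-up of a backend relative to a baseline (here MPI+Kokkos);
    values > 1.0 mean the backend is faster."""
    return t_baseline / t_backend

# Illustrative timings only (seconds), not measurements from the paper:
eff_weak = weak_scaling_efficiency(10.0, 10.3)      # ~0.97, i.e. ~97 %
eff_strong = strong_scaling_efficiency(10.0, 0.011, 1024)
speedup = speedup_vs_baseline(12.0, 9.2)            # ~1.30
```

Under this definition, the paper's "> 97 % weak‑scale efficiency" means the per‑node runtime grew by less than 3 % going from 1 to 1024 nodes.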
Results & Findings
| Benchmark | Backend | Weak‑scale parallel efficiency | Speed‑up vs. MPI+Kokkos |
|---|---|---|---|
| Poisson (comm‑bound) | MPI | > 97 % (up to 131 072 cores) | Baseline |
| Poisson (comm‑bound) | Legion | Noticeable overhead; efficiency drops sharply beyond ~ 256 nodes | < 1.0 (slower) |
| Poisson (comm‑bound) | HPX | Marginal overhead; efficiency comparable to MPI | ≈ 0.98–1.02 (near parity) |
| HARD (compute‑bound) | MPI | Good scaling, but slower than HPX at small node counts | Baseline |
| HARD (compute‑bound) | HPX | Outperforms MPI+Kokkos on < 64 nodes (+31 % weak, +27 % strong) | 1.31 (weak), 1.27 (strong) |
| HARD (hydro‑only) | HPX | Up to +20 % vs. MPI, +64 % vs. MPI+Kokkos on < 32 nodes | 1.20 (vs. MPI), 1.64 (vs. MPI+Kokkos) |
| HARD (compute‑bound) | Legion | Similar to MPI for small runs, but scaling stalls beyond ~ 128 nodes | ≈ 1.0 (small), degrades at scale |
Key takeaways
- MPI remains the most robust choice for pure communication‑heavy workloads; FleCSI’s abstraction adds virtually no penalty.
- HPX shines when the workload is compute‑intensive and the node count is modest, leveraging asynchronous tasks to overlap work and communication.
- Legion’s current implementation in FleCSI is not yet ready for large‑scale weak scaling, likely due to task‑graph overhead and sub‑optimal data movement.
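The latency‑hiding effect behind HPX's advantage can be illustrated with a generic futures‑based sketch. This uses plain Python threads as stand‑ins for HPX futures; the function names and timings are hypothetical, not FleCSI or HPX API:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def exchange_halo(region):
    """Stand-in for a boundary (halo) exchange; the sleep mimics
    network latency on the interconnect."""
    time.sleep(0.05)
    return f"halo for {region}"

def compute_interior(region):
    """Stand-in for compute work on cells that need no neighbor data,
    so it can proceed while the exchange is in flight."""
    time.sleep(0.05)
    return f"interior of {region}"

with ThreadPoolExecutor(max_workers=1) as pool:
    start = time.perf_counter()
    # Launch the halo exchange asynchronously (like an HPX future)...
    halo = pool.submit(exchange_halo, "block-0")
    # ...and compute the interior concurrently on the main thread.
    interior = compute_interior("block-0")
    boundary_input = halo.result()  # block only when the data is needed
    elapsed = time.perf_counter() - start
```

Because the two 0.05 s phases overlap, the total is roughly max(0.05, 0.05) rather than their 0.10 s sum; a bulk‑synchronous schedule would pay both in sequence.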
Practical Implications
- For developers of large‑scale scientific codes (e.g., astrophysics, climate, CFD), FleCSI offers a single source that can be compiled against MPI, Legion, or HPX, letting teams experiment with different runtimes without rewriting core algorithms.
- HPX as a drop‑in replacement could be a win for applications that are compute‑bound and run on clusters with ≤ 64 nodes—common in early‑stage research or when budget constraints limit node count.
- MPI remains the safe default for production runs on supercomputers where communication dominates (e.g., multigrid solvers, global reductions).
- Performance‑critical sections (collective operations, reductions) may need custom tuning when using HPX; the paper’s results suggest that once HPX’s collectives are optimized, its advantage could extend to larger scales.
- Portability: By abstracting the runtime, FleCSI reduces the engineering effort required to target emerging exascale architectures that may favor task‑based runtimes over traditional MPI.
Limitations & Future Work
- Legion backend: The study reports significant scaling bottlenecks, indicating that FleCSI’s integration with Legion needs further optimization (e.g., better task‑graph partitioning, reduced runtime overhead).
- HPX collectives: Current non‑optimized collective operations limit scalability beyond ~ 64 nodes; future work should benchmark HPX with its upcoming collective improvements.
- Hardware diversity: Experiments were confined to a single CPU‑based cluster; extending the study to GPUs, ARM‑based nodes, or heterogeneous systems would clarify how FleCSI’s backends perform on emerging architectures.
- Application breadth: Only two codes were evaluated; adding more diverse workloads (e.g., irregular graph analytics, machine‑learning pipelines) would strengthen the generality of the conclusions.
- Energy efficiency: The paper focuses on runtime; measuring power consumption across backends could be valuable for exascale sustainability considerations.
Bottom line: FleCSI’s high‑level, task‑based programming model can deliver near‑native MPI performance while offering a pathway to experiment with modern asynchronous runtimes. For compute‑heavy scientific codes on modest node counts, HPX already shows a measurable speed‑up, whereas Legion still needs work before it can compete at scale. Developers can leverage this flexibility to future‑proof their codes as the HPC ecosystem evolves.
Authors
- Alexander Strack
- Hartmut Kaiser
- Dirk Pflüger
Paper Information
- arXiv ID: 2603.05366v1
- Categories: cs.DC
- Published: March 5, 2026