[Paper] Radiation Hydrodynamics at Scale: Comparing MPI and Asynchronous Many-Task Runtimes with FleCSI

Published: March 5, 2026 at 11:44 AM EST
5 min read
Source: arXiv - 2603.05366v1

Overview

The paper evaluates how the FleCSI framework—an abstraction layer that lets scientists write high‑level, task‑based code—performs when backed by three different parallel runtimes: classic MPI, and the newer asynchronous many‑task runtimes (AMTRs) Legion and HPX. By benchmarking a simple Poisson solver and a full‑featured radiation‑hydrodynamics application (HARD) on up to 1,024 nodes (≈131,072 cores), the authors quantify the trade‑offs between ease‑of‑use and raw performance at extreme scale.

Key Contributions

  • Unified benchmark suite for a communication‑heavy (Poisson) and a compute‑heavy (radiation hydrodynamics) application, both built on FleCSI.
  • Head‑to‑head comparison of three backends (MPI, Legion, HPX) on up to 1024 nodes, exposing scaling behavior and overheads.
  • Demonstration that FleCSI’s MPI backend incurs < 3 % overhead, achieving > 97 % parallel efficiency on weak‑scaled Poisson runs.
  • Identification of Legion’s scaling bottlenecks (significant overhead, limited weak‑scale growth).
  • Evidence that HPX can match or exceed MPI+Kokkos performance for compute‑intensive workloads, provided collective operations are tuned.
  • Insight into how asynchronous tasking can hide communication latency and improve performance on modest node counts (< 64).

Methodology

  1. Framework Setup – The authors built both applications using FleCSI’s high‑level API, which automatically maps user tasks onto the chosen runtime (MPI, Legion, or HPX).
  2. Hardware & Scaling – Experiments ran on a Cray‑type cluster (dual‑socket CPUs, high‑speed interconnect) from 1 node up to 1024 nodes (≈ 131 k cores). Two scaling modes were used:
    • Weak scaling – problem size per node stays constant, testing how well the runtime handles increasing communication volume.
    • Strong scaling – total problem size fixed, measuring how efficiently the runtime reduces time‑to‑solution as more resources are added.
  3. Metrics – Parallel efficiency, total runtime, and speed‑up relative to the MPI+Kokkos baseline were recorded. The Poisson solver stresses the communication layer, while HARD stresses computation and memory bandwidth.
  4. Backend Configuration
    • MPI: standard MPI‑aware FleCSI + Kokkos for on‑node parallelism.
    • Legion: FleCSI’s Legion backend with default task graph generation.
    • HPX: FleCSI’s HPX backend, using HPX’s lightweight threads and futures for async execution; collective ops left at their current (non‑optimized) implementation.
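The efficiency metrics in step 3 reduce to simple ratios of measured runtimes; a minimal sketch of how they are typically computed (the function names here are illustrative, not from the paper):

```python
def weak_scaling_efficiency(t1, tn):
    """Weak scaling: work per node is fixed, so the ideal runtime stays flat.
    Efficiency is the single-node time over the N-node time."""
    return t1 / tn

def strong_scaling_efficiency(t1, tn, n):
    """Strong scaling: total work is fixed, so the ideal runtime shrinks as 1/n.
    Efficiency is the ideal time (t1 / n) over the measured time."""
    return t1 / (n * tn)

# Example: a weak-scaled run whose per-node time grows from 10.0 s on one
# node to 10.2 s at full scale retains 10.0 / 10.2 ~= 98% efficiency.
```

Under this definition, the reported > 97 % weak‑scale efficiency for the MPI backend means per‑node runtime grew by less than about 3 % between one node and 1,024 nodes.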

Results & Findings

| Benchmark | Backend | Weak‑scale parallel efficiency | Speed‑up vs. MPI+Kokkos |
|---|---|---|---|
| Poisson (comm‑bound) | MPI | > 97 % (up to 131,072 cores) | Baseline |
| Poisson (comm‑bound) | Legion | Noticeable overhead; efficiency drops sharply after ~256 nodes | < 1.0 (slower) |
| Poisson (comm‑bound) | HPX | Marginal overhead; efficiency comparable to MPI | ≈ 0.98–1.02 (near parity) |
| HARD (compute‑bound) | MPI | Good scaling, but slower than HPX on small node counts | Baseline |
| HARD (compute‑bound) | HPX | Outperforms MPI+Kokkos on < 64 nodes (weak: +31 %; strong: +27 %) | 1.31 (weak), 1.27 (strong) |
| HARD (hydro‑only) | HPX | Up to +20 % vs. MPI, +64 % vs. MPI+Kokkos on < 32 nodes | 1.20 (vs. MPI), 1.64 (vs. MPI+Kokkos) |
| HARD (compute‑bound) | Legion | Similar to MPI for small runs, but scaling stalls beyond ~128 nodes | ≈ 1.0 (small), degrades later |

Key takeaways

  • MPI remains the most robust choice for pure communication‑heavy workloads; FleCSI’s abstraction adds virtually no penalty.
  • HPX shines when the workload is compute‑intensive and the node count is modest, leveraging asynchronous tasks to overlap work and communication.
  • Legion’s current implementation in FleCSI is not yet ready for large‑scale weak scaling, likely due to task‑graph overhead and sub‑optimal data movement.
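The latency hiding that benefits HPX can be illustrated with ordinary threads and futures. The following is a generic sketch (simulated latency, invented task names), not HPX or FleCSI code:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def halo_exchange(ghost):
    # Stand-in for a neighbor communication; the sleep models network latency.
    time.sleep(0.01)
    return [g + 1 for g in ghost]

def interior_update(cells):
    # Local compute that does not depend on the incoming ghost cells.
    return [c * 2 for c in cells]

def timestep(cells, ghost):
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(halo_exchange, ghost)  # communication in flight...
        interior = interior_update(cells)        # ...overlapped with compute
        return interior + fut.result()           # synchronize only at the end
```

In a synchronous model the two phases would serialize; with a future they run concurrently, which is the latency‑hiding effect the paper observes at modest node counts.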

Practical Implications

  • For developers of large‑scale scientific codes (e.g., astrophysics, climate, CFD), FleCSI offers a single source that can be compiled against MPI, Legion, or HPX, letting teams experiment with different runtimes without rewriting core algorithms.
  • HPX as a drop‑in replacement could be a win for applications that are compute‑bound and run on clusters with ≤ 64 nodes—common in early‑stage research or when budget constraints limit node count.
  • MPI remains the safe default for production runs on supercomputers where communication dominates (e.g., multigrid solvers, global reductions).
  • Performance‑critical sections (collective operations, reductions) may need custom tuning when using HPX; the paper’s results suggest that once HPX’s collectives are optimized, its advantage could extend to larger scales.
  • Portability: By abstracting the runtime, FleCSI reduces the engineering effort required to target emerging exascale architectures that may favor task‑based runtimes over traditional MPI.
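The single‑source idea can be mimicked in a few lines: a task is written once and the executor is a configuration choice. This sketch is loosely analogous to FleCSI's backend selection, but the names and dispatch mechanism below are illustrative, not FleCSI's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_serial(task, chunks):
    # Reference backend: apply the task to each data chunk in order.
    return [task(c) for c in chunks]

def run_threaded(task, chunks):
    # Thread-pool backend: same task, same chunks, different executor.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(task, chunks))

BACKENDS = {"serial": run_serial, "threaded": run_threaded}

def execute(task, chunks, backend="serial"):
    # The task is written once; swapping the backend changes how it runs,
    # not what it computes.
    return BACKENDS[backend](task, chunks)
```

Swapping `backend="serial"` for `backend="threaded"` leaves the task code untouched, which is the portability property the paper attributes to FleCSI across MPI, Legion, and HPX.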

Limitations & Future Work

  • Legion backend: The study reports significant scaling bottlenecks, indicating that FleCSI’s integration with Legion needs further optimization (e.g., better task‑graph partitioning, reduced runtime overhead).
  • HPX collectives: Current non‑optimized collective operations limit scalability beyond ~ 64 nodes; future work should benchmark HPX with its upcoming collective improvements.
  • Hardware diversity: Experiments were confined to a single CPU‑based cluster; extending the study to GPUs, ARM‑based nodes, or heterogeneous systems would clarify how FleCSI’s backends perform on emerging architectures.
  • Application breadth: Only two codes were evaluated; adding more diverse workloads (e.g., irregular graph analytics, machine‑learning pipelines) would strengthen the generality of the conclusions.
  • Energy efficiency: The paper focuses on runtime; measuring power consumption across backends could be valuable for exascale sustainability considerations.

Bottom line: FleCSI’s high‑level, task‑based programming model can deliver near‑native MPI performance while offering a pathway to experiment with modern asynchronous runtimes. For compute‑heavy scientific codes on modest node counts, HPX already shows a measurable speed‑up, whereas Legion still needs work before it can compete at scale. Developers can leverage this flexibility to future‑proof their codes as the HPC ecosystem evolves.

Authors

  • Alexander Strack
  • Hartmut Kaiser
  • Dirk Pflüger

Paper Information

  • arXiv ID: 2603.05366v1
  • Categories: cs.DC
  • Published: March 5, 2026