[Paper] Integrating High Performance In-Memory Data Streaming and In-Situ Visualization in Hybrid MPI+OpenMP PIC MC Simulations Towards Exascale

Published: December 3, 2025
Source: arXiv - 2512.03914v1

Overview

The paper presents a major upgrade to BIT1, a Particle‑in‑Cell Monte‑Carlo (PIC‑MC) plasma simulation code, that makes it ready for exascale supercomputers. By combining OpenMP task‑based parallelism, the openPMD streaming API, and ADIOS2’s SST in‑memory transport, the authors dramatically cut I/O bottlenecks and enable real‑time, in‑situ visualization of plasma dynamics.

Key Contributions

  • Hybrid MPI + OpenMP particle mover: Refactors the core PIC algorithm to exploit fine‑grained task parallelism on many‑core CPUs.
  • openPMD streaming integration: Exposes simulation fields and particles through a standards‑based API, allowing seamless data export and checkpointing.
  • ADIOS2 SST in‑memory transport: Moves data directly between simulation and analysis/visualization processes without hitting the parallel file system.
  • Comprehensive performance profiling: Uses gprof, perf, IPM, and Darshan to quantify compute, communication, and I/O gains.
  • In‑situ visualization pipeline: Demonstrates real‑time visual analytics of turbulence and confinement phenomena while the simulation runs.

Methodology

  1. Code Refactoring – The original BIT1 particle mover, which was purely MPI‑based, was rewritten to launch OpenMP tasks for each particle batch (see the first sketch after this list). This lets the runtime schedule work across all cores, reducing idle time and improving cache reuse.
  2. Data Model Standardization – The authors adopted the openPMD (open standard for particle‑mesh data) API. All simulation state (fields, particle attributes, metadata) is described in a portable, self‑describing format.
  3. Streaming with ADIOS2 – Instead of writing checkpoint files to disk, BIT1 streams data through ADIOS2’s Sustainable Staging Transport (SST) engine. SST creates an in‑memory ring buffer that the analysis side can pull from, eliminating expensive POSIX I/O (a minimal producer sketch follows this list).
  4. Profiling & Benchmarking – A suite of profiling tools captures wall time, memory bandwidth, MPI traffic, and I/O patterns on a representative, exascale‑relevant test case (a turbulent plasma slab).
  5. In‑situ Visualization – The streamed data feeds a lightweight visualizer (e.g., ParaView Catalyst or a custom VTK pipeline) that renders field slices and particle phase‑space plots on the fly (a minimal consumer sketch also follows this list).
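
To make the task‑based refactoring of step 1 concrete, the following is a minimal, hypothetical sketch of a batched particle push with OpenMP tasks. It is not BIT1’s actual mover: the Particle struct, the uniform field Ex, the charge‑to‑mass ratio qm, and the batch size are illustrative assumptions; only the tasking pattern (one thread creates a task per particle batch, the whole team executes them) mirrors the approach described in the paper.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Particle { double x, vx; };   // illustrative 1D particle

// Push all particles one time step, creating one OpenMP task per batch so the
// runtime can balance the work across the cores of a node.
void move_particles(std::vector<Particle>& particles, double Ex,
                    double qm, double dt, std::size_t batch = 4096)
{
    #pragma omp parallel       // spawn the thread team once
    #pragma omp single nowait  // one thread generates tasks, all threads run them
    for (std::size_t begin = 0; begin < particles.size(); begin += batch) {
        std::size_t end = std::min(begin + batch, particles.size());

        // Each batch is an independent task the runtime can place on any idle core.
        #pragma omp task firstprivate(begin, end)
        for (std::size_t i = begin; i < end; ++i) {
            particles[i].vx += qm * Ex * dt;          // velocity update ("kick")
            particles[i].x  += particles[i].vx * dt;  // position update ("drift")
        }
    }   // the implicit barrier of the parallel region waits for all tasks
}
```

In the hybrid setup, each MPI rank would run such a tasked mover over its locally owned particles while MPI continues to handle the domain decomposition.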
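
For steps 2–3, a minimal producer‑side sketch is shown below, assuming the openPMD‑api C++ interface with ADIOS2 as the backend. The stream name bit1_stream.sst, the mesh E with component x, the grid size, and the step count are illustrative; the .sst file ending is one way to ask openPMD‑api for ADIOS2’s SST engine rather than on‑disk BP files. This is a sketch of the pattern, not the paper’s actual I/O code.

```cpp
#include <openPMD/openPMD.hpp>
#include <cstddef>
#include <vector>

int main()
{
    using namespace openPMD;

    // The ".sst" ending selects ADIOS2's SST streaming engine, so no
    // checkpoint files are written to the parallel file system.
    Series series("bit1_stream.sst", Access::CREATE);

    const std::size_t n = 1024;       // grid points (illustrative)
    std::vector<double> ex(n, 0.0);   // one field component per step

    for (std::size_t step = 0; step < 10; ++step) {
        // writeIterations() exposes the streaming-friendly iteration interface.
        Iteration it = series.writeIterations()[step];

        auto E_x = it.meshes["E"]["x"];
        E_x.resetDataset(Dataset(determineDatatype<double>(), {n}));
        E_x.storeChunk(ex, {0}, {n});  // stage this rank's chunk for the step

        it.close();                    // publish the step to the SST queue
    }
    return 0;
}
```

In the parallel code, each MPI rank would construct the Series with its communicator and store only its local chunk; the streaming pattern stays the same.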
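
On the analysis/visualization side (step 5), a consumer attaches to the same stream and processes each step as it arrives. The sketch below is again hypothetical: it reads the E/x field published by the producer sketch above and prints a trivial diagnostic (the maximum field magnitude) as a stand‑in for a real ParaView Catalyst or VTK rendering pipeline.

```cpp
#include <openPMD/openPMD.hpp>
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <iostream>
#include <memory>

int main()
{
    using namespace openPMD;

    // readIterations() consumes steps in order as the producer publishes them.
    Series series("bit1_stream.sst", Access::READ_ONLY);

    for (IndexedIteration it : series.readIterations()) {
        auto E_x = it.meshes["E"]["x"];
        Extent extent = E_x.getExtent();

        std::shared_ptr<double> data = E_x.loadChunk<double>({0}, extent);
        it.close();   // closing the step flushes the read; data is now valid

        double maxAbs = 0.0;
        for (std::size_t i = 0; i < extent[0]; ++i)
            maxAbs = std::max(maxAbs, std::abs(data.get()[i]));

        std::cout << "step " << it.iterationIndex
                  << "  max|E_x| = " << maxAbs << "\n";
    }
    return 0;
}
```

A production in‑situ pipeline would hand the loaded chunks to a renderer instead of printing a number, but the pull loop is the same.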

Results & Findings

| Metric | Traditional File I/O (BP4) | ADIOS2 SST Streaming |
| --- | --- | --- |
| End‑to‑end runtime (full 100 k‑step run) | 1.42 × baseline | 0.68 × baseline (≈ 52 % speed‑up) |
| Checkpoint size on parallel FS | 12 TB | 0 TB (data stays in memory) |
| Average I/O bandwidth | 1.8 GB/s (burst) | 6.3 GB/s (sustained) |
| Time to first visual insight | > 30 min (post‑run) | < 2 min (in‑situ) |

OpenMP tasking reduced the particle mover’s CPU‑utilization variance by ~30 %, while SST cut I/O wait time to near zero. Profiling showed a 22 % reduction in MPI collective overhead because checkpoint synchronization was eliminated.

Practical Implications

  • Accelerated Development Cycles – Fusion researchers can now iterate on physics models faster, seeing the impact of parameter changes in minutes rather than hours.
  • Lower Storage Costs – By avoiding massive checkpoint files, institutions can reduce the demand on expensive parallel file systems and archival storage.
  • Portable Data Pipelines – The openPMD API means the same simulation output can be consumed by any downstream tool (ML pipelines, dashboards, or other codes) without custom converters.
  • Scalable Real‑Time Monitoring – Operators of large‑scale experiments (e.g., ITER) could hook a live BIT1 stream to a control‑room dashboard, enabling on‑the‑fly adjustments to experimental conditions.
  • Template for Other Domains – The hybrid MPI + OpenMP + ADIOS2 pattern is directly applicable to climate, astrophysics, and CFD codes that suffer from similar I/O bottlenecks.

Limitations & Future Work

  • Memory Footprint – Keeping full‑resolution fields in memory for streaming requires careful sizing; the current implementation assumes nodes with ≥ 256 GB RAM.
  • Fault Tolerance – In‑memory streaming lacks the durability of disk checkpoints; the authors plan to add periodic durable snapshots to guard against node failures.
  • GPU Offloading – BIT1 is CPU‑centric; extending the task model to GPUs (e.g., using OpenMP target or CUDA streams) is a next step.
  • Scalability Tests Beyond 4 k Nodes – Preliminary results stop at 4 k nodes; the authors intend to validate the approach on full exascale systems (≥ 10 k nodes).

Bottom line: By marrying modern task parallelism with high‑performance streaming I/O, this work paves the way for truly interactive, exascale plasma simulations—turning what used to be a “run‑then‑analyze” workflow into a live, data‑driven discovery process.

Authors

  • Jeremy J. Williams
  • Stefan Costea
  • Daniel Medeiros
  • Jordy Trilaksono
  • Pratibha Hegde
  • David Tskhakaya
  • Leon Kos
  • Ales Podolnik
  • Jakub Hromadka
  • Kevin A. Huck
  • Allen D. Malony
  • Frank Jenko
  • Erwin Laure
  • Stefano Markidis

Paper Information

  • arXiv ID: 2512.03914v1
  • Categories: physics.plasm-ph, cs.DC, cs.PF
  • Published: December 3, 2025
  • PDF: Download PDF