[Paper] Integrating High Performance In-Memory Data Streaming and In-Situ Visualization in Hybrid MPI+OpenMP PIC MC Simulations Towards Exascale

Published: December 3, 2025
Source: arXiv - 2512.03914v1

Overview

The paper presents a major upgrade to BIT1, a Particle‑in‑Cell Monte‑Carlo (PIC‑MC) plasma simulation code, that makes it ready for exascale supercomputers. By combining OpenMP task‑based parallelism, the openPMD streaming API, and ADIOS2’s SST in‑memory transport, the authors dramatically cut I/O bottlenecks and enable real‑time, in‑situ visualization of plasma dynamics.

Key Contributions

  • Hybrid MPI + OpenMP particle mover: Refactors the core PIC algorithm to exploit fine‑grained task parallelism on many‑core CPUs.
  • openPMD streaming integration: Exposes simulation fields and particles through a standards‑based API, allowing seamless data export and checkpointing.
  • ADIOS2 SST in‑memory transport: Moves data directly between simulation and analysis/visualization processes without hitting the parallel file system.
  • Comprehensive performance profiling: Uses gprof, perf, IPM, and Darshan to quantify compute, communication, and I/O gains.
  • In‑situ visualization pipeline: Demonstrates real‑time visual analytics of turbulence and confinement phenomena while the simulation runs.

Methodology

  1. Code Refactoring – The original BIT1 particle mover, which was purely MPI‑based, was rewritten to launch OpenMP tasks for each particle batch (see the first sketch after this list). This lets the runtime schedule work across all cores, reducing idle time and improving cache reuse.
  2. Data Model Standardization – The authors adopted the openPMD (open standard for particle‑mesh data) API. All simulation state (fields, particle attributes, metadata) is described in a portable, self‑describing format.
  3. Streaming with ADIOS2 – Instead of writing checkpoint files to disk, BIT1 streams data through ADIOS2’s Sustainable Staging Transport (SST) engine. SST creates an in‑memory ring buffer that the analysis side can pull from, eliminating expensive POSIX I/O (a minimal producer sketch follows this list).
  4. Profiling & Benchmarking – A suite of profiling tools captures wall time, memory bandwidth, MPI traffic, and I/O patterns on a representative, exascale‑relevant test case (a turbulent plasma slab).
  5. In‑situ Visualization – The streamed data feeds a lightweight visualizer (e.g., ParaView Catalyst or a custom VTK pipeline) that renders field slices and particle phase‑space plots on the fly (a minimal consumer sketch also follows this list).
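
To make the task‑based refactoring of step 1 concrete, the following is a minimal, hypothetical sketch of a batched particle push with OpenMP tasks. It is not BIT1’s actual mover: the Particle struct, the uniform field Ex, the charge‑to‑mass ratio qm, and the batch size are illustrative assumptions; only the tasking pattern (one thread creates a task per particle batch, the whole team executes them) mirrors the approach described in the paper.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Particle { double x, vx; };   // illustrative 1D particle

// Push all particles one time step, creating one OpenMP task per batch so the
// runtime can balance the work across the cores of a node.
void move_particles(std::vector<Particle>& particles, double Ex,
                    double qm, double dt, std::size_t batch = 4096)
{
    #pragma omp parallel       // spawn the thread team once
    #pragma omp single nowait  // one thread generates tasks, all threads run them
    for (std::size_t begin = 0; begin < particles.size(); begin += batch) {
        std::size_t end = std::min(begin + batch, particles.size());

        // Each batch is an independent task the runtime can place on any idle core.
        #pragma omp task firstprivate(begin, end)
        for (std::size_t i = begin; i < end; ++i) {
            particles[i].vx += qm * Ex * dt;          // velocity update ("kick")
            particles[i].x  += particles[i].vx * dt;  // position update ("drift")
        }
    }   // the implicit barrier of the parallel region waits for all tasks
}
```

In the hybrid setup, each MPI rank would run such a tasked mover over its locally owned particles while MPI continues to handle the domain decomposition.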
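
For steps 2–3, a minimal producer‑side sketch is shown below, assuming the openPMD‑api C++ interface with ADIOS2 as the backend. The stream name bit1_stream.sst, the mesh E with component x, the grid size, and the step count are illustrative; the .sst file ending is one way to ask openPMD‑api for ADIOS2’s SST engine rather than on‑disk BP files. This is a sketch of the pattern, not the paper’s actual I/O code.

```cpp
#include <openPMD/openPMD.hpp>
#include <cstddef>
#include <vector>

int main()
{
    using namespace openPMD;

    // The ".sst" ending selects ADIOS2's SST streaming engine, so no
    // checkpoint files are written to the parallel file system.
    Series series("bit1_stream.sst", Access::CREATE);

    const std::size_t n = 1024;       // grid points (illustrative)
    std::vector<double> ex(n, 0.0);   // one field component per step

    for (std::size_t step = 0; step < 10; ++step) {
        // writeIterations() exposes the streaming-friendly iteration interface.
        Iteration it = series.writeIterations()[step];

        auto E_x = it.meshes["E"]["x"];
        E_x.resetDataset(Dataset(determineDatatype<double>(), {n}));
        E_x.storeChunk(ex, {0}, {n});  // stage this rank's chunk for the step

        it.close();                    // publish the step to the SST queue
    }
    return 0;
}
```

In the parallel code, each MPI rank would construct the Series with its communicator and store only its local chunk; the streaming pattern stays the same.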
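
On the analysis/visualization side (step 5), a consumer attaches to the same stream and processes each step as it arrives. The sketch below is again hypothetical: it reads the E/x field published by the producer sketch above and prints a trivial diagnostic (the maximum field magnitude) as a stand‑in for a real ParaView Catalyst or VTK rendering pipeline.

```cpp
#include <openPMD/openPMD.hpp>
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <iostream>
#include <memory>

int main()
{
    using namespace openPMD;

    // readIterations() consumes steps in order as the producer publishes them.
    Series series("bit1_stream.sst", Access::READ_ONLY);

    for (IndexedIteration it : series.readIterations()) {
        auto E_x = it.meshes["E"]["x"];
        Extent extent = E_x.getExtent();

        std::shared_ptr<double> data = E_x.loadChunk<double>({0}, extent);
        it.close();   // closing the step flushes the read; data is now valid

        double maxAbs = 0.0;
        for (std::size_t i = 0; i < extent[0]; ++i)
            maxAbs = std::max(maxAbs, std::abs(data.get()[i]));

        std::cout << "step " << it.iterationIndex
                  << "  max|E_x| = " << maxAbs << "\n";
    }
    return 0;
}
```

A production in‑situ pipeline would hand the loaded chunks to a renderer instead of printing a number, but the pull loop is the same.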

Results & Findings

| Metric | Traditional File I/O (BP4) | ADIOS2 SST Streaming |
| --- | --- | --- |
| End‑to‑end runtime (full 100 k‑step run) | 1.42 × baseline | 0.68 × baseline (≈ 52 % speed‑up) |
| Checkpoint size on parallel FS | 12 TB | 0 TB (data stays in memory) |
| Average I/O bandwidth | 1.8 GB/s (burst) | 6.3 GB/s (sustained) |
| Time to first visual insight | > 30 min (post‑run) | < 2 min (in‑situ) |

OpenMP tasking reduced the particle mover’s CPU‑utilization variance by ~30 %, while SST cut I/O wait time to near zero. Profiling showed a 22 % reduction in MPI collective overhead because checkpoint synchronization was eliminated.

Practical Implications

  • Accelerated Development Cycles – Fusion researchers can now iterate on physics models faster, seeing the impact of parameter changes in minutes rather than hours.
  • Lower Storage Costs – By avoiding massive checkpoint files, institutions can reduce the demand on expensive parallel file systems and archival storage.
  • Portable Data Pipelines – The openPMD API means the same simulation output can be consumed by any downstream tool (ML pipelines, dashboards, or other codes) without custom converters.
  • Scalable Real‑Time Monitoring – Operators of large‑scale experiments (e.g., ITER) could hook a live BIT1 stream to a control‑room dashboard, enabling on‑the‑fly adjustments to experimental conditions.
  • Template for Other Domains – The hybrid MPI + OpenMP + ADIOS2 pattern is directly applicable to climate, astrophysics, and CFD codes that suffer from similar I/O bottlenecks.

Limitations & Future Work

  • Memory Footprint – Keeping full‑resolution fields in memory for streaming requires careful sizing; the current implementation assumes nodes with ≥ 256 GB RAM.
  • Fault Tolerance – In‑memory streaming lacks the durability of disk checkpoints; the authors plan to add periodic durable snapshots to guard against node failures.
  • GPU Offloading – BIT1 is CPU‑centric; extending the task model to GPUs (e.g., using OpenMP target or CUDA streams) is a next step.
  • Scalability Tests Beyond 4 k Nodes – Preliminary results stop at 4 k nodes; the authors intend to validate the approach on full exascale systems (≥ 10 k nodes).

Bottom line: By marrying modern task parallelism with high‑performance streaming I/O, this work paves the way for truly interactive, exascale plasma simulations—turning what used to be a “run‑then‑analyze” workflow into a live, data‑driven discovery process.

Authors

  • Jeremy J. Williams
  • Stefan Costea
  • Daniel Medeiros
  • Jordy Trilaksono
  • Pratibha Hegde
  • David Tskhakaya
  • Leon Kos
  • Ales Podolnik
  • Jakub Hromadka
  • Kevin A. Huck
  • Allen D. Malony
  • Frank Jenko
  • Erwin Laure
  • Stefano Markidis

Paper Information

  • arXiv ID: 2512.03914v1
  • Categories: physics.plasm-ph, cs.DC, cs.PF
  • Published: December 3, 2025
  • PDF: Download PDF