[Paper] GPU-Accelerated Simulations of Problems with Moving Boundaries and Fluid-Structure Interaction at Extreme Scales

Published: May 5, 2026 at 06:41 PM EDT

Source: arXiv - 2605.04335v1

Overview

The paper presents a GPU‑optimized implementation of the sharp‑interface immersed‑boundary (IB) method, enabling large‑scale simulations of fluid flow around both static and moving bodies. By combining modern GPU programming models (OpenACC, CUDA, NCCL) with MPI, the authors report a roughly 20× speedup over a CPU baseline and better than 90 % scaling efficiency on problems ranging from tens of millions to a billion grid points. This makes high‑fidelity fluid‑structure interaction (FSI) studies feasible on current supercomputing clusters.

Key Contributions

GPU‑accelerated IB solver built on the ViCar3D framework, using a hybrid OpenACC/CUDA approach for maximum portability and performance.
Multi‑GPU orchestration via CUDA streams and NCCL communicators, delivering >90 % strong‑ and weak‑scaling efficiency on up to 64 GPUs.
20× speedup over the best existing CPU‑only implementation for comparable problem sizes.
Demonstration on an extreme‑scale FSI case: turbulent flow over a flapping bat wing at Reynolds number 5,000, showcasing the ability to handle complex, deforming geometries in realistic regimes.
Comprehensive performance evaluation spanning grid sizes from ~10 M to ~1 B cells, providing a clear roadmap for scaling to future exascale systems.

Methodology

The authors adopt the sharp‑interface immersed‑boundary method, which embeds the geometry of solid bodies directly into a Cartesian fluid grid, avoiding costly body‑fitted meshing. Their implementation follows these steps:

Domain discretization on a uniform Cartesian mesh; solid surfaces are represented by a set of Lagrangian markers.
Force spreading and velocity interpolation between the Eulerian fluid grid and Lagrangian markers using regularized delta functions.
Navier–Stokes solver (explicit/implicit time integration) advances the fluid field on the GPU, with pressure Poisson solves handled by a GPU‑friendly multigrid or conjugate‑gradient routine.
GPU parallelism is achieved with OpenACC directives for most loops, while performance‑critical kernels (e.g., force spreading, interpolation, Poisson solve) are hand‑tuned in CUDA.
Multi‑node scaling relies on MPI for domain decomposition and NCCL for GPU‑to‑GPU communication, with overlapping computation and communication via CUDA streams.
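
The interpolation and spreading in step 2 can be illustrated in one dimension. The sketch below is not the authors' code: the text does not say which regularized delta function is used, so Peskin's 4‑point kernel is assumed here, and the function names (peskin_phi, interpolate, spread) are hypothetical.

```python
import math

def peskin_phi(r):
    """Peskin's 4-point regularized delta kernel, r measured in grid spacings.
    ASSUMPTION: the source does not name the kernel; this is one common choice."""
    a = abs(r)
    if a <= 1.0:
        return (3.0 - 2.0 * a + math.sqrt(1.0 + 4.0 * a - 4.0 * a * a)) / 8.0
    if a <= 2.0:
        return (5.0 - 2.0 * a - math.sqrt(-7.0 + 12.0 * a - 4.0 * a * a)) / 8.0
    return 0.0

def interpolate(u, h, X):
    """Interpolate Eulerian values u[j] (at x_j = j*h) to a Lagrangian marker at X."""
    j0 = math.floor(X / h)
    total = 0.0
    for j in range(j0 - 1, j0 + 3):      # the 4 grid points inside the kernel support
        if 0 <= j < len(u):
            total += u[j] * peskin_phi((X - j * h) / h)
    return total

def spread(F, h, X, n):
    """Spread a marker force F located at X onto an n-point grid as a force density."""
    f = [0.0] * n
    j0 = math.floor(X / h)
    for j in range(j0 - 1, j0 + 3):
        if 0 <= j < n:
            f[j] += F * peskin_phi((X - j * h) / h) / h
    return f

h = 0.1
u = [j * h for j in range(20)]           # linear test field u(x) = x
X = 0.537                                # hypothetical marker position
print(interpolate(u, h, X))              # close to 0.537: the kernel reproduces linear fields
print(sum(spread(2.0, h, X, 20)) * h)    # close to 2.0: spreading conserves the total force
```

Using the same kernel for both operations makes spreading the discrete adjoint of interpolation, which is why the total force is conserved in the second check.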

The design keeps the code portable (OpenACC works on a range of accelerators) yet allows low‑level CUDA optimizations where needed.
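
To make the conjugate‑gradient option for the pressure Poisson solve more concrete, here is a minimal matrix‑free sketch on a 1D model problem. It is an illustration under simplifying assumptions, not the solver from the paper: it runs on the CPU, uses no preconditioner, and applies zero Dirichlet boundaries (pressure solves in incompressible flow typically use Neumann conditions). All function names are hypothetical.

```python
def apply_neg_laplacian(p, h):
    """Matrix-free 1D operator (-p[i-1] + 2 p[i] - p[i+1]) / h^2 with zero Dirichlet walls."""
    n = len(p)
    out = [0.0] * n
    for i in range(n):
        left = p[i - 1] if i > 0 else 0.0
        right = p[i + 1] if i < n - 1 else 0.0
        out[i] = (2.0 * p[i] - left - right) / (h * h)
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def conjugate_gradient(rhs, h, tol=1e-10, max_iter=1000):
    """Solve A p = rhs, where A is the operator above. Returns (solution, iterations)."""
    p = [0.0] * len(rhs)     # initial guess
    r = list(rhs)            # residual rhs - A p, with p = 0
    d = list(r)              # search direction
    rr = dot(r, r)
    for it in range(max_iter):
        if rr ** 0.5 < tol:
            return p, it
        Ad = apply_neg_laplacian(d, h)
        alpha = rr / dot(d, Ad)
        p = [pi + alpha * di for pi, di in zip(p, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        rr_new = dot(r, r)
        d = [ri + (rr_new / rr) * di for ri, di in zip(r, d)]
        rr = rr_new
    return p, max_iter

n, h = 31, 1.0 / 32          # interior points x_i = (i + 1) * h on the unit interval
p, iters = conjugate_gradient([1.0] * n, h)
exact = [(i + 1) * h * (1.0 - (i + 1) * h) / 2.0 for i in range(n)]
print(iters, max(abs(a - b) for a, b in zip(p, exact)))
```

Each iteration needs only one operator application plus a few dot products and vector updates, which is the kind of work that maps naturally onto GPU kernels and collective reductions.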

Results & Findings

Performance: For a 256³ grid (~16 M cells), the GPU version runs ~20× faster than the CPU baseline. Scaling tests up to a 1024³ grid (~1 B cells) maintain >90 % parallel efficiency on up to 64 GPUs.
Strong scaling: Doubling the GPU count roughly halves the wall‑clock time until NCCL communication overhead becomes dominant.
Weak scaling: Keeping the per‑GPU workload constant, total runtime stays nearly flat as the problem size grows, confirming the effectiveness of the domain decomposition and NCCL communication strategy.
Physical validation: The flapping bat‑wing simulation reproduces expected turbulent structures and wing‑induced vortex shedding at Re = 5,000, demonstrating that the speed gains do not compromise solution fidelity.
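
The grid sizes and efficiency figures quoted above follow from simple arithmetic. The sketch below uses the standard definitions of strong and weak scaling efficiency; the timings are hypothetical placeholders chosen for illustration and are not measurements from the paper.

```python
def strong_scaling_efficiency(t_ref, n_ref, t_n, n):
    """Fixed total problem size: the ideal wall-clock time scales as 1/n."""
    return (t_ref * n_ref) / (t_n * n)

def weak_scaling_efficiency(t_ref, t_n):
    """Fixed work per GPU: the ideal wall-clock time stays constant."""
    return t_ref / t_n

# Cell counts for the two grids mentioned in the text
print(256 ** 3)     # 16777216   (about 16.8 M cells)
print(1024 ** 3)    # 1073741824 (about 1.07 B cells)

# HYPOTHETICAL timings in seconds, not taken from the paper
t_8gpu, t_64gpu = 100.0, 13.5
print(round(strong_scaling_efficiency(t_8gpu, 8, t_64gpu, 64), 3))   # 0.926
print(round(weak_scaling_efficiency(100.0, 105.0), 3))               # 0.952
```

An efficiency above 0.9 in the strong‑scaling case means that going from 8 to 64 GPUs cuts the runtime by more than a factor of 7.2 out of the ideal 8.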

Practical Implications

Accelerated design cycles for aerospace, automotive, and bio‑inspired engineering where FSI is a bottleneck (e.g., wing morphing, propeller‑blade interaction).
Enabling real‑time or near‑real‑time analysis for virtual testing platforms, thanks to the order‑of‑magnitude speedup.
Cost‑effective high‑resolution CFD: Organizations can achieve billion‑cell simulations on a modest GPU cluster rather than a massive CPU farm, reducing both capital and energy expenditures.
Framework extensibility: Because the core is built on ViCar3D and uses standard GPU programming models, developers can integrate additional physics (heat transfer, multiphase flow) or couple with machine‑learning surrogates without rewriting the whole solver.
Educational value: The open‑source‑friendly approach (OpenACC + CUDA) offers a practical template for research groups looking to modernize legacy CFD codes for GPU architectures.

Limitations & Future Work

Memory footprint: Uniform Cartesian grids can become memory‑hungry for very high‑resolution simulations; adaptive mesh refinement is not yet supported.
Communication bottlenecks: At extreme node counts, NCCL overhead starts to dominate; further algorithmic redesign (e.g., communication‑avoiding solvers) could push scaling further.
Complex material models: The current implementation focuses on rigid or simple elastic bodies; richer structural dynamics (nonlinear hyperelasticity, fluid‑structure coupling with large deformations) remain to be integrated.
Portability beyond NVIDIA GPUs: While OpenACC offers some cross‑vendor support, the hand‑tuned CUDA kernels limit immediate use on AMD or Intel GPUs; future work could abstract these kernels via SYCL or Kokkos.
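
The memory‑footprint limitation can be put in rough numbers. This back‑of‑the‑envelope sketch assumes double‑precision storage and a hypothetical count of 20 stored arrays per cell; the actual number of fields in the solver is not given in the text, and halo cells are ignored.

```python
GIB = 1024 ** 3

def field_bytes(nx, ny, nz, bytes_per_value=8):
    """Bytes needed for one scalar field on a uniform nx x ny x nz grid."""
    return nx * ny * nz * bytes_per_value

one_field = field_bytes(1024, 1024, 1024)
print(one_field / GIB)            # 8.0 GiB per double-precision scalar field

n_fields = 20                     # HYPOTHETICAL number of stored arrays per cell
n_gpus = 64                       # GPU count quoted in the scaling results
total = n_fields * one_field
print(total / GIB)                # 160.0 GiB in total
print(total / GIB / n_gpus)       # 2.5 GiB per GPU, excluding halo cells
```

Because the cost grows with the cube of the resolution, doubling the grid in each direction multiplies these figures by eight, which is the motivation for adaptive mesh refinement.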

Overall, the paper demonstrates that with careful GPU‑centric design, even the most demanding fluid‑structure interaction problems can be tackled at scales previously reserved for massive CPU supercomputers. This opens the door for faster, more detailed simulations across a wide range of engineering and scientific domains.

Authors

Sushrut Kumar
Joshua Romero
Jung‑Hee Seo
Massimiliano Fatica
Rajat Mittal

Paper Information

arXiv ID: 2605.04335v1
Categories: physics.comp-ph, cs.DC, physics.flu-dyn
Published: May 5, 2026
