[Paper] Load Balanced Parallel Node Generation for Meshless Numerical Methods

Published: 3 days ago (February 18, 2026 at 05:32 AM EST)

4 min read

Source: arXiv

Source: arXiv - 2602.16347v1

Overview

The paper presents a new way to generate node distributions for mesh‑less numerical solvers—techniques that approximate PDEs without building a traditional mesh. By redesigning a classic Poisson‑disc sampling algorithm for modern multi‑core CPUs, the authors achieve high‑performance, load‑balanced node creation that scales across many threads while preserving the quality guarantees required for accurate simulations.

Key Contributions

Parallel Poisson‑disc sampling tailored for n‑dimensional domains with variable node density.
Hypertree‑based work distribution that pre‑partitions the domain into balanced leaf cells, enabling threads to claim work without heavy locking.
Collision‑free insertion using leaf‑level mutexes only, dramatically cutting synchronization overhead.
Comprehensive performance evaluation against prior parallel attempts, showing near‑linear speed‑up on shared‑memory systems.
Discussion of extensions to distributed‑memory environments (e.g., MPI clusters).

Methodology

Base algorithm – The authors start from an existing Poisson‑disc sampler that places points such that no two are closer than a prescribed radius, which can vary across the domain to encode adaptive resolution.
Spatial hypertree construction – Before sampling begins, a hierarchical spatial index (a k‑d tree‑like hypertree) is built according to the density function. Each leaf node roughly corresponds to the amount of work needed to fill that region with points.
Thread work allocation – Each worker thread advances its own “front” of candidate points. When it needs more work, it atomically claims an unprocessed leaf from the hypertree. Because leaves are non‑adjacent to already‑claimed leaves, threads never compete for the same spatial region.
Collision handling – When a thread proposes a new point, it only needs to lock the leaf that would contain the point to check for existing neighbours, avoiding global locks on the whole tree.
Dynamic load balancing – If a thread finishes its current leaf, it simply grabs the next available leaf, keeping all cores busy even when the density function is highly non‑uniform.

Results & Findings

Test case	Cores	Speed‑up vs. single‑thread	Parallel efficiency
2‑D domain, uniform density	1 → 32	29.8×	93%
3‑D domain, variable density	1 → 64	58.1×	91%
4‑D synthetic benchmark	1 → 128	115×	90%

Scalability: The algorithm maintains >90 % parallel efficiency up to the number of physical cores tested.
Quality preservation: Minimum distance constraints are met exactly, matching the sequential sampler’s output.
Lock contention: Mutex acquisition per insertion drops from O(N) in naïve parallel versions to <0.5 % of total runtime.
Comparison: Against a prior lock‑heavy parallel Poisson‑disc implementation, the new method is 3–5× faster for the same core count.

Practical Implications

Faster pre‑processing for mesh‑less solvers (e.g., Smoothed Particle Hydrodynamics, Radial Basis Function collocation) – node generation that previously took minutes can now be done in seconds on a workstation.
Enables adaptive refinement on‑the‑fly: because the hypertree reflects the density map, developers can dynamically adjust resolution without rebuilding the entire index.
Simplifies integration: the algorithm relies only on standard C++ threading primitives and a lightweight spatial index, making it easy to drop into existing simulation pipelines.
Path to distributed computing: the leaf‑level work partitioning naturally maps to domain decomposition strategies, opening the door for large‑scale cloud or HPC deployments.

Limitations & Future Work

Shared‑memory only: The current implementation assumes a single address space; extending it to multi‑node clusters will require explicit communication for leaf ownership and ghost‑point handling.
Memory overhead: Pre‑building the hypertree can consume significant RAM for very high‑dimensional or highly heterogeneous density fields.
GPU acceleration: The authors note that the lock‑free leaf claim mechanism could be adapted to GPU work‑queues, but this remains unexplored.
Dynamic density updates: While the method supports static density functions, handling runtime changes (e.g., moving boundaries) would need incremental hypertree updates.

Overall, the paper delivers a solid, developer‑friendly solution for one of the bottlenecks in mesh‑less simulation workflows, and it sets a clear roadmap for scaling the approach beyond a single multicore machine.

Authors

Jon Vehovar
Miha Rot
Matjaž Depolli
Gregor Kosec

Paper Information

arXiv ID: 2602.16347v1
Categories: cs.DC
Published: February 18, 2026
PDF: Download PDF

[Paper] Load Balanced Parallel Node Generation for Meshless Numerical Methods

Overview

Key Contributions

Methodology

Results & Findings

Practical Implications

Limitations & Future Work

Authors

Paper Information

Related posts

[Paper] Exploring Novel Data Storage Approaches for Large-Scale Numerical Weather Prediction

[Paper] TopoSZp: Lightweight Topology-Aware Error-controlled Compression for Scientific Data

[Paper] Informative Trains: A Memory-Efficient Journey to a Self-Stabilizing Leader Election Algorithm in Anonymous Graphs

[Paper] Do GPUs Really Need New Tabular File Formats?