[Paper] Matrix-PIC: Harnessing Matrix Outer-product for High-Performance Particle-in-Cell Simulations
Source: arXiv - 2601.08277v1
Overview
The paper introduces Matrix‑PIC, a novel way to speed up the particle‑in‑cell (PIC) method—one of the workhorses for plasma and accelerator simulations—by exploiting the Matrix Processing Units (MPUs) that are now being integrated into modern many‑core CPUs. By reshaping the core “deposition” step into a matrix‑centric formulation, the authors achieve substantial gains—up to 8.7× on the deposition kernel—over traditional CPU and even GPU implementations.
Key Contributions
- Block‑matrix deposition formulation that maps the particle‑to‑grid current accumulation directly onto MPU‑native outer‑product primitives.
- Hybrid MPU–VPU execution pipeline: MPUs handle dense matrix accumulation while VPUs (vector units) take care of data layout, particle sorting, and control flow.
- O(1) amortized incremental sorter based on a gapped packed‑memory array, preserving locality as particles move between cells without costly full re‑sorting.
- Comprehensive co‑design of algorithm, data structures, and hardware‑specific scheduling, demonstrating a holistic approach rather than a simple kernel tweak.
- Performance validation on a next‑generation HPC platform, showing up to 2.63× overall speed‑up and 8.7× acceleration of the deposition kernel compared to the best hand‑optimized vector implementation.
Methodology
Re‑thinking deposition as a matrix operation
- In classic PIC, each particle contributes a small stencil of current values to neighboring grid nodes, leading to many fine‑grained atomic updates.
- Matrix‑PIC groups particles into blocks and expresses the whole block’s contribution as a matrix outer product:
C = A × Bᵀ, where A holds particle weights and B encodes stencil coefficients.
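To make the reformulation concrete, here is a minimal NumPy sketch of the idea (shapes and names are illustrative, not the paper's actual kernel): for a block of P particles with an S‑point stencil, the many per‑particle scalar updates collapse into one dense matrix product that an MPU can execute at high throughput.

```python
import numpy as np

# Illustrative block deposition on a 1-D grid.
#   A[p]    = weight (e.g., charge * shape normalization) of particle p
#   B[p, s] = shape-function coefficient of particle p at stencil node s
# The block's total contribution to the S local grid nodes is then a
# single dense reduction, C = A @ B, instead of P*S scattered updates.

rng = np.random.default_rng(0)
P, S = 8, 4                        # particles per block, stencil width
A = rng.random(P)                  # per-particle weights
B = rng.random((P, S))
B /= B.sum(axis=1, keepdims=True)  # shape functions sum to 1 per particle

# Matrix formulation: one dense product per block (MPU-friendly).
C_matrix = A @ B                   # shape (S,): accumulated currents

# Reference scalar loop: the classic fine-grained accumulation.
C_loop = np.zeros(S)
for p in range(P):
    for s in range(S):
        C_loop[s] += A[p] * B[p, s]

assert np.allclose(C_matrix, C_loop)
```

Because each particle's shape coefficients sum to one, the total deposited charge equals the sum of the particle weights, which gives a quick conservation check.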
Hybrid execution model
- MPU stage: The outer‑product is dispatched to the MPU, which can compute dense matrix products at near‑peak throughput with minimal synchronization.
- VPU stage: Prior to MPU execution, VPUs rearrange particle data (e.g., gather positions, compute stencil indices) and after the MPU finishes, they scatter the accumulated matrix back into the global grid.
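The VPU→MPU→VPU hand‑off described above can be sketched as a gather / dense‑accumulate / scatter pipeline. The function and stencil below are hypothetical stand‑ins for the paper's kernels, using a first‑order (linear) shape function for brevity rather than the third‑order one benchmarked in the paper.

```python
import numpy as np

def deposit_block(grid, cell_base, positions, weights, stencil_fn):
    """Deposit one particle block onto a 1-D grid (illustrative only)."""
    # VPU stage 1 (gather): compute stencil coefficients per particle.
    B = np.stack([stencil_fn(x) for x in positions])   # (P, S)
    # MPU stage: one dense reduction replaces P*S atomic updates.
    C = weights @ B                                    # (S,)
    # VPU stage 2 (scatter): write accumulated currents back to the grid.
    grid[cell_base:cell_base + C.shape[0]] += C

def linear_stencil(x):
    # Linear shape function over two grid nodes; f is the fractional offset.
    f = x - np.floor(x)
    return np.array([1.0 - f, f])

grid = np.zeros(16)
positions = np.array([3.25, 3.75, 3.5])
weights = np.array([1.0, 1.0, 2.0])
deposit_block(grid, 3, positions, weights, linear_stencil)
assert np.isclose(grid.sum(), weights.sum())  # charge conserved
```

The key design point is that only the scatter at the end touches shared grid memory, so the dense middle stage needs no synchronization.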
Incremental sorting with a gapped packed‑memory array
- Particles are stored in a gap‑enabled array that allows O(1) amortized insert/delete as particles cross cell boundaries.
- This preserves spatial locality, ensuring that each MPU block works on a compact, cache‑friendly region of the grid.
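A toy sketch of the gapped‑array idea (this is an assumption‑laden illustration, not the paper's data structure): empty slots are interleaved with sorted elements, so placing a particle that has crossed a cell boundary usually means shifting a few neighbors toward the nearest gap rather than re‑sorting the whole array.

```python
# Minimal gapped (packed-memory-style) array. Elements stay in cell order
# with empty slots interleaved; an insert shifts occupied slots toward the
# nearest gap, which is amortized O(1) when gaps are evenly distributed.

EMPTY = None

class GappedArray:
    def __init__(self, capacity):
        self.slots = [EMPTY] * capacity

    def insert(self, index, value):
        """Place value at `index`, shifting toward the nearest gap."""
        # Locate the empty slot closest to the target index.
        gap = min((i for i, v in enumerate(self.slots) if v is EMPTY),
                  key=lambda i: abs(i - index))
        # Shift occupied slots one step at a time until the gap reaches
        # the target position.
        step = 1 if gap < index else -1
        while gap != index:
            self.slots[gap] = self.slots[gap + step]
            gap += step
        self.slots[index] = value

ga = GappedArray(8)
ga.insert(2, "p0")
ga.insert(4, "p1")
ga.insert(3, "p2")   # lands between p0 and p1 without a global re-sort
occupied = [v for v in ga.slots if v is not EMPTY]
assert occupied == ["p0", "p2", "p1"]
```

Real packed‑memory arrays also rebalance gap density when a region fills up; that machinery is omitted here for brevity.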
Implementation details
- The prototype runs on a CPU featuring a 16‑lane MPU and 512‑bit AVX‑512 VPU.
- Compiler intrinsics and a lightweight runtime scheduler orchestrate MPU/VPU hand‑offs without stalling the pipeline.
Results & Findings
| Benchmark | Baseline (CPU) | Hand‑optimized VPU | Matrix‑PIC (MPU+VPU) | Speed‑up vs. Baseline |
|---|---|---|---|---|
| LWFA total runtime | 1.00× | 1.45× | 2.63× | 2.63× |
| 3rd‑order deposition kernel | 1.00× | 2.0× | 8.7× | 8.7× |
| Achieved CPU peak | 30 % | 55 % | 83 % | — |
| Tuned CUDA (data‑center GPU) | — | — | 0.36× of GPU runtime (≈2.8× faster) | — |
- Peak utilization: Matrix‑PIC reaches 83 % of the theoretical CPU peak, a record for PIC on CPUs.
- GPU comparison: Even against a highly tuned CUDA implementation, Matrix‑PIC is ~2.8× faster, highlighting the advantage of leveraging MPUs for this workload.
Practical Implications
- Accelerator design teams can run larger, higher‑resolution laser‑wakefield or fusion simulations on commodity CPU clusters, reducing reliance on expensive GPU farms.
- Software libraries (e.g., WarpX, PIConGPU) could integrate a matrix‑centric deposition backend, offering a drop‑in performance boost for users on MPU‑enabled CPUs.
- Energy efficiency: MPUs consume less power per FLOP than GPUs for dense matrix work, potentially lowering the total cost of ownership for long‑running PIC campaigns.
- Portability: The hybrid pipeline abstracts the MPU as a “matrix accelerator,” making it feasible to map the same ideas to future heterogeneous architectures (e.g., AI‑focused tensor cores).
Limitations & Future Work
- Hardware dependence: The current implementation is tightly coupled to a specific MPU/VPU design; portability to other CPUs without MPUs will require fallback paths.
- Memory bandwidth: While the MPU handles compute efficiently, surrounding data movement (particle gather/scatter) can still become a bottleneck on systems with limited bandwidth.
- Higher‑order shapes: The paper focuses on third‑order deposition; extending the matrix formulation to even higher‑order shape functions may need more sophisticated stencil encoding.
- Scalability: Experiments were performed on a single node; scaling across multiple nodes (distributed memory) and handling load‑balancing of MPU work remain open challenges.
Overall, Matrix‑PIC demonstrates that re‑architecting classic scientific kernels around emerging matrix‑oriented hardware can unlock performance that rivals—or surpasses—GPU solutions, opening a new path for high‑performance plasma simulation on CPUs.
Authors
- Yizhuo Rao
- Xingjian Cui
- Jiabin Xie
- Shangzhi Pang
- Guangnan Feng
- Jinhui Wei
- Zhiguang Chen
- Yutong Lu
Paper Information
- arXiv ID: 2601.08277v1
- Categories: cs.DC
- Published: January 13, 2026