[Paper] Matrix-PIC: Harnessing Matrix Outer-product for High-Performance Particle-in-Cell Simulations
Source: arXiv - 2601.08277v1
Overview
The paper introduces Matrix‑PIC, a novel way to speed up the particle‑in‑cell (PIC) method—one of the workhorses for plasma and accelerator simulations—by exploiting the Matrix Processing Units (MPUs) that are now being integrated into modern many‑core CPUs. By reshaping the core “deposition” step into a matrix‑centric formulation, the authors achieve substantial gains—up to 8.7× on the deposition kernel—over traditional CPU and even GPU implementations.
Key Contributions
- Block‑matrix deposition formulation that maps the particle‑to‑grid current accumulation directly onto MPU‑native outer‑product primitives.
- Hybrid MPU–VPU execution pipeline: MPUs handle dense matrix accumulation while VPUs (vector units) take care of data layout, particle sorting, and control flow.
- O(1) amortized incremental sorter based on a gapped packed‑memory array, preserving locality as particles move between cells without costly full re‑sorting.
- Comprehensive co‑design of algorithm, data structures, and hardware‑specific scheduling, demonstrating a holistic approach rather than a simple kernel tweak.
- Performance validation on a next‑generation HPC platform, showing up to 2.63× overall speed‑up and 8.7× acceleration of the deposition kernel compared to the best hand‑optimized vector implementation.
Methodology
Re‑thinking deposition as a matrix operation
- In classic PIC, each particle contributes a small stencil of current values to neighboring grid nodes, leading to many fine‑grained atomic updates.
- Matrix‑PIC groups particles into blocks and expresses the whole block’s contribution as a matrix outer product:
C = A × Bᵀ, where A holds particle weights and B encodes stencil coefficients.
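To make the reformulation concrete, here is a minimal NumPy sketch of the idea (shapes and names are illustrative, not the paper's actual kernel): for a block of P particles with an S‑point stencil, the many per‑particle scalar updates collapse into one dense matrix product that an MPU can execute at high throughput.

```python
import numpy as np

# Illustrative block deposition on a 1-D grid.
#   A[p]    = weight (e.g., charge * shape normalization) of particle p
#   B[p, s] = shape-function coefficient of particle p at stencil node s
# The block's total contribution to the S local grid nodes is then a
# single dense reduction, C = A @ B, instead of P*S scattered updates.

rng = np.random.default_rng(0)
P, S = 8, 4                        # particles per block, stencil width
A = rng.random(P)                  # per-particle weights
B = rng.random((P, S))
B /= B.sum(axis=1, keepdims=True)  # shape functions sum to 1 per particle

# Matrix formulation: one dense product per block (MPU-friendly).
C_matrix = A @ B                   # shape (S,): accumulated currents

# Reference scalar loop: the classic fine-grained accumulation.
C_loop = np.zeros(S)
for p in range(P):
    for s in range(S):
        C_loop[s] += A[p] * B[p, s]

assert np.allclose(C_matrix, C_loop)
```

Because each particle's shape coefficients sum to one, the total deposited charge equals the sum of the particle weights, which gives a quick conservation check.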
Hybrid execution model
- MPU stage: The outer‑product is dispatched to the MPU, which can compute dense matrix products at near‑peak throughput with minimal synchronization.
- VPU stage: Prior to MPU execution, VPUs rearrange particle data (e.g., gather positions, compute stencil indices) and after the MPU finishes, they scatter the accumulated matrix back into the global grid.
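The VPU→MPU→VPU hand‑off described above can be sketched as a gather / dense‑accumulate / scatter pipeline. The function and stencil below are hypothetical stand‑ins for the paper's kernels, using a first‑order (linear) shape function for brevity rather than the third‑order one benchmarked in the paper.

```python
import numpy as np

def deposit_block(grid, cell_base, positions, weights, stencil_fn):
    """Deposit one particle block onto a 1-D grid (illustrative only)."""
    # VPU stage 1 (gather): compute stencil coefficients per particle.
    B = np.stack([stencil_fn(x) for x in positions])   # (P, S)
    # MPU stage: one dense reduction replaces P*S atomic updates.
    C = weights @ B                                    # (S,)
    # VPU stage 2 (scatter): write accumulated currents back to the grid.
    grid[cell_base:cell_base + C.shape[0]] += C

def linear_stencil(x):
    # Linear shape function over two grid nodes; f is the fractional offset.
    f = x - np.floor(x)
    return np.array([1.0 - f, f])

grid = np.zeros(16)
positions = np.array([3.25, 3.75, 3.5])
weights = np.array([1.0, 1.0, 2.0])
deposit_block(grid, 3, positions, weights, linear_stencil)
assert np.isclose(grid.sum(), weights.sum())  # charge conserved
```

The key design point is that only the scatter at the end touches shared grid memory, so the dense middle stage needs no synchronization.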
Incremental sorting with a gapped packed‑memory array
- Particles are stored in a gap‑enabled array that allows O(1) amortized insert/delete as particles cross cell boundaries.
- This preserves spatial locality, ensuring that each MPU block works on a compact, cache‑friendly region of the grid.
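A toy sketch of the gapped‑array idea (this is an assumption‑laden illustration, not the paper's data structure): empty slots are interleaved with sorted elements, so placing a particle that has crossed a cell boundary usually means shifting a few neighbors toward the nearest gap rather than re‑sorting the whole array.

```python
# Minimal gapped (packed-memory-style) array. Elements stay in cell order
# with empty slots interleaved; an insert shifts occupied slots toward the
# nearest gap, which is amortized O(1) when gaps are evenly distributed.

EMPTY = None

class GappedArray:
    def __init__(self, capacity):
        self.slots = [EMPTY] * capacity

    def insert(self, index, value):
        """Place value at `index`, shifting toward the nearest gap."""
        # Locate the empty slot closest to the target index.
        gap = min((i for i, v in enumerate(self.slots) if v is EMPTY),
                  key=lambda i: abs(i - index))
        # Shift occupied slots one step at a time until the gap reaches
        # the target position.
        step = 1 if gap < index else -1
        while gap != index:
            self.slots[gap] = self.slots[gap + step]
            gap += step
        self.slots[index] = value

ga = GappedArray(8)
ga.insert(2, "p0")
ga.insert(4, "p1")
ga.insert(3, "p2")   # lands between p0 and p1 without a global re-sort
occupied = [v for v in ga.slots if v is not EMPTY]
assert occupied == ["p0", "p2", "p1"]
```

Real packed‑memory arrays also rebalance gap density when a region fills up; that machinery is omitted here for brevity.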
Implementation details
- The prototype runs on a CPU featuring a 16‑lane MPU and 512‑bit AVX‑512 VPU.
- Compiler intrinsics and a lightweight runtime scheduler orchestrate MPU/VPU hand‑offs without stalling the pipeline.
Results & Findings
| Benchmark | Baseline (CPU) | Hand‑optimized VPU | Matrix‑PIC (MPU+VPU) | Speed‑up vs. Baseline |
|---|---|---|---|---|
| LWFA total runtime | 1.00× | 1.45× | 2.63× | 2.63× |
| 3rd‑order deposition kernel | 1.00× | 2.0× | 8.7× | 8.7× |
| Achieved CPU peak | 30 % | 55 % | 83 % | — |
| Tuned CUDA (data‑center GPU) | — | — | 0.36× of GPU runtime (≈2.8× faster) | — |
- Peak utilization: Matrix‑PIC reaches 83 % of the theoretical CPU peak, a record for PIC on CPUs.
- GPU comparison: Even against a highly tuned CUDA implementation, Matrix‑PIC is ~2.8× faster, highlighting the advantage of leveraging MPUs for this workload.
Practical Implications
- Accelerator design teams can run larger, higher‑resolution laser‑wakefield or fusion simulations on commodity CPU clusters, reducing reliance on expensive GPU farms.
- Software libraries (e.g., WarpX, PIConGPU) could integrate a matrix‑centric deposition backend, offering a drop‑in performance boost for users on MPU‑enabled CPUs.
- Energy efficiency: MPUs consume less power per FLOP than GPUs for dense matrix work, potentially lowering the total cost of ownership for long‑running PIC campaigns.
- Portability: The hybrid pipeline abstracts the MPU as a “matrix accelerator,” making it feasible to map the same ideas to future heterogeneous architectures (e.g., AI‑focused tensor cores).
Limitations & Future Work
- Hardware dependence: The current implementation is tightly coupled to a specific MPU/VPU design; portability to other CPUs without MPUs will require fallback paths.
- Memory bandwidth: While the MPU handles compute efficiently, surrounding data movement (particle gather/scatter) can still become a bottleneck on systems with limited bandwidth.
- Higher‑order shapes: The paper focuses on third‑order deposition; extending the matrix formulation to even higher‑order shape functions may need more sophisticated stencil encoding.
- Scalability: Experiments were performed on a single node; scaling across multiple nodes (distributed memory) and handling load‑balancing of MPU work remain open challenges.
Overall, Matrix‑PIC demonstrates that re‑architecting classic scientific kernels around emerging matrix‑oriented hardware can unlock performance that rivals—or surpasses—GPU solutions, opening a new path for high‑performance plasma simulation on CPUs.
Authors
- Yizhuo Rao
- Xingjian Cui
- Jiabin Xie
- Shangzhi Pang
- Guangnan Feng
- Jinhui Wei
- Zhiguang Chen
- Yutong Lu
Paper Information
- arXiv ID: 2601.08277v1
- Categories: cs.DC
- Published: January 13, 2026