[Paper] Shifting the Sweet Spot: High-Performance Matrix-Free Method for High-Order Elasticity
Source: arXiv - 2601.08374v1
Overview
High‑order finite‑element (FE) simulations for elasticity are notoriously memory‑hungry when the stiffness matrix is assembled explicitly. Matrix‑free (or partial assembly, PA) techniques avoid this bottleneck, but current PA implementations still hit their performance “sweet spot” at very low polynomial orders (p ≈ 2), leaving the promised speed‑ups of high‑order methods largely untapped. This paper presents a set of low‑level, architecture‑aware optimizations that push the sweet spot up to p ≥ 6, delivering order‑of‑magnitude gains on commodity x86 and ARM CPUs.
Key Contributions
- Tensor‑factorized PA kernel: Replaces the generic O(p⁶) algorithm with an O(p⁴) form that exploits the tensor‑product structure of high‑order elements.
- Voigt‑symmetry exploitation: Removes redundant arithmetic specific to elasticity, cutting the operation count further.
- Macro‑kernel fusion: Merges multiple small kernels into a single, cache‑friendly macro‑kernel, dramatically improving data locality and alleviating memory‑bandwidth limits.
- Deep integration with MFEM & Geometric Multigrid (GMG): The optimized operator is plugged into the widely‑used MFEM library and works seamlessly with a GMG preconditioner, preserving solver robustness.
- Comprehensive performance evaluation: Benchmarks on modern x86 (Intel Xeon, AMD Zen) and ARM (Neoverse N1) CPUs show 7×–83× kernel speed‑ups and 3.6×–16.8× end‑to‑end solution acceleration over the MFEM baseline.
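To make the first contribution concrete, here is a minimal sum‑factorization sketch. This is illustrative toy code, not the paper's kernel: it interpolates nodal values on one hexahedral element to quadrature points by applying a 1‑D basis matrix `B` along one axis at a time, which is exactly the structural trick that lowers the per‑element cost from O(p⁶) (one dense 3‑D matrix) to O(p⁴) (three 1‑D contractions).

```python
# Toy sum-factorization sketch (illustrative, not the paper's kernel).
# Applying the full 3-D interpolation as one dense (q^3 x n^3) matrix
# costs O(p^6) per element; applying the 1-D basis matrix B (q x n)
# along one axis at a time costs O(p^4).

def contract_first_axis(B, u):
    """v[a][j][k] = sum_i B[a][i] * u[i][j][k]."""
    n, nj, nk, q = len(u), len(u[0]), len(u[0][0]), len(B)
    return [[[sum(B[a][i] * u[i][j][k] for i in range(n))
              for k in range(nk)] for j in range(nj)] for a in range(q)]

def rotate_axes(u):
    """Move axis 0 to the back: u[i][j][k] -> w[j][k][i]."""
    ni, nj, nk = len(u), len(u[0]), len(u[0][0])
    return [[[u[i][j][k] for i in range(ni)]
             for k in range(nk)] for j in range(nj)]

def interp_to_quadrature(B, u):
    """Sum factorization: contract B against each of the three axes in turn."""
    for _ in range(3):
        u = rotate_axes(contract_first_axis(B, u))
    return u

# Sanity check with an identity "basis": the interpolation must copy u,
# which verifies the axis-rotation bookkeeping.
B = [[1.0, 0.0], [0.0, 1.0]]
u = [[[float(4 * i + 2 * j + k) for k in range(2)]
      for j in range(2)] for i in range(2)]
v = interp_to_quadrature(B, u)
```

In a real PA kernel, `B` would hold 1‑D shape‑function values at the quadrature points; the same three‑pass structure is reused for gradients and for the transpose (quadrature‑to‑dof) step.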
Methodology
- Identify the bottleneck – Profiling the existing PA implementation in MFEM revealed that the dominant cost stems from the naïve O(p⁶) loop over element degrees of freedom, which quickly becomes memory‑bound.
- Tensor factorization – Rewriting the element‑wise operator as a sequence of 1‑D tensor contractions (leveraging the tensor‑product basis) reduces the computational complexity to O(p⁴), matching the theoretical optimum for hexahedral elements.
- Elasticity‑specific simplifications – The fourth‑order stiffness tensor of linear elasticity has minor and major symmetries, so in Voigt notation it collapses to a symmetric 6×6 matrix with many coinciding entries. Pruning the duplicated FLOPs reduces per‑element work without sacrificing accuracy.
- Macro‑kernel design – Instead of launching many tiny kernels (one per element or per contraction), the approach fuses them into a larger “macro‑kernel” that processes blocks of elements in a single pass. This improves cache reuse and reduces kernel launch overhead, turning a memory‑bandwidth bound problem into a compute‑bound one on modern CPUs.
- Integration & validation – The optimized PA operator is inserted into MFEM’s existing GMG preconditioner pipeline. Numerical correctness is verified (same error norms as the baseline), followed by strong‑ and weak‑scaling tests on representative elasticity problems (e.g., cantilever beam, 3‑D block).
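The Voigt‑symmetry saving in the third step can be illustrated with a small sketch (an isotropic material is assumed here for simplicity; `lam`, `mu`, and the function names are mine, not the paper's): the generic 6×6 stress update costs 36 multiplies, while the isotropic closed form needs only a handful.

```python
# Illustrative sketch of the Voigt-symmetry saving (isotropic material
# assumed; names are mine, not the paper's). In Voigt notation the
# stress update is sigma = C @ eps with a symmetric 6x6 matrix C; for an
# isotropic solid C holds only two independent constants (lam, mu), so
# the 36 generic multiplies collapse to a short closed form.

def stress_generic(C, eps):
    """Dense 6x6 matrix-vector product: 36 multiplies."""
    return [sum(C[i][j] * eps[j] for j in range(6)) for i in range(6)]

def stress_isotropic(lam, mu, eps):
    """Same result via the two Lame constants: about 10 multiplies."""
    trace = eps[0] + eps[1] + eps[2]
    normal = [lam * trace + 2.0 * mu * eps[i] for i in range(3)]
    shear = [mu * eps[i] for i in range(3, 6)]  # engineering shear strains
    return normal + shear

# Build the isotropic Voigt matrix explicitly and compare both paths.
lam, mu = 1.2, 0.8
C = [[lam + 2.0 * mu if i == j else lam for j in range(3)] + [0.0] * 3
     for i in range(3)]
C += [[0.0] * 6 for _ in range(3)]
for i in range(3):
    C[3 + i][3 + i] = mu
eps = [0.1, -0.2, 0.3, 0.05, 0.0, -0.1]
sig_a = stress_generic(C, eps)
sig_b = stress_isotropic(lam, mu, eps)
```

For a general anisotropic material the same idea still applies, just with 21 independent entries instead of two: only the distinct constants are multiplied, and shared products are reused.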
Results & Findings
| Platform | Polynomial order (p) | Kernel speed‑up vs. MFEM baseline | End‑to‑end solver speed‑up |
|---|---|---|---|
| Intel Xeon 8175M | 6 | 12× | 5.4× |
| AMD EPYC 7742 | 8 | 23× | 9.1× |
| ARM Neoverse N1 | 10 | 83× | 16.8× |
| … (additional configs) | … | … | … |
- Sweet‑spot shift: The original MFEM PA kernel peaked at p = 2, whereas the optimized version continues to improve up to p = 10 (and beyond), confirming the theoretical benefit of high‑order discretizations.
- Memory bandwidth reduction: Roofline analysis shows the optimized kernel moves from the bandwidth‑limited region to the compute‑limited region on all tested CPUs.
- Scalability: Strong‑scaling experiments (up to 64 cores) retain > 80 % parallel efficiency, indicating that the macro‑kernel does not introduce contention.
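The roofline observation above can be made concrete with a back‑of‑the‑envelope model. All numbers below are illustrative assumptions, not measurements from the paper: a kernel is memory‑bound when its arithmetic intensity (flops per byte of memory traffic) falls below the machine balance, peak flops divided by peak bandwidth.

```python
# Back-of-the-envelope roofline model (all numbers are illustrative
# assumptions, not measurements from the paper). A kernel is
# compute-bound when its arithmetic intensity exceeds the machine
# balance peak_gflops / peak_gbps.

def arithmetic_intensity(flops, bytes_moved):
    return flops / bytes_moved

def is_compute_bound(flops, bytes_moved, peak_gflops, peak_gbps):
    return arithmetic_intensity(flops, bytes_moved) > peak_gflops / peak_gbps

peak_gflops, peak_gbps = 2000.0, 200.0  # assumed CPU: balance = 10 flops/byte

n = 7**3 * 3                 # dofs of one p = 6 hex element, 3 components
# Assembled element matvec: the n x n matrix itself must be streamed.
assembled_flops = 2 * n * n
assembled_bytes = (n * n + 2 * n) * 8    # matrix + in/out vectors, doubles
# Matrix-free factorized kernel: traffic is only the vectors; the count
# 200 * n is a made-up stand-in for the O(p^4) contraction work.
factored_flops = 200 * n
factored_bytes = 2 * n * 8
```

Qualitatively this mirrors the paper's finding: streaming an assembled matrix pins the kernel at roughly 0.25 flops/byte, deep in the bandwidth‑limited region, while the matrix‑free kernel's far higher flop‑to‑byte ratio lets it cross the machine balance into the compute‑limited region.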
Practical Implications
- Faster high‑order simulations – Engineers can now run large‑scale elasticity analyses (e.g., automotive crash, aerospace structural optimization) with p ≥ 6 without prohibitive runtimes, unlocking the higher accuracy per degree of freedom that high‑order methods promise.
- Reduced hardware costs – By squeezing more performance out of existing CPUs, organizations can defer or avoid expensive GPU or accelerator purchases for many elasticity workloads.
- Ease of adoption – Because the work is contributed back to MFEM, a popular open‑source FE library, developers can drop in the optimized operator with minimal code changes.
- Template for other PDEs – The same tensor‑factorization + macro‑kernel fusion strategy can be applied to matrix‑free implementations of other tensor‑product‑based discretizations (e.g., fluid dynamics, electromagnetics).
Limitations & Future Work
- Element type restriction – Optimizations target hexahedral (tensor‑product) elements; extending to simplicial meshes would require a different factorization strategy.
- GPU/accelerator support – The paper focuses on CPUs; porting the macro‑kernel fusion approach to CUDA/ROCm remains an open challenge.
- Non‑linear elasticity & large deformations – The study assumes linear elasticity; handling material or geometric non‑linearity may introduce data dependencies that diminish current gains.
- Automatic tuning – The current implementation uses hand‑tuned block sizes; integrating an auto‑tuner (e.g., using LLVM or ATLAS) could make the approach portable across even more diverse architectures.
Bottom line: By rethinking the low‑level implementation of matrix‑free elasticity operators, this work demonstrates that high‑order finite‑element methods can finally deliver their promised performance on everyday CPUs, opening the door for more accurate and faster simulations in industry and research.
Authors
- Dali Chang
- Chong Zhang
- Kaiqi Zhang
- Mingguan Yang
- Huiyuan Li
- Weiqiang Kong
Paper Information
- arXiv ID: 2601.08374v1
- Categories: cs.DC, cs.PF
- Published: January 13, 2026