[Paper] Learning-Augmented Performance Model for Tensor Product Factorization in High-Order FEM

Published: January 11, 2026 at 07:20 AM EST
4 min read
Source: arXiv - 2601.06886v1

Overview

This paper tackles a long‑standing pain point for developers of high‑order finite element method (FEM) solvers: predicting the runtime of the tensor‑product factorization kernels that dominate the compute cost. Traditional performance models (Roofline, ECM) assume memory‑bandwidth limits, which breaks down for arithmetic‑heavy kernels on modern CPUs such as the Fujitsu A64FX or Intel Xeon. The authors propose a learning‑augmented analytical model that combines a dependency‑chain analysis of the loop‑splitting strategy with a lightweight XGBoost predictor, delivering order‑of‑magnitude more accurate runtime estimates.

Key Contributions

  • Dependency‑chain analytical formulation that maps loop‑splitting configurations of the tensor n-mode product to instruction‑level dependencies and critical path length.
  • Hybrid learning‑augmented model: uses the analytical formulation for the structural part and XGBoost to infer hard‑to‑model parameters (e.g., SIMD latency, micro‑architectural effects).
  • Comprehensive evaluation on two very different architectures (Fujitsu A64FX and Intel Xeon Gold 6230) across polynomial orders P = 1–15, showing MAPE as low as 1 % and consistently outperforming Roofline and ECM.
  • Open‑source implementation (released with the paper) that can be plugged into existing build‑time autotuning pipelines.

Methodology

  1. Kernel Characterization – The authors start from the tensor‑product factorization kernel used in sum‑factorization for high‑order FEM. The kernel consists of a series of nested loops whose body can be split in multiple ways (e.g., splitting the innermost loop to expose more SIMD parallelism).
  2. Dependency‑Chain Model – By constructing a directed acyclic graph of instruction dependencies for each splitting configuration, they derive an analytical expression for the critical path length (the minimal number of cycles assuming perfect pipelining). This captures the impact of SIMD latency and instruction‑level parallelism, which are invisible to bandwidth‑centric models.
  3. Parameter Estimation via XGBoost – Certain constants in the analytical expression (e.g., effective latency of fused‑multiply‑add, cache‑miss penalties that depend on data layout) are difficult to model analytically. The authors train a small XGBoost regressor on a curated set of micro‑benchmarks (different P, thread counts, and split factors) to predict these parameters.
  4. Model Integration – The final runtime estimate is the sum of the analytically derived critical‑path cycles multiplied by the learned parameters, scaled by the clock frequency.
  5. Validation – They compare predictions against wall‑clock measurements for a suite of polynomial orders and splitting strategies on both target CPUs.
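The pipeline in steps 2–4 can be sketched in a few lines. This is a minimal illustration, not the authors' released implementation: the toy dependency DAG, the latency values, and the way they stand in for the XGBoost-predicted parameters are all hypothetical, chosen only to show how a critical-path estimate combines with learned constants and the clock frequency.

```python
# Sketch of the learning-augmented runtime model (steps 2-4 above).
# The DAG structure, latencies, and iteration count are illustrative only;
# in the paper, an XGBoost regressor supplies the latency constants.

def critical_path_cycles(dag, latency):
    """Longest path through an instruction-dependency DAG.

    dag: {node: [predecessor nodes]}; latency: {node: effective cycles}.
    """
    finish = {}

    def finish_time(node):
        if node not in finish:
            preds = dag.get(node, [])
            start = max((finish_time(p) for p in preds), default=0.0)
            finish[node] = start + latency[node]
        return finish[node]

    return max(finish_time(n) for n in dag)

# Toy dependency chain for one split loop iteration: two independent
# loads feed a chain of fused multiply-adds (FMAs), then a store.
dag = {
    "load_a": [],
    "load_b": [],
    "fma1": ["load_a", "load_b"],
    "fma2": ["fma1"],
    "store": ["fma2"],
}

# In the paper these effective latencies are *learned* per architecture;
# the numbers below are made up for illustration.
learned_latency = {"load_a": 4, "load_b": 4, "fma1": 9, "fma2": 9, "store": 1}

cycles = critical_path_cycles(dag, learned_latency)   # minimal cycles per iteration
clock_ghz = 1.8        # e.g. an A64FX-like base clock (assumed)
n_iterations = 1000    # iterations of the split loop body (assumed)
runtime_s = cycles * n_iterations / (clock_ghz * 1e9)
print(cycles, runtime_s)
```

A different loop-splitting configuration yields a different DAG (e.g., two independent FMA chains instead of one), which shortens the critical path even though the instruction count is unchanged; that is exactly the effect bandwidth-centric models cannot see.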

Results & Findings

| Processor | Polynomial Order P | MAPE (Learning-augmented) | MAPE (Roofline) | MAPE (ECM) |
|---|---|---|---|---|
| Fujitsu A64FX | 1–15 | 1 % – 24 % | 42 % – 256 % | 5 % – 117 % |
| Intel Xeon Gold 6230 | 1–15 | 1 % – 13 % (up to 24 % at P = 15) | 1 % – 73 % | 8 % – 112 % |
  • The learning‑augmented model consistently tracks the measured runtime within a few percent, even for the most compute‑intensive configurations (P = 15).
  • Roofline dramatically overestimates runtime for A64FX because it assumes a memory‑bound regime that never materializes for these kernels.
  • ECM improves over Roofline but still fails to capture the latency‑dominated critical path introduced by aggressive loop splitting.

Practical Implications

  • Autotuning Made Faster – Developers can now evaluate dozens of loop‑splitting configurations without running each variant on the target machine, dramatically shrinking the search space for performance‑critical kernels.
  • Portability Across Architectures – Because the model learns micro‑architectural parameters, the same analytical backbone can be reused on new CPUs (e.g., upcoming ARM‑based HPC nodes) with only a small calibration run.
  • Compiler‑Assisted Optimization – The dependency‑chain analysis can be integrated into compiler passes (e.g., LLVM’s loop‑vectorizer) to guide SIMD width selection and unroll factors for tensor‑product kernels.
  • Predictive Scheduling – HPC job schedulers could use the model to estimate node‑level runtime for FEM workloads, improving queue‑time predictions and resource allocation.
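To make the autotuning use case concrete, the sketch below ranks candidate loop-splitting configurations by predicted runtime instead of benchmarking each one. The search space and the `predict_runtime` cost function are hypothetical placeholders, not the paper's API; any calibrated model with the same signature could be dropped in.

```python
# Sketch of model-driven autotuning: pick the best loop-splitting
# configuration by *predicted* runtime rather than measuring each
# variant on the target machine. The cost model here is a placeholder.

from itertools import product

def predict_runtime(p_order, split, unroll):
    """Stand-in for the learning-augmented model: predicted seconds
    for one kernel invocation. Purely illustrative numbers."""
    work = (p_order + 1) ** 3        # tensor-product work per element
    ilp = min(split * unroll, 8)     # exposed instruction-level parallelism, capped
    return work / (ilp * 1.8e9)      # more ILP -> shorter critical path

p_order = 7
candidates = product([1, 2, 4], [1, 2, 4])   # (split factor, unroll factor)
best = min(candidates, key=lambda c: predict_runtime(p_order, *c))
print(best)
```

Because each evaluation is a cheap function call, the whole configuration space can be scored in milliseconds, and only the predicted winner (or a short shortlist) needs a confirming run on the hardware.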

Limitations & Future Work

  • Training Overhead – The XGBoost component requires a modest set of benchmark runs per architecture; completely unseen CPUs would still need a calibration phase.
  • Scope Limited to Tensor n-Mode Product – While the methodology is general, the current implementation only covers sum‑factorization kernels; extending to other high‑order operators (e.g., matrix‑free preconditioners) remains future work.
  • Static Analysis Assumptions – The analytical model assumes a fixed thread count and neglects dynamic effects such as OS jitter or NUMA contention, which could degrade accuracy on heavily loaded systems.
  • Potential for Deep Learning – The authors suggest exploring richer neural models to capture non‑linear interactions between loop‑splitting parameters and hardware counters, possibly reducing the need for handcrafted analytical terms.

Authors

  • Xuanzhengbo Ren
  • Yuta Kawai
  • Tetsuya Hoshino
  • Hirofumi Tomita
  • Takahiro Katagiri
  • Daichi Mukunoki
  • Seiya Nishizawa

Paper Information

  • arXiv ID: 2601.06886v1
  • Categories: cs.DC, cs.PF
  • Published: January 11, 2026