[Paper] Performance-Portable Optimization and Analysis of Multiple Right-Hand Sides in a Lattice QCD Solver
Source: arXiv - 2601.05816v1
Overview
Iterative solvers for sparse linear systems, especially those powering Lattice Quantum Chromodynamics (QCD) simulations, are notorious for their massive compute and memory-bandwidth demands. The authors extend the state-of-the-art DD-αAMG solver to handle multiple right-hand sides (rhs) efficiently on both x86 and Arm clusters without sacrificing portability. Their work demonstrates that careful data-layout redesign and SIMD-friendly interfaces can unlock sizable speedups, including on emerging instruction-set extensions such as Arm's SME.
Key Contributions
- Multi‑rhs extension of the DD‑αAMG Lattice QCD solver for both Wilson‑Dirac operator evaluation and the GMRES outer solver, supporting odd‑even preconditioning.
- Flexible data‑layout abstraction that lets developers experiment with different memory organizations while keeping the codebase portable.
- SIMD‑optimized layout specifically crafted for modern vector units, improving auto‑vectorization on x86 AVX‑512 and Arm SVE/SME.
- Comprehensive performance analysis across x86 and Arm platforms, exposing how compiler heuristics and architectural quirks affect achievable speedups.
- Early exploration of Arm’s Scalable Matrix Extension (SME), providing a first look at how matrix‑wide instructions can further accelerate QCD kernels.
Methodology
- Algorithmic Refactoring – The authors re-engineered the DD-αAMG pipeline to process several rhs vectors simultaneously, which required redesigning the Wilson-Dirac stencil application and the GMRES restart logic to operate on batches rather than single vectors (a batched kernel sketch follows this list).
- Data-Layout Interface – A thin abstraction layer lets the code switch between the traditional "structure-of-arrays" (SoA) layout and a new "rhs-blocked" layout in which the rhs index is the innermost, unit-stride dimension. The blocked layout matches the natural width of SIMD registers, making auto-vectorization straightforward for the compiler (see the layout sketch after this list).
- Platform-Specific Tuning – On x86, the code was compiled with AVX-512 enabled; on Arm, the same source was built with SVE/SME flags. No hand-written assembly was required; the gains stem from the data layout and compiler-driven vectorization.
- Benchmark Suite – The authors measured wall-clock time, memory bandwidth, and FLOP rate on representative lattice sizes (e.g., \(64^3 \times 128\)) in both single-node and multi-node configurations.
- SME Prototyping – A small set of kernels was rewritten to use Arm's SME matrix instructions, allowing a side-by-side comparison with the baseline SIMD implementation.
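To make the layout discussion concrete, here is a minimal C sketch of a switchable indexing scheme. The function names, the `NCOLSPIN` constant, and the compile-time switch are our illustration of the idea, not the paper's actual interface.

```c
/* Minimal sketch of a layout-switchable index function, assuming a complex
 * spinor field with `nsites` lattice sites, NCOLSPIN = 12 internal
 * (color x spin) components per site, and `nrhs` right-hand sides.
 * Names and the compile-time switch are illustrative, not the paper's API. */
#include <stddef.h>

#define NCOLSPIN 12  /* 3 colors x 4 spins for a Wilson spinor */

/* Vector-major layout: each rhs vector is stored contiguously, as when the
 * original single-rhs code is simply applied vector by vector. */
static inline size_t idx_vector_major(size_t site, size_t comp, size_t r,
                                      size_t nsites, size_t nrhs)
{
    (void)nrhs;
    return (r * nsites + site) * NCOLSPIN + comp;
}

/* rhs-blocked layout: the rhs index is the innermost, unit-stride dimension,
 * so the nrhs values of one component are contiguous and map directly onto
 * SIMD registers. */
static inline size_t idx_rhs_blocked(size_t site, size_t comp, size_t r,
                                     size_t nsites, size_t nrhs)
{
    (void)nsites;
    return (site * NCOLSPIN + comp) * nrhs + r;
}

/* The rest of the solver indexes fields only through IDX, so switching the
 * memory organization is a one-line, compile-time change. */
#ifdef USE_RHS_BLOCKED
#  define IDX idx_rhs_blocked
#else
#  define IDX idx_vector_major
#endif
```

With an interface of this kind, the numerical kernels never commit to a particular memory organization, which is what makes layout experiments cheap.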
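The following hedged sketch shows what a batched, site-local kernel can look like under the rhs-blocked layout: the gauge link is loaded once and applied to every rhs, and the innermost loop runs over the rhs index with unit stride. The function name, `NRHS`, and the suggested build flags are assumptions for illustration; the real Wilson-Dirac stencil also couples neighboring sites and spin components.

```c
/* Sketch (not the authors' kernel): apply one 3x3 gauge link U to the color
 * vectors of all NRHS right-hand sides at a site.  Built with, e.g., -O3 plus
 * the target's vector flags (AVX-512 or SVE), the r-loop is a unit-stride
 * candidate for auto-vectorization. */
#include <complex.h>

#define NRHS 8  /* batch size; in practice a multiple of the SIMD width */

/* in[c][r], out[c][r]: color component c of rhs r, rhs index innermost. */
void apply_link_multi_rhs(const double complex U[3][3],
                          const double complex in[3][NRHS],
                          double complex out[3][NRHS])
{
    for (int c = 0; c < 3; ++c) {
        /* hints of this kind were reportedly needed on some compilers
         * (see Results & Findings); requires -fopenmp or -fopenmp-simd */
        #pragma omp simd
        for (int r = 0; r < NRHS; ++r) {
            out[c][r] = U[c][0] * in[0][r]
                      + U[c][1] * in[1][r]
                      + U[c][2] * in[2][r];
        }
    }
}
```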
Results & Findings
| Platform | Baseline (single‑rhs) | Multi‑rhs (optimized) | Speedup |
|---|---|---|---|
| Intel Xeon (AVX‑512) | 1.00 × | 1.78 × | +78 % |
| AMD EPYC (AVX2) | 1.00 × | 1.62 × | +62 % |
| Arm Neoverse (SVE) | 1.00 × | 1.71 × | +71 % |
| Arm Neoverse (SME prototype) | 1.00 × | 2.03 × | +103 % |
- Memory traffic dropped by roughly 30 % because the rhs-blocked layout reuses the loaded gauge fields across all rhs vectors in a batch (a rough amortization model follows this list).
- Cache reuse improved dramatically; L2 hit rates rose from ~70 % to >85 % on both architectures.
- Compiler behavior turned out to be a major factor: with some toolchains the auto-vectorizer exploited the new layout fully only when explicit pragmas or alignment hints were added (see the alignment sketch below).
- SME showed promise: even a modest hand-tuned prototype kernel roughly doubled the baseline throughput, the largest speedup measured, hinting at larger gains once the instruction set and its toolchain support mature.
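To see why batching reduces traffic, a back-of-the-envelope model (ours, not taken from the paper) helps: if a sweep over the lattice reads \(B_U\) bytes of gauge-link data per site and moves \(B_\psi\) bytes of spinor data per site and per rhs, then one batched sweep over \(n\) rhs costs roughly \(B_U + n\,B_\psi\) bytes per site, i.e. \(B_U/n + B_\psi\) per rhs, whereas \(n\) separate single-rhs sweeps cost \(n\,(B_U + B_\psi)\). The gauge-field term is amortized over the whole batch, which is the reuse effect credited above for the observed traffic reduction.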
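As an illustration of the kind of hints involved, the snippet below combines a GCC/Clang alignment builtin with an OpenMP SIMD pragma on a hypothetical batched axpy. The paper does not spell out the exact pragmas used, so treat this purely as an example of the technique, not as DD-αAMG code.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical multi-rhs axpy: y += alpha * x over all sites and rhs at once.
 * The alignment promise matches buffers obtained from aligned_alloc(64, ...). */
void axpy_multi_rhs(size_t n, double complex alpha,
                    const double complex *restrict x,
                    double complex *restrict y)
{
    x = __builtin_assume_aligned(x, 64);  /* GCC/Clang builtin */
    y = __builtin_assume_aligned(y, 64);

    #pragma omp simd                      /* needs -fopenmp or -fopenmp-simd */
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```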
Practical Implications
- For HPC developers working on Lattice QCD or any domain that solves many linear systems with the same matrix (e.g., electromagnetic simulations, CFD), the multi‑rhs strategy can be adopted with minimal code changes thanks to the provided layout abstraction.
- Performance portability is demonstrated: the same source compiled with different SIMD flags yields comparable speedups, reducing the maintenance burden of separate codebases for x86 and Arm.
- Reduced energy consumption – fewer memory accesses per rhs translate into lower energy use per simulation, an important metric for large-scale supercomputing facilities.
- Future‑proofing – the early SME experiments suggest that code prepared for flexible data layouts will be ready to exploit upcoming matrix‑wide instructions without a complete rewrite.
Limitations & Future Work
- The study focuses on the Wilson‑Dirac operator; extending the approach to other fermion discretizations (e.g., staggered or domain‑wall) may require additional kernel redesign.
- The SME implementation is still prototype‑level; a full‑scale integration and compiler support are needed to assess real‑world gains.
- Compiler auto‑vectorization remains inconsistent across toolchains; the authors note that hand‑tuned intrinsics could close the remaining performance gap but at the cost of portability.
- Scaling beyond a few dozen nodes was not explored in depth; communication overhead for multi‑rhs batches could become a bottleneck in exascale runs.
Bottom line: By rethinking data layouts and embracing SIMD-friendly designs, the authors show that an established scientific code can achieve substantial, portable speedups on modern x86 and Arm clusters, an insight that developers across many high-performance domains can put to immediate use.
Authors
- Shiting Long
- Gustavo Ramirez-Hidalgo
- Stepan Nassyr
- Jose Jimenez-Merchan
- Andreas Frommer
- Dirk Pleiter
Paper Information
- arXiv ID: 2601.05816v1
- Categories: cs.DC, hep-lat
- Published: January 9, 2026