[Paper] Hierarchical Precision and Recursion for Accelerating Symmetric Linear Solves on MXUs

Published: January 12, 2026 at 06:46 PM EST
4 min read
Source: arXiv - 2601.08082v1

Overview

This paper presents a portable mixed‑precision Cholesky solver that targets modern Matrix Processing Units (MXUs) such as NVIDIA Tensor Cores (H200) and AMD Matrix Cores (MI300X). By reorganizing the classic Cholesky algorithm into a hierarchical recursive structure, the authors can push the bulk of the work onto fast low‑precision GEMM units while keeping the numerically sensitive diagonal blocks in high precision, delivering dramatic speedups without sacrificing accuracy.

Key Contributions

  • Hierarchical recursion that decomposes Cholesky, TRSM, and SYRK into nested sub‑problems, exposing maximal parallelism for MXU‑accelerated GEMM.
  • Fine‑grained mixed‑precision data layout: large off‑diagonal tiles are computed in FP16, diagonal tiles stay in FP64 to preserve stability.
  • Julia‑based implementation that leverages multiple dispatch and dynamic typing to write hardware‑agnostic code while still calling vendor‑optimized kernels under the hood (see the dispatch sketch after this list).
  • Performance breakthroughs: up to 27× speedup for SYRK and 5× for TRSM versus full‑precision baselines; overall 5× speedup for Cholesky over cuSOLVER FP64.
  • Accuracy guarantees: the mixed‑precision solver attains ≈100× better accuracy than a pure FP16 approach while retaining ≈88% of peak FP16 throughput.
  • Cross‑vendor validation: comparable gains observed on both NVIDIA (H200) and AMD (MI300X) GPUs, demonstrating true portability.
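
To make the dispatch idea concrete, here is a minimal illustrative sketch (the names are ours, not the paper's API) of how one generic call site can route tiles to precision‑specific kernels. In a GPU build the Float16 method would wrap the vendor's MXU GEMM; the generic fallback below simply keeps the sketch runnable anywhere:

```julia
using LinearAlgebra

# One generic call site; the method is selected from the tiles' element
# type. The FP16 method is where a real build would call the
# cuBLAS/rocBLAS tensor-core GEMM; both compute the SYRK-style
# downdate C <- C - A*B via the five-argument mul!.
update!(C::AbstractMatrix{Float16}, A, B) = mul!(C, A, B, -one(Float16), one(Float16))
update!(C::AbstractMatrix{Float64}, A, B) = mul!(C, A, B, -1.0, 1.0)

A64 = rand(4, 4);           C64 = Matrix{Float64}(I, 4, 4)
A16 = Float16.(rand(4, 4)); C16 = Matrix{Float16}(I, 4, 4)

update!(C64, A64, A64')    # dispatches to the FP64 kernel
update!(C16, A16, A16')    # dispatches to the FP16 (MXU) kernel
```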

Methodology

  1. Recursive decomposition – The Cholesky factorization is expressed as a recursion that splits the matrix into quadrants. Each recursion step generates smaller TRSM and SYRK tasks, which are themselves recursively broken down until they fit the MXU tile size (a simplified sketch follows this list).
  2. Precision tiering – A custom data structure tags each tile with its target precision. Off‑diagonal tiles (which are less sensitive to rounding) are cast to FP16 before being handed to the MXU GEMM engine. Diagonal tiles remain in FP64 and are processed with standard high‑precision kernels.
  3. Julia front‑end – The algorithm is written in pure Julia, using multiple dispatch to automatically select the appropriate precision kernel at runtime. The code calls into cuBLAS / rocBLAS for the low‑level GEMM, TRSM, and SYRK operations, keeping the high‑level logic hardware‑agnostic.
  4. Error control – After each recursion level, the algorithm performs a residual correction step (essentially a high‑precision refinement) to bound the error introduced by the FP16 tiles, ensuring the final solution meets FP64 accuracy requirements.
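
A compact CPU stand‑in for steps 1–2 might look as follows. This is our own simplified sketch, not the authors' implementation: the diagonal path stays in FP64, the trailing SYRK‑style update runs through an FP16 product (the part an MXU would accelerate), and the per‑level residual correction of step 4 is omitted for brevity:

```julia
using LinearAlgebra

# Recursive mixed-precision Cholesky: quadrant split, FP64 on the
# diagonal path, FP16 for the dominant trailing update.
function rchol!(A::AbstractMatrix{Float64}, base::Int = 64)
    n = size(A, 1)
    if n <= base
        cholesky!(Hermitian(A, :L))        # leaf: plain FP64 Cholesky
        return A
    end
    k = n ÷ 2
    A11 = view(A, 1:k, 1:k)
    A21 = view(A, k+1:n, 1:k)
    A22 = view(A, k+1:n, k+1:n)

    rchol!(A11, base)                      # factor the top-left quadrant
    rdiv!(A21, LowerTriangular(A11)')      # TRSM: A21 <- A21 * L11^{-T}

    # SYRK-style update in low precision: cast the panel to FP16,
    # multiply, and accumulate back into FP64. On a GPU this product
    # is the MXU-accelerated GEMM; here it is a CPU stand-in.
    L21h = Float16.(A21)
    A22 .-= Float64.(L21h * L21h')

    rchol!(A22, base)                      # recurse on the trailing quadrant
    return A
end

# Usage: factor a well-conditioned SPD matrix and check the residual.
n = 256
M = randn(n, n)
S = M * M' + n * I
L = tril(rchol!(copy(S)))
@show norm(L * L' - S) / norm(S)           # small; dominated by the FP16 tiles
```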

Results & Findings

| Kernel | Baseline (FP64) | Mixed‑Precision (FP16‑FP64) | Speedup | Accuracy (relative to FP64) |
| --- | --- | --- | --- | --- |
| SYRK | cuBLAS FP64 | Recursive FP16‑FP64 | 27× | ≈100× better than pure FP16 |
| TRSM | cuBLAS FP64 | Recursive FP16‑FP64 | 5× | Within FP64 tolerance |
| Cholesky (overall) | cuSOLVER FP64 | Recursive mixed‑precision | 5× | FP64‑level accuracy at ≈88% of FP16 peak throughput |

The same pattern holds on AMD MI300X: the mixed‑precision solver reaches similar speedups and error bounds, confirming that the recursion + precision‑tiering strategy is not tied to a single vendor’s MXU implementation.

Practical Implications

  • Accelerated scientific workloads – Applications that rely heavily on symmetric solves (e.g., finite‑element analysis, climate simulations, Gaussian process regression) can see order‑of‑magnitude reductions in wall‑clock time without rewriting their entire codebase.
  • Machine‑learning pipelines – Many ML algorithms (e.g., kernel methods, second‑order optimizers) need Cholesky factorizations; the mixed‑precision solver can make these steps competitive with pure‑FP32 training loops (see the usage sketch after this list).
  • GPU‑centric libraries – The Julia implementation demonstrates a clean path for library authors to expose mixed‑precision linear algebra primitives that automatically dispatch to the best‑available MXU kernels.
  • Energy efficiency – Running more of the work in FP16 on MXUs reduces power draw per FLOP, which is valuable for large‑scale HPC clusters and edge AI devices with GPU accelerators.
  • Portability – Because the high‑level algorithm is hardware‑agnostic, developers can target future MXU architectures (e.g., successors to NVIDIA's Hopper and AMD's CDNA designs) with minimal code changes.
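
As a hypothetical usage example (all names are ours), here is where such a solver slots into a Gaussian process regression step; the `cholesky` call is the bottleneck a mixed‑precision factorization would replace:

```julia
using LinearAlgebra

rbf(a, b; ℓ = 0.25) = exp(-abs2(a - b) / (2ℓ^2))   # squared-exponential kernel

xs = range(0, 1; length = 200)
ys = sin.(2π .* xs) .+ 0.05 .* randn(length(xs))

K = [rbf(a, b) for a in xs, b in xs] + 1e-2I   # kernel matrix + noise term
F = cholesky(Hermitian(K))                     # <- the symmetric solve the
                                               #    paper accelerates
α = F.U \ (F.L \ ys)                           # posterior weights via two
                                               #    triangular solves
```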

Limitations & Future Work

  • Memory bandwidth pressure – The recursive tiling can increase data movement for very large matrices, potentially limiting scalability on memory‑bound systems.
  • Precision tuning overhead – Determining the optimal tile size and precision split currently requires empirical tuning; an adaptive runtime scheduler could automate this.
  • Sparse extensions – The current work focuses on dense symmetric matrices; extending the recursion to sparse Cholesky remains an open challenge.
  • Broader language support – While Julia offers a concise implementation, integrating the same approach into C++/Python ecosystems would broaden adoption.

Overall, the paper showcases a compelling recipe for squeezing every ounce of performance out of modern MXUs while keeping the numerical guarantees that engineers rely on.

Authors

  • Vicki Carrica
  • Rabab Alomairy
  • Evelyne Ringoot
  • Alan Edelman

Paper Information

  • arXiv ID: 2601.08082v1
  • Categories: cs.DC, cs.ET, cs.MS, cs.PF
  • Published: January 12, 2026
