[Paper] Hierarchical Precision and Recursion for Accelerating Symmetric Linear Solves on MXUs

Published: January 12, 2026 at 06:46 PM EST
4 min read
Source: arXiv - 2601.08082v1

Overview

This paper presents a portable mixed‑precision Cholesky solver that targets modern Matrix Processing Units (MXUs) such as NVIDIA Tensor Cores (H200) and AMD Matrix Cores (MI300X). By reorganizing the classic Cholesky algorithm into a hierarchical recursive structure, the authors can push the bulk of the work onto fast low‑precision GEMM units while keeping the numerically sensitive diagonal blocks in high precision, delivering dramatic speedups without sacrificing accuracy.

Key Contributions

  • Hierarchical recursion that decomposes Cholesky, TRSM, and SYRK into nested sub‑problems, exposing maximal parallelism for MXU‑accelerated GEMM.
  • Fine‑grained mixed‑precision data layout: large off‑diagonal tiles are computed in FP16, diagonal tiles stay in FP64 to preserve stability.
  • Julia‑based implementation that leverages multiple dispatch and dynamic typing to write hardware‑agnostic code while still calling vendor‑optimized kernels under the hood (see the dispatch sketch after this list).
  • Performance breakthroughs: up to 27× speedup for SYRK and 5× for TRSM versus full‑precision baselines; overall 5× speedup for Cholesky over cuSOLVER FP64.
  • Accuracy guarantees: the mixed‑precision solver attains ≈100× better accuracy than a pure FP16 approach while retaining ≈88% of peak FP16 throughput.
  • Cross‑vendor validation: comparable gains observed on both NVIDIA (H200) and AMD (MI300X) GPUs, demonstrating true portability.
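
To make the dispatch idea concrete, here is a minimal illustrative sketch (the names are ours, not the paper's API) of how one generic call site can route tiles to precision‑specific kernels. In a GPU build the Float16 method would wrap the vendor's MXU GEMM; the generic fallback below simply keeps the sketch runnable anywhere:

```julia
using LinearAlgebra

# One generic call site; the method is selected from the tiles' element
# type. The FP16 method is where a real build would call the
# cuBLAS/rocBLAS tensor-core GEMM; both compute the SYRK-style
# downdate C <- C - A*B via the five-argument mul!.
update!(C::AbstractMatrix{Float16}, A, B) = mul!(C, A, B, -one(Float16), one(Float16))
update!(C::AbstractMatrix{Float64}, A, B) = mul!(C, A, B, -1.0, 1.0)

A64 = rand(4, 4);           C64 = Matrix{Float64}(I, 4, 4)
A16 = Float16.(rand(4, 4)); C16 = Matrix{Float16}(I, 4, 4)

update!(C64, A64, A64')    # dispatches to the FP64 kernel
update!(C16, A16, A16')    # dispatches to the FP16 (MXU) kernel
```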

Methodology

  1. Recursive decomposition – The Cholesky factorization is expressed as a recursion that splits the matrix into quadrants. Each recursion step generates smaller TRSM and SYRK tasks, which are themselves recursively broken down until they fit the MXU tile size (a simplified sketch follows this list).
  2. Precision tiering – A custom data structure tags each tile with its target precision. Off‑diagonal tiles (which are less sensitive to rounding) are cast to FP16 before being handed to the MXU GEMM engine. Diagonal tiles remain in FP64 and are processed with standard high‑precision kernels.
  3. Julia front‑end – The algorithm is written in pure Julia, using multiple dispatch to automatically select the appropriate precision kernel at runtime. The code calls into cuBLAS / rocBLAS for the low‑level GEMM, TRSM, and SYRK operations, keeping the high‑level logic hardware‑agnostic.
  4. Error control – After each recursion level, the algorithm performs a residual correction step (essentially a high‑precision refinement) to bound the error introduced by the FP16 tiles, ensuring the final solution meets FP64 accuracy requirements.
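
A compact CPU stand‑in for steps 1–2 might look as follows. This is our own simplified sketch, not the authors' implementation: the diagonal path stays in FP64, the trailing SYRK‑style update runs through an FP16 product (the part an MXU would accelerate), and the per‑level residual correction of step 4 is omitted for brevity:

```julia
using LinearAlgebra

# Recursive mixed-precision Cholesky: quadrant split, FP64 on the
# diagonal path, FP16 for the dominant trailing update.
function rchol!(A::AbstractMatrix{Float64}, base::Int = 64)
    n = size(A, 1)
    if n <= base
        cholesky!(Hermitian(A, :L))        # leaf: plain FP64 Cholesky
        return A
    end
    k = n ÷ 2
    A11 = view(A, 1:k, 1:k)
    A21 = view(A, k+1:n, 1:k)
    A22 = view(A, k+1:n, k+1:n)

    rchol!(A11, base)                      # factor the top-left quadrant
    rdiv!(A21, LowerTriangular(A11)')      # TRSM: A21 <- A21 * L11^{-T}

    # SYRK-style update in low precision: cast the panel to FP16,
    # multiply, and accumulate back into FP64. On a GPU this product
    # is the MXU-accelerated GEMM; here it is a CPU stand-in.
    L21h = Float16.(A21)
    A22 .-= Float64.(L21h * L21h')

    rchol!(A22, base)                      # recurse on the trailing quadrant
    return A
end

# Usage: factor a well-conditioned SPD matrix and check the residual.
n = 256
M = randn(n, n)
S = M * M' + n * I
L = tril(rchol!(copy(S)))
@show norm(L * L' - S) / norm(S)           # small; dominated by the FP16 tiles
```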

Results & Findings

| Kernel | Baseline (FP64) | Mixed‑Precision (FP16‑FP64) | Speedup | Accuracy (relative to FP64) |
| --- | --- | --- | --- | --- |
| SYRK | cuBLAS FP64 | Recursive FP16‑FP64 | 27× | ≈100× better than pure FP16 |
| TRSM | cuBLAS FP64 | Recursive FP16‑FP64 | 5× | Within FP64 tolerance |
| Cholesky (overall) | cuSOLVER FP64 | Recursive mixed‑precision | 5× | FP64‑level accuracy at ≈88% of FP16 peak throughput |

The same pattern holds on AMD MI300X: the mixed‑precision solver reaches similar speedups and error bounds, confirming that the recursion + precision‑tiering strategy is not tied to a single vendor’s MXU implementation.

Practical Implications

  • Accelerated scientific workloads – Applications that rely heavily on symmetric solves (e.g., finite‑element analysis, climate simulations, Gaussian process regression) can see order‑of‑magnitude reductions in wall‑clock time without rewriting their entire codebase.
  • Machine‑learning pipelines – Many ML algorithms (e.g., kernel methods, second‑order optimizers) need Cholesky factorizations; the mixed‑precision solver can make these steps competitive with pure‑FP32 training loops (see the usage sketch after this list).
  • GPU‑centric libraries – The Julia implementation demonstrates a clean path for library authors to expose mixed‑precision linear algebra primitives that automatically dispatch to the best‑available MXU kernels.
  • Energy efficiency – Running more of the work in FP16 on MXUs reduces power draw per FLOP, which is valuable for large‑scale HPC clusters and edge AI devices with GPU accelerators.
  • Portability – Because the high‑level algorithm is hardware‑agnostic, developers can target future MXU architectures (e.g., successors to NVIDIA's Hopper and AMD's CDNA designs) with minimal code changes.
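
As a hypothetical usage example (all names are ours), here is where such a solver slots into a Gaussian process regression step; the `cholesky` call is the bottleneck a mixed‑precision factorization would replace:

```julia
using LinearAlgebra

rbf(a, b; ℓ = 0.25) = exp(-abs2(a - b) / (2ℓ^2))   # squared-exponential kernel

xs = range(0, 1; length = 200)
ys = sin.(2π .* xs) .+ 0.05 .* randn(length(xs))

K = [rbf(a, b) for a in xs, b in xs] + 1e-2I   # kernel matrix + noise term
F = cholesky(Hermitian(K))                     # <- the symmetric solve the
                                               #    paper accelerates
α = F.U \ (F.L \ ys)                           # posterior weights via two
                                               #    triangular solves
```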

Limitations & Future Work

  • Memory bandwidth pressure – The recursive tiling can increase data movement for very large matrices, potentially limiting scalability on memory‑bound systems.
  • Precision tuning overhead – Determining the optimal tile size and precision split currently requires empirical tuning; an adaptive runtime scheduler could automate this.
  • Sparse extensions – The current work focuses on dense symmetric matrices; extending the recursion to sparse Cholesky remains an open challenge.
  • Broader language support – While Julia offers a concise implementation, integrating the same approach into C++/Python ecosystems would broaden adoption.

Overall, the paper showcases a compelling recipe for squeezing every ounce of performance out of modern MXUs while keeping the numerical guarantees that engineers rely on.

Authors

  • Vicki Carrica
  • Rabab Alomairy
  • Evelyne Ringoot
  • Alan Edelman

Paper Information

  • arXiv ID: 2601.08082v1
  • Categories: cs.DC, cs.ET, cs.MS, cs.PF
  • Published: January 12, 2026
