[Paper] tritonBLAS: Triton-based Analytical Approach for GEMM Kernel Parameter Selection
Source: arXiv - 2512.04226v1
Overview
The paper introduces tritonBLAS, a deterministic analytical model that automatically selects high‑performance parameters for GEMM (general matrix‑multiply) kernels on GPUs. By leveraging architectural details such as cache hierarchy and data placement, tritonBLAS can generate near‑optimal kernels without the costly runtime autotuning that most libraries rely on.
Key Contributions
- Analytical performance model that maps GPU micro‑architecture (cache sizes, shared‑memory layout, etc.) to GEMM blocking parameters.
- Triton‑only implementation of a lightweight GEMM framework, eliminating the need for hand‑written CUDA kernels or external libraries.
- Zero‑runtime autotuning: the model predicts optimal configurations at compile‑time, achieving >95% of the performance of state‑of‑the‑art autotuned solutions.
- Broad evaluation across a wide range of matrix shapes and modern GPUs (e.g., NVIDIA Ada, Ampere, and Hopper), demonstrating consistent speed‑ups and low overhead.
- Open‑source potential: the approach can be integrated into existing Triton‑based projects or used as a drop‑in replacement for cuBLAS/rocBLAS in production pipelines.
Methodology
- Architectural profiling – The authors extract key hardware parameters (L1/L2 cache capacities, shared‑memory per SM, number of warps, etc.) from the target GPU.
- Analytical blocking model – They formulate a set of equations that relate these parameters to GEMM tiling choices (block‑size, thread‑block shape, register usage). The model captures three levels of blocking:
- Register tiling (inner micro‑kernel)
- Shared‑memory tiling (mid‑level)
- Cache‑level tiling (outer)
- Parameter selection algorithm – A lightweight search (often closed‑form) evaluates feasible tiling configurations and picks the one that maximizes estimated arithmetic intensity while respecting memory‑bandwidth and occupancy constraints.
- Kernel generation in Triton – The selected parameters are fed into a generic Triton GEMM template, which the Triton compiler specializes into a concrete GPU kernel.
- Validation – The generated kernels are benchmarked against autotuned libraries (e.g., cuBLAS, CUTLASS) on a suite of matrix dimensions ranging from small (e.g., 64×64) to large (e.g., 8192×8192).
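The selection step described above can be sketched as a small search over candidate tile shapes, scored by estimated arithmetic intensity under a shared‑memory constraint. This is an illustrative reconstruction, not the authors' exact model: the hardware constants, double‑buffering assumption, and scoring function below are all assumptions for the sketch.

```python
# Illustrative analytical tile selection (not the paper's exact model).
# Hardware numbers are placeholders for an NVIDIA-like SM.
SHARED_MEM_BYTES = 164 * 1024   # shared memory per SM (assumed)
DTYPE_BYTES = 2                 # fp16 operands (assumed)

def shared_mem_usage(bm, bn, bk):
    """Bytes of shared memory for double-buffered A and B tiles."""
    return 2 * (bm * bk + bk * bn) * DTYPE_BYTES

def arithmetic_intensity(bm, bn, bk):
    """FLOPs per byte moved for one (bm x bn x bk) tile iteration."""
    flops = 2 * bm * bn * bk
    bytes_moved = (bm * bk + bk * bn) * DTYPE_BYTES
    return flops / bytes_moved

def select_tile(candidates):
    """Pick the feasible tile that maximizes estimated intensity."""
    feasible = [c for c in candidates
                if shared_mem_usage(*c) <= SHARED_MEM_BYTES]
    return max(feasible, key=lambda c: arithmetic_intensity(*c))

candidates = [(64, 64, 32), (128, 128, 32), (128, 256, 64), (256, 256, 64)]
best = select_tile(candidates)  # largest tile that still fits shared memory
```

In a real pipeline the winning `(bm, bn, bk)` triple would be handed to a generic Triton GEMM template as `BLOCK_M`/`BLOCK_N`/`BLOCK_K` constexpr parameters; the search itself is cheap enough to run at kernel-launch time.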
Results & Findings
| GPU (arch) | Speedup vs. cuBLAS (avg.) | % of Autotuned Peak | Tuning Overhead |
|---|---|---|---|
| RTX 4090 (Ada) | +3% | 96% | 0 ms (model‑only) |
| A100 (Ampere) | +1% | 95% | 0 ms |
| H100 (Hopper) | +2% | 97% | 0 ms |
- Consistent performance across square, tall‑skinny, and short‑wide matrices.
- Zero autotuning time: the analytical model runs in a few microseconds, compared to minutes‑long search phases in traditional autotuners.
- Memory efficiency: the selected tilings respect L2 cache reuse, leading to lower DRAM traffic than naive kernels.
- Scalability: the approach scales to multi‑GPU setups because the model is purely local to each device; no cross‑device profiling is required.
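The lower-DRAM-traffic finding follows from standard tiling arithmetic. As a back-of-envelope sketch (my own illustration, not figures from the paper): a naive kernel re-reads a full row of A and column of B per output element, while a cache-blocked kernel reads each input matrix only once per tile row or column of output blocks.

```python
# Back-of-envelope DRAM traffic for C = A @ B with n x n matrices
# (illustrative arithmetic only; not measurements from the paper).
def naive_traffic(n, dtype_bytes=2):
    """No reuse: every output element streams n elements of A and B."""
    return 2 * n * n * n * dtype_bytes

def tiled_traffic(n, tile, dtype_bytes=2):
    """With (tile x tile) output blocks, each input matrix is read
    n // tile times in total instead of n times."""
    reads_per_matrix = (n // tile) * n * n
    return 2 * reads_per_matrix * dtype_bytes

n = 4096
reuse = naive_traffic(n) / tiled_traffic(n, 128)  # reuse factor equals the tile size
```

For a 128-wide tile the model predicts a 128x reduction in input traffic, which is why cache-aware tile selection matters far more than raw FLOP throughput for bandwidth-bound shapes.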
Practical Implications
- Faster deployment cycles – Teams can ship new GEMM‑heavy workloads (e.g., transformer inference, scientific simulations) without waiting for autotuning runs on each target machine.
- Predictable performance – Since the model is deterministic, performance regressions are easier to trace and debug compared to stochastic autotuning results.
- Reduced cloud cost – Eliminating long autotuning phases translates directly into lower compute spend for on‑demand GPU instances.
- Ease of integration – tritonBLAS can be dropped into existing Triton codebases (e.g., custom kernels for diffusion models) with a single import, providing a high‑performance GEMM primitive out of the box.
- Portability – The analytical model adapts automatically to new GPU generations; only the hardware‑parameter extraction step needs updating, making it future‑proof for upcoming architectures.
Limitations & Future Work
- Model fidelity – While >95% of autotuned performance is impressive, edge‑case matrix shapes (extremely unbalanced dimensions) still see a small gap.
- Non‑GEMM kernels – The current framework focuses on dense matrix multiplication; extending the analytical approach to convolutions or sparse kernels remains an open challenge.
- Dynamic workloads – For workloads that change matrix sizes at runtime, a lightweight re‑evaluation of the model is required; the authors plan to cache and reuse previous selections.
- Hardware diversity – The study concentrates on NVIDIA GPUs; applying the same methodology to AMD or Intel GPUs will need adaptation of the architectural model.
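The re-evaluate-and-cache idea mentioned for dynamic workloads can be sketched as a shape-keyed memo table. Everything here is a hypothetical illustration: the clamped closed-form choice inside `tile_for_shape` merely stands in for the paper's analytical model.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def tile_for_shape(m, n, k):
    """Memoized stand-in for the analytical model: the (cheap) selection
    runs once per distinct problem shape and is reused thereafter."""
    # Hypothetical closed-form choice: clamp tiles to the problem size.
    return min(128, m), min(128, n), min(32, k)

# First occurrence of a shape evaluates the model; repeats hit the cache.
tile_for_shape(4096, 4096, 4096)
tile_for_shape(4096, 4096, 4096)  # cache hit, no re-evaluation
```

Because the model itself runs in microseconds, even a cold cache adds negligible latency; the memoization mainly avoids redundant work inside tight inference loops that cycle through a small set of shapes.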
Overall, tritonBLAS demonstrates that a well‑crafted analytical model can replace expensive autotuning for GEMM, offering developers a fast, reliable, and portable way to achieve near‑optimal GPU performance.
Authors
- Ryan Swann
- Muhammad Osama
- Xiaohu Guo
- Bryant Nelson
- Lixun Zhang
- Alex Brown
- Yen Ong
- Ali Yazdani
- Sean Siddens
- Ganesh Dasika
- Alex Underwood
Paper Information
- arXiv ID: 2512.04226v1
- Categories: cs.DC
- Published: December 3, 2025