[Paper] tritonBLAS: Triton-based Analytical Approach for GEMM Kernel Parameter Selection
Source: arXiv - 2512.04226v1
Overview
The paper introduces tritonBLAS, a deterministic analytical model that automatically selects high‑performance parameters for GEMM (general matrix‑multiply) kernels on GPUs. By leveraging architectural details such as cache hierarchy and data placement, tritonBLAS can generate near‑optimal kernels without the costly runtime autotuning that most libraries rely on.
Key Contributions
- Analytical performance model that maps GPU micro‑architecture (cache sizes, shared‑memory layout, etc.) to GEMM blocking parameters.
- Triton‑only implementation of a lightweight GEMM framework, eliminating the need for hand‑written CUDA kernels or external libraries.
- Zero‑runtime autotuning: the model predicts optimal configurations at compile‑time, achieving >95% of the performance of state‑of‑the‑art autotuned solutions.
- Broad evaluation across a wide range of matrix shapes and modern GPUs (e.g., NVIDIA Ada, Ampere, and Hopper), demonstrating consistent speed‑ups and low overhead.
- Open‑source potential: the approach can be integrated into existing Triton‑based projects or used as a drop‑in replacement for cuBLAS/rocBLAS in production pipelines.
Methodology
- Architectural profiling – The authors extract key hardware parameters (L1/L2 cache capacities, shared‑memory per SM, number of warps, etc.) from the target GPU.
- Analytical blocking model – They formulate a set of equations that relate these parameters to GEMM tiling choices (block‑size, thread‑block shape, register usage). The model captures three levels of blocking:
- Register tiling (inner micro‑kernel)
- Shared‑memory tiling (mid‑level)
- Cache‑level tiling (outer)
- Parameter selection algorithm – A lightweight search (often closed‑form) evaluates feasible tiling configurations and picks the one that maximizes estimated arithmetic intensity while respecting memory‑bandwidth and occupancy constraints.
- Kernel generation in Triton – The selected parameters are fed into a generic Triton GEMM template, which the Triton compiler specializes into a concrete GPU kernel.
- Validation – The generated kernels are benchmarked against autotuned libraries (e.g., cuBLAS, CUTLASS) on a suite of matrix dimensions ranging from small (e.g., 64×64) to large (e.g., 8192×8192).
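The selection step described above can be sketched as a small search over candidate tile shapes, scored by estimated arithmetic intensity under a shared‑memory constraint. This is an illustrative reconstruction, not the authors' exact model: the hardware constants, double‑buffering assumption, and scoring function below are all assumptions for the sketch.

```python
# Illustrative analytical tile selection (not the paper's exact model).
# Hardware numbers are placeholders for an NVIDIA-like SM.
SHARED_MEM_BYTES = 164 * 1024   # shared memory per SM (assumed)
DTYPE_BYTES = 2                 # fp16 operands (assumed)

def shared_mem_usage(bm, bn, bk):
    """Bytes of shared memory for double-buffered A and B tiles."""
    return 2 * (bm * bk + bk * bn) * DTYPE_BYTES

def arithmetic_intensity(bm, bn, bk):
    """FLOPs per byte moved for one (bm x bn x bk) tile iteration."""
    flops = 2 * bm * bn * bk
    bytes_moved = (bm * bk + bk * bn) * DTYPE_BYTES
    return flops / bytes_moved

def select_tile(candidates):
    """Pick the feasible tile that maximizes estimated intensity."""
    feasible = [c for c in candidates
                if shared_mem_usage(*c) <= SHARED_MEM_BYTES]
    return max(feasible, key=lambda c: arithmetic_intensity(*c))

candidates = [(64, 64, 32), (128, 128, 32), (128, 256, 64), (256, 256, 64)]
best = select_tile(candidates)  # largest tile that still fits shared memory
```

In a real pipeline the winning `(bm, bn, bk)` triple would be handed to a generic Triton GEMM template as `BLOCK_M`/`BLOCK_N`/`BLOCK_K` constexpr parameters; the search itself is cheap enough to run at kernel-launch time.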
Results & Findings
| GPU (arch) | Speedup vs. cuBLAS (avg.) | % of Autotuned Peak | Tuning Overhead |
|---|---|---|---|
| RTX 4090 (Ada) | +3% | 96% | 0 ms (model‑only) |
| A100 (Ampere) | +1% | 95% | 0 ms |
| H100 (Hopper) | +2% | 97% | 0 ms |
- Consistent performance across square, tall‑skinny, and short‑wide matrices.
- Zero autotuning time: the analytical model runs in a few microseconds, compared to minutes‑long search phases in traditional autotuners.
- Memory efficiency: the selected tilings respect L2 cache reuse, leading to lower DRAM traffic than naive kernels.
- Scalability: the approach scales to multi‑GPU setups because the model is purely local to each device; no cross‑device profiling is required.
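The lower-DRAM-traffic finding follows from standard tiling arithmetic. As a back-of-envelope sketch (my own illustration, not figures from the paper): a naive kernel re-reads a full row of A and column of B per output element, while a cache-blocked kernel reads each input matrix only once per tile row or column of output blocks.

```python
# Back-of-envelope DRAM traffic for C = A @ B with n x n matrices
# (illustrative arithmetic only; not measurements from the paper).
def naive_traffic(n, dtype_bytes=2):
    """No reuse: every output element streams n elements of A and B."""
    return 2 * n * n * n * dtype_bytes

def tiled_traffic(n, tile, dtype_bytes=2):
    """With (tile x tile) output blocks, each input matrix is read
    n // tile times in total instead of n times."""
    reads_per_matrix = (n // tile) * n * n
    return 2 * reads_per_matrix * dtype_bytes

n = 4096
reuse = naive_traffic(n) / tiled_traffic(n, 128)  # reuse factor equals the tile size
```

For a 128-wide tile the model predicts a 128x reduction in input traffic, which is why cache-aware tile selection matters far more than raw FLOP throughput for bandwidth-bound shapes.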
Practical Implications
- Faster deployment cycles – Teams can ship new GEMM‑heavy workloads (e.g., transformer inference, scientific simulations) without waiting for autotuning runs on each target machine.
- Predictable performance – Since the model is deterministic, performance regressions are easier to trace and debug compared to stochastic autotuning results.
- Reduced cloud cost – Eliminating long autotuning phases translates directly into lower compute spend for on‑demand GPU instances.
- Ease of integration – tritonBLAS can be dropped into existing Triton codebases (e.g., custom kernels for diffusion models) with a single import, providing a high‑performance GEMM primitive out of the box.
- Portability – The analytical model adapts automatically to new GPU generations; only the hardware‑parameter extraction step needs updating, making it future‑proof for upcoming architectures.
Limitations & Future Work
- Model fidelity – While >95% of autotuned performance is impressive, edge‑case matrix shapes (extremely unbalanced dimensions) still see a small gap.
- Non‑GEMM kernels – The current framework focuses on dense matrix multiplication; extending the analytical approach to convolutions or sparse kernels remains an open challenge.
- Dynamic workloads – For workloads that change matrix sizes at runtime, a lightweight re‑evaluation of the model is required; the authors plan to cache and reuse previous selections.
- Hardware diversity – The study concentrates on NVIDIA GPUs; applying the same methodology to AMD or Intel GPUs will need adaptation of the architectural model.
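The re-evaluate-and-cache idea mentioned for dynamic workloads can be sketched as a shape-keyed memo table. Everything here is a hypothetical illustration: the clamped closed-form choice inside `tile_for_shape` merely stands in for the paper's analytical model.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def tile_for_shape(m, n, k):
    """Memoized stand-in for the analytical model: the (cheap) selection
    runs once per distinct problem shape and is reused thereafter."""
    # Hypothetical closed-form choice: clamp tiles to the problem size.
    return min(128, m), min(128, n), min(32, k)

# First occurrence of a shape evaluates the model; repeats hit the cache.
tile_for_shape(4096, 4096, 4096)
tile_for_shape(4096, 4096, 4096)  # cache hit, no re-evaluation
```

Because the model itself runs in microseconds, even a cold cache adds negligible latency; the memoization mainly avoids redundant work inside tight inference loops that cycle through a small set of shapes.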
Overall, tritonBLAS demonstrates that a well‑crafted analytical model can replace expensive autotuning for GEMM, offering developers a fast, reliable, and portable way to achieve near‑optimal GPU performance.
Authors
- Ryan Swann
- Muhammad Osama
- Xiaohu Guo
- Bryant Nelson
- Lixun Zhang
- Alex Brown
- Yen Ong
- Ali Yazdani
- Sean Siddens
- Ganesh Dasika
- Alex Underwood
Paper Information
- arXiv ID: 2512.04226v1
- Categories: cs.DC
- Published: December 3, 2025