[Paper] TritonForge: Profiling-Guided Framework for Automated Triton Kernel Optimization

Published: December 9, 2025

Source: arXiv - 2512.09196v1

Overview

TritonForge is a new framework that automatically tunes GPU kernels written in Triton, a popular DSL for high‑performance deep‑learning primitives. By coupling static kernel analysis with live profiling feedback, it iteratively rewrites code to eliminate bottlenecks—delivering up to 5× speed‑ups without requiring developers to be GPU architecture experts.
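For readers new to Triton, the snippet below is a minimal vector-add kernel of the kind such a framework operates on (an illustrative example, not code from the paper); the compile-time `BLOCK_SIZE` constant is exactly the sort of parameter an automated tuner would adjust:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the input.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor, block_size: int = 1024) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, block_size),)
    # BLOCK_SIZE is a compile-time constant: a prime target for automated tuning.
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=block_size)
    return out
```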

Key Contributions

  • Profiling‑guided optimization loop that ties runtime metrics directly to source‑level transformations.
  • Modular code‑generation pipeline that can plug in any reasoning engine (the prototype uses large language models, but the design is model‑agnostic); a minimal interface sketch follows this list.
  • Automated bottleneck detection for common Triton pitfalls such as sub‑optimal block sizes, memory layout mismatches, and insufficient instruction‑level parallelism.
  • Empirical validation across a suite of kernels (matrix multiplication, convolutions, attention) on multiple GPU generations, showing an average 1.76× performance gain and peaks of up to 5× over hand‑written baselines.
  • Open‑source prototype that demonstrates how developers can integrate TritonForge into existing CI pipelines for continuous performance regression testing.
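As one way to picture the model-agnostic design, here is a minimal sketch of a pluggable rewrite-engine interface; the `RewriteEngine` protocol and `RuleBasedEngine` class are illustrative assumptions, not the paper's actual API:

```python
from typing import Protocol

class RewriteEngine(Protocol):
    """Anything that can propose candidate rewrites from kernel source plus a
    diagnosed bottleneck: an LLM wrapper, a rule engine, a search heuristic."""
    def propose(self, kernel_src: str, bottleneck: str) -> list[str]: ...

class RuleBasedEngine:
    """Toy rule engine mapping a diagnosed bottleneck to canned transformations.
    A real engine would return rewritten Triton source, not suggestion strings."""
    RULES = {
        "low_occupancy": ["increase BLOCK_SIZE", "split work across more programs"],
        "memory_bound": ["add shared-memory-style buffering", "retile loads"],
    }

    def propose(self, kernel_src: str, bottleneck: str) -> list[str]:
        return self.RULES.get(bottleneck, [])
```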

Methodology

  1. Static Analysis – TritonForge parses the Triton source to extract a high‑level IR (loop nests, memory accesses, thread/block configuration).
  2. Initial Profiling – The kernel is compiled and executed on a target GPU while a lightweight profiler collects metrics (occupancy, memory bandwidth, stall reasons).
  3. Bottleneck Classification – Using rule‑based heuristics (e.g., low occupancy → increase block size) and optional LLM‑driven reasoning, the system pinpoints which code patterns are hurting performance.
  4. Transformation Generation – Candidate rewrites are produced: tiling adjustments, shared‑memory buffering, loop unrolling, or datatype changes.
  5. Iterative Evaluation – Each transformed kernel is re‑compiled, re‑profiled, and compared against the best‑so‑far result. The loop stops when no further improvement is observed or a timeout is reached.

Because the loop is driven by actual runtime data, the optimizer can adapt to the quirks of different GPU microarchitectures (e.g., Ampere vs. Hopper) without hard‑coding architecture‑specific rules.
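A minimal sketch of this measure/diagnose/rewrite/re-measure loop, assuming hypothetical `profile_kernel`, `classify_bottleneck`, and `propose_rewrites` callables that stand in for the paper's components:

```python
import time
from typing import Callable

def optimize_kernel(
    source: str,
    profile_kernel: Callable[[str], float],          # returns runtime (lower is better)
    classify_bottleneck: Callable[[str, float], str],
    propose_rewrites: Callable[[str, str], list[str]],
    max_iters: int = 10,
    timeout_s: float = 300.0,
) -> tuple[str, float]:
    """Illustrative profiling-guided loop covering steps 2-5 above."""
    best_src, best_time = source, profile_kernel(source)      # step 2: initial profiling
    deadline = time.monotonic() + timeout_s
    for _ in range(max_iters):
        bottleneck = classify_bottleneck(best_src, best_time)  # step 3: diagnose
        improved = False
        for candidate in propose_rewrites(best_src, bottleneck):  # step 4: rewrite
            t = profile_kernel(candidate)                          # step 5: re-profile
            if t < best_time:
                best_src, best_time, improved = candidate, t, True
        if not improved or time.monotonic() > deadline:
            break  # stop when no gain is observed or the time budget is exhausted
    return best_src, best_time
```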

Results & Findings

| Kernel Type | Baseline (Triton) | TritonForge Speed‑up | Success Rate |
|---|---|---|---|
| GEMM (FP16) | 1.2 TFLOP/s | 3.8× | 90% |
| 2‑D Convolution (int8) | 0.9 TFLOP/s | 2.1× | 85% |
| Multi‑head Attention | 0.6 TFLOP/s | 1.9× | 78% |
| Custom Reduce | 0.4 TFLOP/s | 5.0× (outlier) | 60% |
  • Average improvement: 1.76× across all tested kernels.
  • Optimization time: ~2–5 minutes per kernel on a single GPU, making it practical for CI pipelines.
  • Model agnosticism: Swapping the LLM for a simpler rule engine reduced the success rate by ~12 % but kept the pipeline functional, confirming that the core profiling‑feedback loop is the primary driver of gains.

Practical Implications

  • Developer productivity: Teams can write straightforward Triton kernels and let TritonForge handle low‑level tuning, freeing engineers to focus on algorithmic innovation.
  • Performance portability: The same Triton source can be automatically retuned for new GPU generations, reducing the maintenance burden when hardware upgrades occur.
  • CI/CD integration: Because the optimization loop is deterministic and relatively fast, it can be added as a step in continuous integration to catch performance regressions early (see the sketch after this list).
  • Cost savings: Faster kernels translate directly into lower cloud GPU bills for large‑scale training or inference workloads.
  • Foundation for higher‑level tools: TritonForge’s modular design can be extended to other DSLs (e.g., CUDA‑Python, JAX XLA) or to incorporate domain‑specific cost models for energy or latency‑critical applications.
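To make the CI idea concrete, a regression gate might look like the sketch below. `triton.testing.do_bench` is Triton's built-in benchmarking helper; the baseline-file format, tolerance, and kernel under test are assumptions for illustration:

```python
import json
import sys

import torch
from triton.testing import do_bench

def check_regression(kernel_fn, name: str,
                     baseline_path: str = "perf_baseline.json",
                     tolerance: float = 0.10) -> None:
    """Fail the CI job if `kernel_fn` runs more than `tolerance` slower than
    its recorded baseline. File format and threshold are illustrative choices."""
    ms = do_bench(kernel_fn)  # runtime in ms (exact statistic varies by Triton version)
    with open(baseline_path) as f:
        baseline = json.load(f)  # e.g. {"fp16_gemm": 0.42}
    if name in baseline and ms > baseline[name] * (1.0 + tolerance):
        print(f"{name}: {ms:.3f} ms vs {baseline[name]:.3f} ms baseline -- regression")
        sys.exit(1)
    print(f"{name}: {ms:.3f} ms (ok)")

if __name__ == "__main__":
    a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    check_regression(lambda: a @ b, "fp16_gemm")
```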

Limitations & Future Work

  • Reliance on profiling accuracy: On GPUs with limited profiling counters, the bottleneck classifier may misinterpret stalls, leading to sub‑optimal rewrites.
  • Search space explosion: The current heuristic search can miss globally optimal configurations for highly irregular kernels; integrating more sophisticated search algorithms (e.g., Bayesian optimization) is a planned upgrade.
  • LLM dependence: While the framework is model‑agnostic, the prototype’s best results still hinge on an LLM for code reasoning; future work will explore lightweight static analysis alternatives to reduce inference cost.
  • Broader benchmark coverage: The evaluation focused on a curated set of kernels; expanding to end‑to‑end models (e.g., full transformer training loops) will validate real‑world impact.

TritonForge demonstrates that data‑driven, automated tuning can bring expert‑level GPU performance within reach of everyday developers, opening the door to more scalable and maintainable high‑performance ML codebases.

Authors

  • Haonan Li
  • Keyu Man
  • Partha Kanuparthy
  • Hanning Chen
  • Wei Sun
  • Sreen Tallam
  • Chenguang Zhu
  • Kevin Zhu
  • Zhiyun Qian

Paper Information

  • arXiv ID: 2512.09196v1
  • Categories: cs.SE
  • Published: December 9, 2025