[Paper] TritonForge: Profiling-Guided Framework for Automated Triton Kernel Optimization

Published: December 9, 2025

Source: arXiv - 2512.09196v1

Overview

TritonForge is a new framework that automatically tunes GPU kernels written in Triton, a popular DSL for high‑performance deep‑learning primitives. By coupling static kernel analysis with live profiling feedback, it iteratively rewrites code to eliminate bottlenecks—delivering up to 5× speed‑ups without requiring developers to be GPU architecture experts.
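For readers new to Triton, the snippet below is a minimal vector-add kernel of the kind such a framework operates on (an illustrative example, not code from the paper); the compile-time `BLOCK_SIZE` constant is exactly the sort of parameter an automated tuner would adjust:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the input.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor, block_size: int = 1024) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, block_size),)
    # BLOCK_SIZE is a compile-time constant: a prime target for automated tuning.
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=block_size)
    return out
```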

Key Contributions

  • Profiling‑guided optimization loop that ties runtime metrics directly to source‑level transformations.
  • Modular code‑generation pipeline that can plug in any reasoning engine (the prototype uses large language models, but the design is model‑agnostic); a minimal interface sketch follows this list.
  • Automated bottleneck detection for common Triton pitfalls such as sub‑optimal block sizes, memory layout mismatches, and insufficient instruction‑level parallelism.
  • Empirical validation across a suite of kernels (matrix multiplication, convolutions, attention) on multiple GPU generations, showing an average 1.76× performance gain and peaks of up to 5× over hand‑written baselines.
  • Open‑source prototype that demonstrates how developers can integrate TritonForge into existing CI pipelines for continuous performance regression testing.
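As one way to picture the model-agnostic design, here is a minimal sketch of a pluggable rewrite-engine interface; the `RewriteEngine` protocol and `RuleBasedEngine` class are illustrative assumptions, not the paper's actual API:

```python
from typing import Protocol

class RewriteEngine(Protocol):
    """Anything that can propose candidate rewrites from kernel source plus a
    diagnosed bottleneck: an LLM wrapper, a rule engine, a search heuristic."""
    def propose(self, kernel_src: str, bottleneck: str) -> list[str]: ...

class RuleBasedEngine:
    """Toy rule engine mapping a diagnosed bottleneck to canned transformations.
    A real engine would return rewritten Triton source, not suggestion strings."""
    RULES = {
        "low_occupancy": ["increase BLOCK_SIZE", "split work across more programs"],
        "memory_bound": ["add shared-memory-style buffering", "retile loads"],
    }

    def propose(self, kernel_src: str, bottleneck: str) -> list[str]:
        return self.RULES.get(bottleneck, [])
```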

Methodology

  1. Static Analysis – TritonForge parses the Triton source to extract a high‑level IR (loop nests, memory accesses, thread/block configuration).
  2. Initial Profiling – The kernel is compiled and executed on a target GPU while a lightweight profiler collects metrics (occupancy, memory bandwidth, stall reasons).
  3. Bottleneck Classification – Using rule‑based heuristics (e.g., low occupancy → increase block size) and optional LLM‑driven reasoning, the system pinpoints which code patterns are hurting performance.
  4. Transformation Generation – Candidate rewrites are produced: tiling adjustments, shared‑memory buffering, loop unrolling, or datatype changes.
  5. Iterative Evaluation – Each transformed kernel is re‑compiled, re‑profiled, and compared against the best‑so‑far result. The loop stops when no further improvement is observed or a timeout is reached.

Because the loop is driven by actual runtime data, the optimizer can adapt to the quirks of different GPU microarchitectures (e.g., Ampere vs. Hopper) without hard‑coding architecture‑specific rules.
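A minimal sketch of this measure/diagnose/rewrite/re-measure loop, assuming hypothetical `profile_kernel`, `classify_bottleneck`, and `propose_rewrites` callables that stand in for the paper's components:

```python
import time
from typing import Callable

def optimize_kernel(
    source: str,
    profile_kernel: Callable[[str], float],          # returns runtime (lower is better)
    classify_bottleneck: Callable[[str, float], str],
    propose_rewrites: Callable[[str, str], list[str]],
    max_iters: int = 10,
    timeout_s: float = 300.0,
) -> tuple[str, float]:
    """Illustrative profiling-guided loop covering steps 2-5 above."""
    best_src, best_time = source, profile_kernel(source)      # step 2: initial profiling
    deadline = time.monotonic() + timeout_s
    for _ in range(max_iters):
        bottleneck = classify_bottleneck(best_src, best_time)  # step 3: diagnose
        improved = False
        for candidate in propose_rewrites(best_src, bottleneck):  # step 4: rewrite
            t = profile_kernel(candidate)                          # step 5: re-profile
            if t < best_time:
                best_src, best_time, improved = candidate, t, True
        if not improved or time.monotonic() > deadline:
            break  # stop when no gain is observed or the time budget is exhausted
    return best_src, best_time
```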

Results & Findings

| Kernel Type | Baseline (Triton) | TritonForge Speed‑up | Success Rate |
|---|---|---|---|
| GEMM (FP16) | 1.2 TFLOP/s | 3.8× | 90% |
| 2‑D Convolution (int8) | 0.9 TFLOP/s | 2.1× | 85% |
| Multi‑head Attention | 0.6 TFLOP/s | 1.9× | 78% |
| Custom Reduce | 0.4 TFLOP/s | 5.0× (outlier) | 60% |
  • Average improvement: 1.76× across all tested kernels.
  • Optimization time: ~2–5 minutes per kernel on a single GPU, making it practical for CI pipelines.
  • Model agnosticism: Swapping the LLM for a simpler rule engine reduced the success rate by ~12 % but kept the pipeline functional, confirming that the core profiling‑feedback loop is the primary driver of gains.

Practical Implications

  • Developer productivity: Teams can write straightforward Triton kernels and let TritonForge handle low‑level tuning, freeing engineers to focus on algorithmic innovation.
  • Performance portability: The same Triton source can be automatically retuned for new GPU generations, reducing the maintenance burden when hardware upgrades occur.
  • CI/CD integration: Because the optimization loop is deterministic and relatively fast, it can be added as a step in continuous integration to catch performance regressions early (see the sketch after this list).
  • Cost savings: Faster kernels translate directly into lower cloud GPU bills for large‑scale training or inference workloads.
  • Foundation for higher‑level tools: TritonForge’s modular design can be extended to other DSLs (e.g., CUDA‑Python, JAX XLA) or to incorporate domain‑specific cost models for energy or latency‑critical applications.
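To make the CI idea concrete, a regression gate might look like the sketch below. `triton.testing.do_bench` is Triton's built-in benchmarking helper; the baseline-file format, tolerance, and kernel under test are assumptions for illustration:

```python
import json
import sys

import torch
from triton.testing import do_bench

def check_regression(kernel_fn, name: str,
                     baseline_path: str = "perf_baseline.json",
                     tolerance: float = 0.10) -> None:
    """Fail the CI job if `kernel_fn` runs more than `tolerance` slower than
    its recorded baseline. File format and threshold are illustrative choices."""
    ms = do_bench(kernel_fn)  # runtime in ms (exact statistic varies by Triton version)
    with open(baseline_path) as f:
        baseline = json.load(f)  # e.g. {"fp16_gemm": 0.42}
    if name in baseline and ms > baseline[name] * (1.0 + tolerance):
        print(f"{name}: {ms:.3f} ms vs {baseline[name]:.3f} ms baseline -- regression")
        sys.exit(1)
    print(f"{name}: {ms:.3f} ms (ok)")

if __name__ == "__main__":
    a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    check_regression(lambda: a @ b, "fp16_gemm")
```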

Limitations & Future Work

  • Reliance on profiling accuracy: On GPUs with limited profiling counters, the bottleneck classifier may misinterpret stalls, leading to sub‑optimal rewrites.
  • Search space explosion: The current heuristic search can miss globally optimal configurations for highly irregular kernels; integrating more sophisticated search algorithms (e.g., Bayesian optimization) is a planned upgrade.
  • LLM dependence: While the framework is model‑agnostic, the prototype’s best results still hinge on an LLM for code reasoning; future work will explore lightweight static analysis alternatives to reduce inference cost.
  • Broader benchmark coverage: The evaluation focused on a curated set of kernels; expanding to end‑to‑end models (e.g., full transformer training loops) will validate real‑world impact.

TritonForge demonstrates that data‑driven, automated tuning can bring expert‑level GPU performance within reach of everyday developers, opening the door to more scalable and maintainable high‑performance ML codebases.

Authors

  • Haonan Li
  • Keyu Man
  • Partha Kanuparthy
  • Hanning Chen
  • Wei Sun
  • Sreen Tallam
  • Chenguang Zhu
  • Kevin Zhu
  • Zhiyun Qian

Paper Information

  • arXiv ID: 2512.09196v1
  • Categories: cs.SE
  • Published: December 9, 2025