[Paper] Design in Tiles: Automating GEMM Deployment on Tile-Based Many-PE Accelerators
Source: arXiv - 2512.13638v1
Overview
The paper introduces Design in Tiles (DiT), an automated framework that bridges the gap between high‑level GEMM (General Matrix Multiplication) code and the low‑level, tile‑based many‑PE (Processing Element) accelerators that power today’s AI chips. By coupling a deployment toolchain with a configurable executable model, DiT makes it possible to generate near‑optimal GEMM kernels without the painstaking manual tuning that current hardware‑specific libraries require.
Key Contributions
- End‑to‑end automation: A toolchain that takes a GEMM description and produces a fully‑tuned implementation for any tile‑based accelerator configuration.
- Configurable executable model: A parametric performance model that captures PE count, tile dimensions, memory hierarchy, and inter‑tile bandwidth, enabling rapid design‑space exploration (a minimal sketch follows this list).
- Scalable mapping strategy: A systematic method for partitioning matrices across 2‑D tile grids (e.g., 32 × 32 tiles) that maximizes PE utilization and hides communication latency.
- Performance parity and beyond: Demonstrated 1.2–2.0× speed‑ups over NVIDIA GH200’s expert‑tuned GEMM libraries on a simulated 1979 TFLOPS FP8 accelerator with 4 TB/s bandwidth.
- Open‑source prototype: The authors release the DiT framework and a set of benchmark scripts, encouraging community adoption and further research.
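To make the "configurable executable model" idea concrete, the sketch below shows one way such a parametric model could be expressed. The names (`AcceleratorConfig`, `estimate_gemm_time`) and default values are illustrative assumptions, not the paper's actual interface; the estimate simply takes the larger of a compute bound and a bandwidth bound in roofline fashion.

```python
from dataclasses import dataclass

@dataclass
class AcceleratorConfig:
    """Hypothetical knobs of a tile-based many-PE accelerator (defaults are illustrative)."""
    pe_rows: int = 32                   # rows of the PE grid
    pe_cols: int = 32                   # columns of the PE grid
    macs_per_pe_per_cycle: int = 64     # FP8 MACs each PE retires per cycle
    clock_ghz: float = 1.5              # PE clock frequency
    dram_bandwidth_gbs: float = 4000.0  # shared-memory bandwidth in GB/s
    bytes_per_element: int = 1          # FP8 operands

def estimate_gemm_time(cfg: AcceleratorConfig, M: int, N: int, K: int) -> float:
    """Roofline-style lower bound on GEMM time (seconds): the slower of the
    compute bound and the bandwidth bound wins."""
    flops = 2.0 * M * N * K
    peak_flops = (cfg.pe_rows * cfg.pe_cols * cfg.macs_per_pe_per_cycle
                  * 2 * cfg.clock_ghz * 1e9)
    compute_time = flops / peak_flops
    # Simplification: A and B are streamed once and C is written once,
    # ignoring the reuse that tiling actually buys.
    bytes_moved = (M * K + K * N + M * N) * cfg.bytes_per_element
    memory_time = bytes_moved / (cfg.dram_bandwidth_gbs * 1e9)
    return max(compute_time, memory_time)
```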
Methodology
- Hardware Abstraction: The authors model a tile‑based accelerator as a grid of identical PEs, each with local registers, a small scratchpad, and a high‑bandwidth interconnect. Parameters such as tile size, PE count, and memory bandwidth are exposed as knobs.
- Tile‑Level Scheduling: GEMM’s three nested loops (i, j, k) are re‑ordered and tiled to match the hardware grid. DiT automatically decides how many rows/columns of the output matrix each tile should compute and how the reduction dimension (k) is split across time steps (see the loop‑nest sketch after this list).
- Data Movement Planning: Using the executable model, DiT predicts the cost of loading tiles from the shared memory hierarchy, schedules double‑buffering, and inserts prefetches to keep PEs busy.
- Code Generation: The high‑level schedule is lowered to C++/CUDA‑like kernels that target the specific PE ISA (e.g., RISC‑V‑based or custom SIMD). The generated code is compiled with the vendor’s backend, producing a binary ready for the target accelerator.
- Iterative Optimization: A lightweight autotuner explores a small set of tile‑size candidates, guided by the model’s performance estimates, to pick the configuration that yields the highest predicted PE utilization (see the autotuner sketch after this list).
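The tiled loop structure that the scheduler reasons about can be written as a plain NumPy reference: the output matrix is partitioned into TM × TN tiles that would map onto the PE grid, and the reduction dimension K is split into TK‑sized steps accumulated over time. The function name and tile sizes below are illustrative, not values chosen by DiT.

```python
import numpy as np

def tiled_gemm(A: np.ndarray, B: np.ndarray,
               TM: int = 128, TN: int = 128, TK: int = 64) -> np.ndarray:
    """Reference tiled GEMM: each (i, j) output tile would map to one PE (or PE group),
    and the k loop is the time-stepped reduction. Shapes must divide evenly here."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % TM == 0 and N % TN == 0 and K % TK == 0
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, TM):          # output-tile rows -> PE-grid rows
        for j in range(0, N, TN):      # output-tile columns -> PE-grid columns
            acc = np.zeros((TM, TN), dtype=np.float32)
            for k in range(0, K, TK):  # reduction dimension split across time steps
                acc += A[i:i+TM, k:k+TK] @ B[k:k+TK, j:j+TN]
            C[i:i+TM, j:j+TN] = acc
    return C
```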
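Similarly, the model‑guided search over tile sizes (covering the data‑movement planning and autotuning steps) can be approximated in a few lines, reusing the `AcceleratorConfig` sketch above. The helper names are hypothetical, the per‑tile cost assumes double buffering fully overlaps prefetching with compute, and the candidate list and total‑time proxy are deliberate simplifications rather than the paper's actual cost model.

```python
from itertools import product

def predicted_tile_time(cfg: AcceleratorConfig, TM: int, TN: int, TK: int) -> float:
    """Per-tile time under double buffering: computing the current block overlaps
    with prefetching the next one, so the slower of the two dominates."""
    compute = (2.0 * TM * TN * TK) / (cfg.macs_per_pe_per_cycle * 2 * cfg.clock_ghz * 1e9)
    # Crude assumption: shared bandwidth is split evenly across all PEs.
    per_pe_bw = cfg.dram_bandwidth_gbs * 1e9 / (cfg.pe_rows * cfg.pe_cols)
    load = (TM * TK + TK * TN) * cfg.bytes_per_element / per_pe_bw
    return max(compute, load)

def autotune(cfg: AcceleratorConfig, M: int, N: int, K: int):
    """Pick the tile shape with the lowest predicted total time over a small candidate set."""
    candidates = [32, 64, 128, 256]
    best = None
    for TM, TN, TK in product(candidates, repeat=3):
        if M % TM or N % TN or K % TK:
            continue
        tiles = (M // TM) * (N // TN)
        waves = -(-tiles // (cfg.pe_rows * cfg.pe_cols))  # ceil: "waves" of tiles over the grid
        total = waves * (K // TK) * predicted_tile_time(cfg, TM, TN, TK)
        if best is None or total < best[0]:
            best = (total, (TM, TN, TK))
    return best

# Example: search tile shapes for an 8192 x 8192 x 8192 GEMM.
# print(autotune(AcceleratorConfig(), 8192, 8192, 8192))
```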
Results & Findings
| Matrix shape | GH200 baseline (relative performance) | DiT‑generated GEMM (relative performance) | Speed‑up |
|---|---|---|---|
| Square (8192 × 8192) | 1.0× | 1.8× | +80% |
| Tall‑skinny (16384 × 1024) | 1.0× | 1.3× | +30% |
| Wide‑short (1024 × 16384) | 1.0× | 2.0× | +100% |
- PE Utilization: DiT consistently kept > 95 % of PEs active, compared to ~70 % for the hand‑tuned GH200 kernels on irregular shapes.
- Memory Bandwidth: The model‑driven prefetch schedule achieved > 90 % of the theoretical 4 TB/s bandwidth, eliminating stalls caused by naive data movement.
- Compilation Time: End‑to‑end generation (including autotuning) completed in under 5 minutes on a standard workstation, far faster than the weeks‑long manual tuning cycles typical for custom ASIC libraries.
Practical Implications
- Accelerator Vendors: DiT offers a path to ship “plug‑and‑play” GEMM libraries with new AI chips, reducing the need for costly, expert‑only software teams.
- Framework Integrators: Deep‑learning stacks (e.g., PyTorch, TensorFlow) can call DiT as a backend for custom hardware, gaining performance without rewriting kernels.
- Edge & Cloud Deployments: The ability to automatically adapt GEMM to varying matrix shapes means better utilization for inference workloads that often involve non‑square tensors (e.g., transformer attention heads).
- Hardware‑Software Co‑Design: Designers can experiment with tile counts, interconnect bandwidth, and memory sizes in the model and instantly see the impact on GEMM performance, informing better silicon decisions early in the design cycle (see the sweep example below).
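As a usage illustration of the earlier model sketch (again using the hypothetical `AcceleratorConfig` and `estimate_gemm_time`, not DiT's real API), a co‑design sweep could vary one knob, here the PE‑grid size, and compare the predicted GEMM time before committing to a silicon configuration:

```python
# Sweep the PE-grid size for a fixed 8192 x 8192 x 8192 GEMM and compare the
# predicted execution time for each candidate design point.
for dim in (16, 32, 64):
    cfg = AcceleratorConfig(pe_rows=dim, pe_cols=dim)
    t = estimate_gemm_time(cfg, 8192, 8192, 8192)
    print(f"{dim} x {dim} PEs -> predicted {t * 1e3:.3f} ms")
```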
Limitations & Future Work
- Model Fidelity: While the executable model captures major latency sources, it abstracts away fine‑grained effects such as cache line conflicts and voltage‑frequency scaling, which can affect real silicon results.
- Scope Beyond GEMM: DiT currently focuses on dense matrix multiplication; extending the framework to convolutions, sparse kernels, or mixed‑precision workloads remains an open challenge.
- Hardware Diversity: The prototype targets a specific class of tile‑based many‑PE accelerators; adapting it to heterogeneous designs (e.g., combining tiles with vector units) will require additional modeling effort.
- Autotuning Overhead: Although lightweight, the autotuner may still miss globally optimal tile sizes for extremely large design spaces; integrating more sophisticated search algorithms is a planned direction.
Design in Tiles demonstrates that with the right abstraction and tooling, the traditionally arduous task of mapping GEMM onto cutting‑edge AI accelerators can be automated, delivering both performance gains and a faster time‑to‑market for next‑generation hardware.
Authors
- Aofeng Shen
- Chi Zhang
- Yakup Budanaz
- Alexandru Calotoiu
- Torsten Hoefler
- Luca Benini
Paper Information
- arXiv ID: 2512.13638v1
- Categories: cs.DC, cs.AR
- Published: December 15, 2025