[Paper] Stencil Computations on Tenstorrent Wormhole
Source: arXiv - 2605.07599v1
Overview
The paper evaluates how well the Tenstorrent Wormhole, a RISC‑V‑based AI data‑flow accelerator, can run a classic HPC kernel: the 2‑D 5‑point stencil. By re‑thinking the stencil as either a series of element‑wise matrix operations (Axpy) or a matrix‑multiplication (MatMul), the authors compare performance, energy use, and bottlenecks against a conventional CPU implementation.
Key Contributions
- Two novel mapping strategies for stencil workloads on an AI accelerator:
- Axpy – decomposes the stencil into a pipeline of element‑wise sub‑matrix operations.
- MatMul – reformulates the stencil as a dense matrix‑multiplication to exploit the hardware’s GEMM engine.
- Comprehensive profiling that isolates compute time on the Wormhole from host‑side overhead (PCIe transfers, device init, preprocessing).
- Energy‑efficiency analysis showing that, for large problem sizes, the Axpy implementation uses less energy per solve than the CPU baseline despite a longer wall‑clock time.
- Identification of architectural and software bottlenecks (memory bandwidth, PCIe latency, lack of native stencil primitives) and concrete recommendations for hardware‑software co‑design to make AI accelerators more HPC‑friendly.
Methodology
- Benchmark selection – The authors use the standard 2‑D 5‑point stencil (common in heat‑diffusion, CFD, and image‑processing codes).
- Kernel redesign –
- Axpy: The stencil update `out[i,j] = a*in[i,j] + b*in[i-1,j] + c*in[i+1,j] + d*in[i,j-1] + e*in[i,j+1]` is broken into five separate element‑wise matrix adds/multiplies that map naturally onto the Wormhole’s vector ALUs.
- MatMul: By padding and reshaping the input grid, the stencil is expressed as a sparse matrix‑vector product, which the accelerator can treat as a dense GEMM after appropriate tiling.
- Implementation stack – The kernels are written in Tenstorrent’s SDK (Python‑based API) and compiled to the on‑chip data‑flow ISA. Host code runs on an x86‑64 CPU and handles data movement over PCIe.
- Baseline – A highly optimized multi‑threaded CPU version (using OpenMP and cache‑blocking) serves as the reference.
- Metrics collected – End‑to‑end runtime, isolated accelerator compute time, PCIe transfer volume, power draw (via on‑board sensors), and energy‑to‑solution.
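To make the two mappings concrete, here is a minimal NumPy sketch of the same 5‑point update written three ways: a direct loop, an Axpy‑style chain of element‑wise passes, and a matrix‑multiplication form. It is an illustration on the host, not the authors' Wormhole kernels. The function names, the zero‑valued boundary, and the use of two banded shift matrices for the MatMul form are all assumptions made for this example; the paper's own padding and reshaping scheme may differ.

```python
import numpy as np

def stencil_reference(x, a, b, c, d, e):
    """Direct loop form; cells outside the grid are treated as zero (assumption)."""
    n, m = x.shape
    p = np.pad(x, 1)  # one-cell zero halo around the grid
    out = np.empty((n, m), dtype=np.float64)
    for i in range(n):
        for j in range(m):
            pi, pj = i + 1, j + 1  # position of x[i, j] inside the padded array
            out[i, j] = (a * p[pi, pj] + b * p[pi - 1, pj] + c * p[pi + 1, pj]
                         + d * p[pi, pj - 1] + e * p[pi, pj + 1])
    return out

def stencil_axpy(x, a, b, c, d, e):
    """Axpy-style form: five element-wise scale-and-accumulate passes."""
    p = np.pad(x, 1)
    out = a * p[1:-1, 1:-1]   # in[i, j]
    out += b * p[:-2, 1:-1]   # in[i-1, j]
    out += c * p[2:, 1:-1]    # in[i+1, j]
    out += d * p[1:-1, :-2]   # in[i, j-1]
    out += e * p[1:-1, 2:]    # in[i, j+1]
    return out

def stencil_matmul(x, a, b, c, d, e):
    """MatMul-style form: the neighbour shifts become two dense matrix products."""
    n, m = x.shape
    # (left @ x)[i, j] = b*x[i-1, j] + a*x[i, j] + c*x[i+1, j]
    left = a * np.eye(n) + b * np.eye(n, k=-1) + c * np.eye(n, k=1)
    # (x @ right)[i, j] = d*x[i, j-1] + e*x[i, j+1]
    right = d * np.eye(m, k=1) + e * np.eye(m, k=-1)
    return left @ x + x @ right

# Small check that the three forms agree on a random grid.
rng = np.random.default_rng(0)
grid = rng.standard_normal((6, 7))
coeffs = (0.5, 0.1, 0.2, 0.3, 0.4)
expected = stencil_reference(grid, *coeffs)
print(np.allclose(stencil_axpy(grid, *coeffs), expected))
print(np.allclose(stencil_matmul(grid, *coeffs), expected))
```

The MatMul form does far more arithmetic than the stencil needs, since most entries of the banded matrices are zero. The trade‑off it illustrates is the one the summary describes: redundant work in exchange for an operation the hardware's matrix engine handles natively.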
Results & Findings
| Metric | CPU Baseline | Axpy (Wormhole) | MatMul (Wormhole) |
|---|---|---|---|
| End‑to‑end runtime (large grid) | 1.0× (fastest) | ~3× slower | ~2.5× slower |
| Pure accelerator compute time | – | ≈1.1× CPU compute | ≈0.9× CPU compute |
| PCIe + init overhead | – | ~70 % of total time | ~60 % of total time |
| Energy per solve (large grid) | 1.0× (reference) | ~30 % lower than CPU | Slightly higher than CPU |
| Scaling with input size | Linear | Improves (energy) for > 10⁶ cells | Similar trend |
Takeaway: Once the data is resident on the accelerator, the Wormhole computes the stencil about as fast as the CPU baseline (roughly 1.1× the CPU compute time for Axpy and 0.9× for MatMul). The dominant slowdown comes from moving data across PCIe and the one‑time device setup. The Axpy variant, while slower end to end, wins on energy efficiency for large problem sizes.
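A lower energy per solve alongside a longer runtime implies a much lower average power draw, because energy is average power multiplied by time. The short sketch below works through that arithmetic with the approximate ratios from the table. It assumes that "~3× slower" means three times the CPU runtime and that the energy and runtime rows describe the same large‑grid run; `implied_power_ratio` is a hypothetical helper name, and the result is a rough estimate, not a figure reported in the paper.

```python
def implied_power_ratio(energy_ratio, runtime_ratio):
    """Average power relative to the baseline, from energy = power * time."""
    if runtime_ratio <= 0:
        raise ValueError("runtime_ratio must be positive")
    return energy_ratio / runtime_ratio

# Axpy on Wormhole vs. CPU baseline, using the table's approximate ratios:
# energy ~30 % lower (0.70x), end-to-end runtime ~3x longer (3.0x).
axpy_power = implied_power_ratio(energy_ratio=0.70, runtime_ratio=3.0)
print(f"{axpy_power:.2f}")  # 0.23
```

Under these assumptions the Axpy run would average roughly a quarter of the CPU baseline's power.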
Practical Implications
- Accelerator‑first HPC pipelines – For workloads where the same data is reused across many stencil passes (e.g., time‑stepping simulations), keeping data on‑chip could offset the PCIe penalty and make AI accelerators competitive.
- Energy‑constrained edge supercomputing – The lower Joule‑per‑solve of Axpy suggests AI chips could be attractive for remote or embedded HPC nodes where power budgets dominate.
- Software stack considerations – Developers need to factor in data‑movement costs; using unified memory or NVMe‑directed staging could shrink the host‑side gap.
- Algorithm redesign – Recasting traditional kernels into forms that match the accelerator’s strengths (e.g., GEMM‑friendly) can unlock hidden performance, as shown by the MatMul mapping.
- Tooling – The study highlights the need for profiling tools that separate host‑side and device‑side costs, which is crucial for realistic performance budgeting.
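The argument about keeping data on‑chip across many stencil passes can be checked with a simple cost model. The sketch below assumes that PCIe transfer and device setup are a fixed, one‑time cost and that on‑device compute time grows linearly with the number of passes; neither assumption is verified here against the paper's measurements. `overhead_fraction` is a hypothetical helper, and the 70 % starting point is the Axpy figure from the results table.

```python
def overhead_fraction(f0, k):
    """Share of total time spent in fixed overhead after scaling compute by k.

    f0: overhead share in the measured run (0 <= f0 < 1).
    k:  factor by which on-device work grows, e.g. 10 for ten times the passes.
    Assumes the overhead is paid once and does not grow with k.
    """
    if not 0.0 <= f0 < 1.0:
        raise ValueError("f0 must be in [0, 1)")
    if k <= 0:
        raise ValueError("k must be positive")
    return f0 / (f0 + k * (1.0 - f0))

# Starting from ~70 % overhead, see how the share shrinks as work per transfer grows.
for k in (1, 10, 100):
    print(k, round(overhead_fraction(0.70, k), 3))
```

With ten times as many passes per transfer, the modelled overhead share drops from 70 % to about 19 %. Whether real workloads follow this model depends on whether intermediate results must return to the host between passes.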
Limitations & Future Work
- PCIe bottleneck – The current Wormhole platform relies on a relatively slow PCIe Gen3 link; newer interconnects (CXL, PCIe Gen4/5) could dramatically improve end‑to‑end times.
- Memory hierarchy – On‑chip SRAM is limited; larger stencils overflow to off‑chip DRAM, incurring latency that the paper does not fully explore.
- Single‑precision focus – The experiments use FP32; mixed‑precision or integer stencil variants (common in some CFD codes) remain untested.
- Scalability – The study is limited to a single accelerator; multi‑node scaling, collective communication, and workload partitioning are open questions.
- Software maturity – The Tenstorrent SDK is still evolving; richer primitives (native stencil ops, better DMA scheduling) could reduce the need for manual kernel reformulation.
Future directions suggested by the authors include tighter CPU‑accelerator integration, hardware support for halo exchanges, and compiler extensions that automatically translate stencil DSLs into optimal data‑flow graphs for AI chips.
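For readers unfamiliar with halo exchanges, the following NumPy sketch shows why they arise when a grid is split into tiles that fit a small local memory. Each tile needs one extra row from each neighbouring tile to update its own edge cells. This is a host‑side illustration only: the row‑wise split, the zero boundary, and the function names are assumptions for the example and do not describe how the Wormhole or the authors' kernels partition data.

```python
import numpy as np

def stencil_padded(p, a, b, c, d, e):
    """Apply the 5-point update to the interior of an already padded block."""
    return (a * p[1:-1, 1:-1] + b * p[:-2, 1:-1] + c * p[2:, 1:-1]
            + d * p[1:-1, :-2] + e * p[1:-1, 2:])

def stencil_full(x, coeffs):
    """Whole-grid update with a zero boundary (assumption)."""
    return stencil_padded(np.pad(x, 1), *coeffs)

def stencil_tiled(x, coeffs, tile_rows):
    """Update the grid one row-block at a time, each with a one-row halo."""
    n, m = x.shape
    p = np.pad(x, 1)  # global zero boundary
    out = np.empty((n, m), dtype=np.float64)
    for r0 in range(0, n, tile_rows):
        r1 = min(r0 + tile_rows, n)
        # Rows r0..r1-1 of x plus one halo row above and one below.
        block = p[r0:r1 + 2, :]
        out[r0:r1, :] = stencil_padded(block, *coeffs)
    return out

rng = np.random.default_rng(0)
grid = rng.standard_normal((10, 6))
coeffs = (0.5, 0.1, 0.2, 0.3, 0.4)
print(np.allclose(stencil_tiled(grid, coeffs, tile_rows=4), stencil_full(grid, coeffs)))
```

Here the halo rows are simply read from the shared padded array. On hardware with distributed local memories, those rows would have to be sent between neighbouring cores at every time step.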
Authors
- Lorenzo Piarulli
- Daniele De Sensi
Paper Information
- arXiv ID: 2605.07599v1
- Categories: cs.DC, cs.ET
- Published: May 8, 2026