[Paper] Stencil Computations on Tenstorrent Wormhole

Published: May 8, 2026 at 07:18 AM EDT

Source: arXiv - 2605.07599v1

Overview

The paper evaluates how well the Tenstorrent Wormhole, a RISC‑V‑based AI data‑flow accelerator, can run a classic HPC kernel: the 2‑D 5‑point stencil. The authors recast the stencil either as a series of element‑wise matrix operations (Axpy) or as a matrix multiplication (MatMul), then compare runtime, energy use, and bottlenecks against a conventional CPU implementation.

Key Contributions

Two novel mapping strategies for stencil workloads on an AI accelerator:
1. Axpy – decomposes the stencil into a pipeline of element‑wise sub‑matrix operations.
2. MatMul – reformulates the stencil as a dense matrix‑multiplication to exploit the hardware’s GEMM engine.
Comprehensive profiling that isolates compute time on the Wormhole from host‑side overhead (PCIe transfers, device init, preprocessing).
Energy‑efficiency analysis showing that, for large problem sizes, the Axpy implementation uses less energy per solve than the CPU baseline despite a longer wall‑clock time.
Identification of architectural and software bottlenecks (memory bandwidth, PCIe latency, lack of native stencil primitives) and concrete recommendations for hardware‑software co‑design to make AI accelerators more HPC‑friendly.

Methodology

Benchmark selection – The authors use the standard 2‑D 5‑point stencil (common in heat‑diffusion, CFD, and image‑processing codes).
Kernel redesign –
- Axpy: The stencil update out[i,j] = a*in[i,j] + b*in[i-1,j] + c*in[i+1,j] + d*in[i,j-1] + e*in[i,j+1] is broken into five element‑wise scale‑and‑accumulate operations on shifted sub‑matrices, which map naturally onto the Wormhole’s vector ALUs.
- MatMul: By padding and reshaping the input grid, the stencil is expressed as a sparse matrix‑vector product, which the accelerator can treat as a dense GEMM after appropriate tiling.
Implementation stack – The kernels are written in Tenstorrent’s SDK (Python‑based API) and compiled to the on‑chip data‑flow ISA. Host code runs on an x86‑64 CPU and handles data movement over PCIe.
Baseline – A highly optimized multi‑threaded CPU version (using OpenMP and cache‑blocking) serves as the reference.
Metrics collected – End‑to‑end runtime, isolated accelerator compute time, PCIe transfer volume, power draw (via on‑board sensors), and energy‑to‑solution.
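
As a rough illustration of the two mappings described under Kernel redesign, the NumPy sketch below computes the same 5‑point update three ways: a literal loop, five element‑wise passes over shifted sub‑matrices (the Axpy idea), and two dense matrix products with banded coefficient matrices (one possible MatMul‑style formulation). The function names, the choice to leave boundary cells unchanged, and the banded‑matrix construction are assumptions made for this example. The paper's actual padding, reshaping, and tiling scheme may differ, and nothing here uses the Tenstorrent SDK.

```python
import numpy as np

def stencil_reference(grid, a, b, c, d, e):
    """Literal loop over interior cells, following the update formula."""
    out = grid.copy()  # assumption: boundary cells are left unchanged
    rows, cols = grid.shape
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i, j] = (a * grid[i, j]
                         + b * grid[i - 1, j] + c * grid[i + 1, j]
                         + d * grid[i, j - 1] + e * grid[i, j + 1])
    return out

def stencil_axpy(grid, a, b, c, d, e):
    """Axpy-style: five scale-and-accumulate passes over shifted views."""
    out = grid.copy()
    acc = a * grid[1:-1, 1:-1]      # centre term, in[i, j]
    acc += b * grid[:-2, 1:-1]      # in[i-1, j]
    acc += c * grid[2:, 1:-1]       # in[i+1, j]
    acc += d * grid[1:-1, :-2]      # in[i, j-1]
    acc += e * grid[1:-1, 2:]       # in[i, j+1]
    out[1:-1, 1:-1] = acc
    return out

def stencil_matmul(grid, a, b, c, d, e):
    """MatMul-style (illustrative): left @ grid + grid @ right with banded matrices."""
    rows, cols = grid.shape
    # left @ grid mixes rows: a*in[i, j] + b*in[i-1, j] + c*in[i+1, j]
    left = a * np.eye(rows) + b * np.eye(rows, k=-1) + c * np.eye(rows, k=1)
    # grid @ right mixes columns: d*in[i, j-1] + e*in[i, j+1]
    right = d * np.eye(cols, k=1) + e * np.eye(cols, k=-1)
    full = left @ grid + grid @ right
    out = grid.copy()
    out[1:-1, 1:-1] = full[1:-1, 1:-1]  # keep only interior results
    return out

# Small demonstration on a random grid with arbitrary coefficients.
rng = np.random.default_rng(0)
demo_grid = rng.random((6, 7))
coeffs = (0.5, 0.1, 0.2, 0.3, 0.4)
expected = stencil_reference(demo_grid, *coeffs)
print(np.allclose(stencil_axpy(demo_grid, *coeffs), expected))
print(np.allclose(stencil_matmul(demo_grid, *coeffs), expected))
```

The banded matrices are mostly zeros, so the MatMul form does far more arithmetic than the stencil strictly needs. The trade‑off is that the work lands on dense matrix hardware, which is the motivation the summary gives for that mapping.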

Results & Findings

Metric	CPU Baseline	Axpy (Wormhole)	MatMul (Wormhole)
End‑to‑end runtime (large grid)	1.0× (fastest)	~3× slower	~2.5× slower
Pure accelerator compute time	–	≈1.1× CPU compute	≈0.9× CPU compute
PCIe + init overhead	–	~70 % of total time	~60 % of total time
Energy per solve (large grid)	Higher	~30 % lower	Slightly higher than CPU
Scaling with input size	Linear	Improves (energy) for > 10⁶ cells	Similar trend

Takeaway: Once the data is resident on the accelerator, the Wormhole computes the stencil about as fast as a modern CPU (≈1.1× the CPU’s compute time for Axpy, ≈0.9× for MatMul). The dominant slowdown comes from moving data across PCIe and from one‑time device setup. The Axpy variant, while slowest end to end, wins on energy per solve for large problem sizes.
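
The energy result can look counterintuitive next to the runtime result. Energy‑to‑solution is average power multiplied by runtime, so the rounded table values (about 3× the runtime, about 30 % less energy) imply a much lower average power draw. The sketch below works that out. The function name is illustrative, and the resulting ratio is derived from the approximate table entries, not a figure reported by the paper.

```python
def implied_power_ratio(energy_ratio: float, runtime_ratio: float) -> float:
    """Average power relative to the baseline, from energy = power * time.

    Both arguments are ratios against the CPU baseline
    (e.g. 0.70 means 30 % less energy, 3.0 means 3x the runtime).
    """
    if runtime_ratio <= 0:
        raise ValueError("runtime_ratio must be positive")
    return energy_ratio / runtime_ratio

# Rounded values taken from the results table for the Axpy variant.
axpy_power = implied_power_ratio(energy_ratio=0.70, runtime_ratio=3.0)
print(f"Implied Axpy average power: about {axpy_power:.2f}x the CPU's")
```

Under these rounded inputs, the Axpy run would need to draw well under a third of the CPU's average power for the energy figure to hold.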

Practical Implications

Accelerator‑first HPC pipelines – For workloads where the same data is reused across many stencil passes (e.g., time‑stepping simulations), keeping data resident on the accelerator could amortize the PCIe and setup cost over many iterations and make AI accelerators competitive.
Energy‑constrained edge supercomputing – Axpy’s lower energy per solve suggests AI chips could be attractive for remote or embedded HPC nodes where power budgets dominate.
Software stack considerations – Developers need to factor in data‑movement costs; using unified memory or NVMe‑directed staging could shrink the host‑side gap.
Algorithm redesign – Recasting traditional kernels into forms that match the accelerator’s strengths (e.g., GEMM‑friendly) can unlock hidden performance, as shown by the MatMul mapping.
Tooling – The study highlights the need for profiling tools that separate host‑side and device‑side costs, which is crucial for realistic performance budgeting.
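
The amortization argument for time‑stepping workloads can be made concrete with a simple cost model: the accelerator pays a fixed overhead once (transfers plus device setup) and then a per‑step compute cost, while the CPU pays only its per‑step cost. The sketch below finds the number of steps at which the accelerator catches up. The function names and all numeric values are hypothetical and chosen only for illustration. The model also assumes the data stays on the device between steps, which the paper's measurements may not reflect.

```python
import math
from typing import Optional

def device_total(steps: int, overhead: float, device_step: float) -> float:
    """Total accelerator time: one-time overhead plus per-step compute."""
    return overhead + steps * device_step

def break_even_steps(overhead: float, cpu_step: float,
                     device_step: float) -> Optional[int]:
    """Smallest step count at which the accelerator is no slower than the CPU.

    Returns None when the accelerator's per-step time is not lower than the
    CPU's, because the fixed overhead can then never be recovered.
    """
    if device_step >= cpu_step:
        return None
    return math.ceil(overhead / (cpu_step - device_step))

# Hypothetical timings in milliseconds (not measurements from the paper).
overhead_ms, cpu_ms, device_ms = 15.0, 10.0, 9.0
steps = break_even_steps(overhead_ms, cpu_ms, device_ms)
print("break-even steps:", steps)
print("device total at break-even:", device_total(steps, overhead_ms, device_ms))
print("cpu total at break-even:", steps * cpu_ms)
```

In this model a break‑even point exists only when the per‑step device time is below the CPU's. With the table's rounded compute ratios, that would apply to the MatMul mapping (≈0.9×) but not to Axpy (≈1.1×) on runtime alone.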

Limitations & Future Work

PCIe bottleneck – The current Wormhole platform relies on a relatively slow PCIe Gen3 link; newer interconnects (CXL, PCIe Gen4/5) could dramatically improve end‑to‑end times.
Memory hierarchy – On‑chip SRAM is limited; larger stencils overflow to off‑chip DRAM, incurring latency that the paper does not fully explore.
Single‑precision focus – The experiments use FP32; mixed‑precision or integer stencil variants (common in some CFD codes) remain untested.
Scalability – The study is limited to a single accelerator; multi‑node scaling, collective communication, and workload partitioning are open questions.
Software maturity – The Tenstorrent SDK is still evolving; richer primitives (native stencil ops, better DMA scheduling) could reduce the need for manual kernel reformulation.

Future directions suggested by the authors include tighter CPU‑accelerator integration, hardware support for halo exchanges, and compiler extensions that automatically translate stencil DSLs into optimal data‑flow graphs for AI chips.
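
To make the halo idea concrete: when a grid is too large for fast local memory, one common approach is to process it in bands, each padded with a one‑row halo copied from its neighbours so that every interior cell still sees all five inputs. The NumPy sketch below shows that banding gives the same result as a whole‑grid update. The function names and the row‑band decomposition are assumptions for illustration. This is neither the authors' code nor a description of how the Wormhole or its SDK partitions data.

```python
import numpy as np

def five_point(grid, a, b, c, d, e):
    """Whole-grid 5-point update; boundary cells are left unchanged."""
    out = grid.copy()
    out[1:-1, 1:-1] = (a * grid[1:-1, 1:-1]
                       + b * grid[:-2, 1:-1] + c * grid[2:, 1:-1]
                       + d * grid[1:-1, :-2] + e * grid[1:-1, 2:])
    return out

def five_point_tiled(grid, coeffs, band_rows):
    """Same update, computed band by band with a one-row halo per side."""
    if band_rows < 1:
        raise ValueError("band_rows must be at least 1")
    out = grid.copy()
    rows = grid.shape[0]
    for start in range(1, rows - 1, band_rows):
        stop = min(start + band_rows, rows - 1)   # interior rows [start, stop)
        tile = grid[start - 1:stop + 1]           # band plus halo rows above and below
        # Drop the halo rows from the tile result before writing it back.
        out[start:stop] = five_point(tile, *coeffs)[1:-1]
    return out

# Small demonstration: banded and whole-grid updates should agree.
rng = np.random.default_rng(1)
demo = rng.random((9, 8))
coeffs = (0.5, 0.1, 0.2, 0.3, 0.4)
print(np.allclose(five_point_tiled(demo, coeffs, band_rows=3),
                  five_point(demo, *coeffs)))
```

Each band reads two extra rows that belong to its neighbours. In a multi‑step simulation those rows must be refreshed after every step, which is the exchange the authors suggest supporting in hardware.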

Authors

Lorenzo Piarulli
Daniele De Sensi

Paper Information

arXiv ID: 2605.07599v1
Categories: cs.DC, cs.ET
Published: May 8, 2026
