[Paper] Stencil Computations on Cerebras Wafer-Scale Engine

Published: May 8, 2026 at 12:19 PM EDT

Source: arXiv - 2605.07954v1

Overview

Stencil computations, core kernels in fluid dynamics, climate modeling, and many other scientific simulations, are notoriously memory‑bound on conventional HPC platforms such as GPUs. This paper explores an unconventional solution: running 2‑D stencil kernels on the Cerebras Wafer‑Scale Engine (WSE‑3), a wafer‑sized AI‑focused processor with 44 GB of on‑chip SRAM distributed across its cores and a high‑bandwidth mesh network. The authors introduce CStencil, a framework that maps stencil workloads onto the WSE‑3, and report speedups of up to 342× over a GPU baseline that has been adapted to the same single precision.

Key Contributions

CStencil framework: a first‑of‑its‑kind library that implements 2‑D stencil kernels on the Cerebras WSE‑3, handling data layout, tiling, and the engine’s unique dataflow model.
Fair GPU baseline: adaptation of the state‑of‑the‑art ConvStencil GPU solver from double‑ to single‑precision, ensuring an apples‑to‑apples comparison on an NVIDIA A100.
Empirical performance evaluation: extensive benchmarks showing up to 342× speedup, with a roofline analysis that places CStencil in the compute‑bound regime on the WSE‑3, where on‑chip memory bandwidth is no longer the limiting factor.
Architectural insight: demonstration that the WSE‑3’s distributed SRAM and mesh interconnect can eliminate the off‑chip memory bottleneck that limits stencil performance on GPUs.
Open‑source artifacts: release of the CStencil code and the modified ConvStencil benchmark, enabling reproducibility and further exploration by the community.

Methodology

Problem selection: The authors focus on classic 2‑D stencil patterns (e.g., 5‑point Laplacian) that are representative of many scientific codes.
Porting to the WSE‑3: Using Cerebras’ SDK, they express the stencil as a dataflow graph where each compute tile reads from its local SRAM, performs the arithmetic, and writes results back, leveraging the mesh network for halo exchanges between neighboring tiles.
GPU baseline preparation: ConvStencil, originally a double‑precision GPU stencil solver, is re‑implemented in single‑precision to match the precision used on the WSE‑3, and all kernel launch parameters are tuned for the A100.
Performance modeling: A roofline model is built for both platforms from measured peak FLOP/s and memory bandwidth (on‑chip SRAM for the WSE‑3, HBM2 for the A100). The model shows how close each system comes to its theoretical limits.
Benchmarking: A suite of problem sizes (from small grids that fit in a single WSE‑3 core to large domains that span the entire wafer) is run, measuring execution time, throughput, and energy consumption.
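
The tiling and halo‑exchange idea described in the porting step can be sketched in plain Python. This is a toy model, not CStencil or Cerebras SDK code: nested lists stand in for each tile's local SRAM, a dictionary lookup stands in for the mesh network, and every function name here is hypothetical. It assumes a 5‑point averaging stencil with zeros outside the domain.

```python
# Toy model only: none of these names come from CStencil or the Cerebras SDK.

def reference_step(grid):
    """One 5-point averaging update on the whole grid, zero outside the domain."""
    h, w = len(grid), len(grid[0])

    def at(i, j):
        return grid[i][j] if 0 <= i < h and 0 <= j < w else 0.0

    return [[0.2 * (at(i, j) + at(i - 1, j) + at(i + 1, j)
                    + at(i, j - 1) + at(i, j + 1))
             for j in range(w)] for i in range(h)]


def split_into_tiles(grid, th, tw):
    """Cut the grid into th x tw blocks, each padded with a one-cell halo ring."""
    h, w = len(grid), len(grid[0])
    assert h % th == 0 and w % tw == 0, "grid must divide evenly into tiles"
    tiles = {}
    for ti in range(h // th):
        for tj in range(w // tw):
            block = [[0.0] * (tw + 2) for _ in range(th + 2)]
            for i in range(th):
                for j in range(tw):
                    block[i + 1][j + 1] = grid[ti * th + i][tj * tw + j]
            tiles[(ti, tj)] = block
    return tiles


def exchange_halos(tiles, th, tw):
    """Fill each tile's halo ring from the edge cells of its four neighbours."""
    for (ti, tj), block in tiles.items():
        north, south = tiles.get((ti - 1, tj)), tiles.get((ti + 1, tj))
        west, east = tiles.get((ti, tj - 1)), tiles.get((ti, tj + 1))
        for j in range(1, tw + 1):
            if north is not None:
                block[0][j] = north[th][j]
            if south is not None:
                block[th + 1][j] = south[1][j]
        for i in range(1, th + 1):
            if west is not None:
                block[i][0] = west[i][tw]
            if east is not None:
                block[i][tw + 1] = east[i][1]


def local_step(block, th, tw):
    """Update a tile's interior using only its own block (interior plus halo)."""
    new = [[0.0] * (tw + 2) for _ in range(th + 2)]
    for i in range(1, th + 1):
        for j in range(1, tw + 1):
            new[i][j] = 0.2 * (block[i][j] + block[i - 1][j] + block[i + 1][j]
                               + block[i][j - 1] + block[i][j + 1])
    return new


def tiled_step(tiles, th, tw):
    """One time step: exchange halos, then let every tile compute locally."""
    exchange_halos(tiles, th, tw)
    return {key: local_step(block, th, tw) for key, block in tiles.items()}


def gather(tiles, th, tw):
    """Reassemble the tile interiors into one grid."""
    rows = (max(k[0] for k in tiles) + 1) * th
    cols = (max(k[1] for k in tiles) + 1) * tw
    grid = [[0.0] * cols for _ in range(rows)]
    for (ti, tj), block in tiles.items():
        for i in range(th):
            for j in range(tw):
                grid[ti * th + i][tj * tw + j] = block[i + 1][j + 1]
    return grid
```

Splitting an 8×8 grid with `split_into_tiles(grid, 4, 4)`, calling `tiled_step` a few times, and comparing `gather(...)` against repeated `reference_step` calls should give matching grids. The point of the sketch is that after the halo exchange each tile touches only its own block, which is the property that lets the data stay in local memory.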

Results & Findings

Speedup: CStencil outperforms the single‑precision ConvStencil on A100 by 2.8×–342×, with the largest gains observed on problem sizes that fully exploit the wafer‑scale on‑chip memory.
Roofline saturation: On the WSE‑3, the stencil kernels reach the compute‑bound region of the roofline, indicating that arithmetic throughput, not SRAM bandwidth, is the limiting resource. The GPU baseline remains memory‑bound despite HBM2’s high bandwidth.
Memory traffic reduction: Because all data resides in on‑chip SRAM, halo exchanges are handled by the mesh network with negligible latency, eliminating the costly off‑chip DRAM accesses that dominate GPU runtimes.
Energy efficiency: Preliminary power measurements suggest that CStencil consumes ~30% less energy per stencil update than the GPU baseline, thanks to reduced data movement.
Scalability: Performance scales linearly with the number of active tiles up to the full wafer, confirming that the mesh interconnect does not become a bottleneck for the examined stencil patterns.
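
The roofline reasoning behind the memory‑bound versus compute‑bound distinction can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the two machines are hypothetical round numbers, not measurements from the paper, and the stencil is assumed to be a weighted 5‑point update in single precision with ideal reuse of neighbouring values.

```python
def arithmetic_intensity(flops_per_point, bytes_per_point):
    """FLOPs performed per byte moved to or from memory."""
    return flops_per_point / bytes_per_point


def attainable_flops(peak_flops, bandwidth, intensity):
    """Roofline bound: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, bandwidth * intensity)


def is_memory_bound(peak_flops, bandwidth, intensity):
    """True when the kernel sits left of the ridge point peak_flops / bandwidth."""
    return intensity < peak_flops / bandwidth


# Assumed kernel: 5 multiplies + 4 adds per point, and (with ideal reuse of
# neighbours) one 4-byte read plus one 4-byte write per point.
intensity = arithmetic_intensity(flops_per_point=9, bytes_per_point=8)

# Hypothetical machines in FLOP/s and bytes/s, NOT figures from the paper.
off_chip = {"peak_flops": 20e12, "bandwidth": 1.5e12}
on_chip = {"peak_flops": 100e12, "bandwidth": 1000e12}

for name, machine in (("off_chip", off_chip), ("on_chip", on_chip)):
    bound = "memory" if is_memory_bound(intensity=intensity, **machine) else "compute"
    print(name, attainable_flops(intensity=intensity, **machine), bound)
```

With a low‑intensity kernel like this one, the machine whose memory bandwidth is small relative to its peak compute is capped by the bandwidth slope, while the machine with very high bandwidth hits its compute roof first. That is the qualitative shift the findings above describe.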

Practical Implications

HPC developers can consider wafer‑scale engines as viable accelerators for memory‑intensive kernels, not just AI workloads.
Legacy scientific codes that rely on stencil patterns could be refactored to use a dataflow model, potentially gaining large speedups without changing the underlying numerical algorithm.
Cloud providers offering Cerebras as a service may attract a new class of scientific users seeking to overcome the “memory wall” that plagues traditional GPU clusters.
Compiler and runtime tooling can take inspiration from CStencil’s tiling and halo‑exchange strategies to automate similar transformations for other memory‑bound kernels (e.g., finite‑difference time‑domain, cellular automata).
Energy‑constrained environments (e.g., edge HPC or exascale data centers) could benefit from the lower data‑movement costs of on‑chip SRAM, reducing operational expenses.

Limitations & Future Work

Precision focus: The study targets single‑precision arithmetic; many scientific domains still require double‑precision or mixed‑precision schemes, which may expose different performance characteristics on the WSE‑3.
2‑D only: While 2‑D stencils are a useful proxy, extending the approach to 3‑D kernels (common in climate and CFD) could encounter new challenges in tile communication and memory footprint.
Software ecosystem: CStencil currently relies on hand‑crafted dataflow graphs; integrating with higher‑level DSLs and programming models (e.g., Halide, Kokkos) would lower the barrier for adoption.
Portability: The performance gains are tightly coupled to the WSE‑3’s architecture; exploring how the techniques translate to other wafer‑scale or large‑SRAM platforms remains an open question.
Comprehensive energy analysis: The paper provides preliminary power numbers; a full lifecycle energy assessment (including cooling and system overhead) would strengthen the case for real‑world deployment.
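
To make the 3‑D concern above more concrete, the following sketch counts halo cells per interior cell for a few tile shapes with the same cell budget. Everything here is an illustrative assumption: the 4096‑cell budget, the radius‑1 stencil, and the two 3‑D decompositions are hypothetical and are not taken from the paper.

```python
def halo_fraction(tile_shape, split_axes):
    """Halo cells per interior cell for a radius-1 stencil.

    tile_shape: interior extent of the tile along each axis.
    split_axes: axes along which the domain is divided between tiles;
                each such axis contributes two faces of halo cells.
    """
    interior = 1
    for extent in tile_shape:
        interior *= extent
    halo = 0
    for axis in split_axes:
        halo += 2 * interior // tile_shape[axis]  # two faces per split axis
    return halo / interior


# Three tiles that all hold 4096 interior cells (a hypothetical budget).
flat_2d = halo_fraction((64, 64), split_axes=(0, 1))
column_3d = halo_fraction((8, 8, 64), split_axes=(0, 1))      # z kept local
cube_3d = halo_fraction((16, 16, 16), split_axes=(0, 1, 2))   # split on all axes

print(flat_2d, column_3d, cube_3d)
```

For a fixed amount of local memory, both 3‑D layouts exchange noticeably more halo data per interior cell than the 2‑D tile, which is one way the communication and footprint challenges mentioned above could show up in practice.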

Authors

Elia Belli
Daniele De Sensi

Paper Information

arXiv ID: 2605.07954v1
Categories: cs.DC, cs.CE, cs.ET
Published: May 8, 2026
