[Paper] TeraPool: A Physical Design Aware, 1024 RISC-V Cores Shared-L1-Memory Scaled-up Cluster Design with High Bandwidth Main Memory Link

Published: March 2, 2026, 04:05 AM EST
5 min read
Source: arXiv


Overview

The paper introduces TeraPool, a novel chip architecture that packs 1024 RISC‑V cores around a shared multi‑megabyte L1 memory while keeping the interconnect physically realizable. By moving from many small clusters to a single, “scaled‑up” cluster, the design slashes data‑movement overhead and achieves near‑gigahertz operation with impressive energy efficiency—making it a strong candidate for next‑generation AI accelerators and high‑performance compute engines.

Key Contributions

  • Massive‑scale shared‑L1 cluster: 1024 floating‑point‑capable RISC‑V cores sharing an L1 memory split across more than 4000 banks, the largest such cluster reported to date.
  • Hierarchical, low‑latency interconnect: A physically implementable PE‑to‑L1 network that scales linearly (instead of quadratically) with core count, delivering 1–11 cycle access latencies.
  • Energy‑efficient memory access: 9–13.5 pJ per bank access, comparable to the energy of a single FP32 FMA operation.
  • Full‑bandwidth HBM2E link: Integrated high‑speed main‑memory interface that can stream data at the native bandwidth of HBM2E, eliminating the classic “global‑interconnect bottleneck.”
  • Silicon results: Fabricated in 12 nm FinFET, running at 910 MHz (0.80 V, 25 °C) and delivering up to 1.89 TFLOP/s peak and 200 GFLOP/s/W sustained performance on benchmark kernels.
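The headline peak figure can be sanity‑checked from core count and clock frequency alone. The sketch below assumes each core retires one FP32 fused multiply‑add (2 FLOPs) per cycle; that issue width is an assumption for illustration, not stated in the summary above.

```python
# Back-of-envelope peak-throughput check (illustrative; the per-core
# issue width of one FMA/cycle is an assumption, not from the paper).
cores = 1024
freq_hz = 910e6          # measured clock at 0.80 V, 25 degrees C
flops_per_cycle = 2      # one fused multiply-add (FMA) = 2 FLOPs

peak_tflops = cores * freq_hz * flops_per_cycle / 1e12
print(f"Peak FP32 throughput: {peak_tflops:.2f} TFLOP/s")  # prints 1.86
```

At 910 MHz this gives roughly 1.86 TFLOP/s, close to the quoted 1.89 TFLOP/s peak; the small gap suggests the peak figure was measured at a slightly higher clock.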

Methodology

  1. Architecture Design – The authors start from the observation that splitting workloads across many small clusters forces frequent data shuffling. They therefore propose a single large cluster where all cores can directly address a common L1 memory.
  2. Physical‑design‑aware Interconnect – To avoid the quadratic blow‑up of a full crossbar, they build a hierarchical network: cores are grouped into small sub‑clusters that connect to a set of memory banks via a multi‑stage router. This keeps wiring length and routing congestion low, which is critical for a 1024‑core die.
  3. Memory Banking – The L1 is divided into >4000 banks, each independently addressable. Banking spreads traffic, reduces contention, and lets the interconnect route requests in parallel.
  4. HBM2E Integration – A dedicated high‑bandwidth link (similar to a memory controller) sits at the edge of the cluster, feeding the shared L1 with data at HBM2E rates.
  5. Silicon Prototyping – The whole system is taped‑out in a 12 nm FinFET process. Post‑silicon measurements validate frequency, latency, power, and performance on a suite of compute kernels (matrix‑multiply, convolution, etc.).
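To make the banking step concrete, a word‑level interleaved address mapping can be sketched as below. The bank count, word size, and address split are illustrative assumptions (the paper only states more than 4000 banks), not the design's actual mapping.

```python
# Illustrative word-interleaved address mapping for a banked shared L1.
# NUM_BANKS and WORD_BYTES are assumed values for this sketch, not the
# paper's actual parameters.
NUM_BANKS = 4096
WORD_BYTES = 4  # one FP32 word


def l1_map(byte_addr: int) -> tuple:
    """Map a byte address to (bank, row) under word interleaving.

    Consecutive words land in consecutive banks, so streaming accesses
    spread across all banks and requests can be routed in parallel.
    """
    word = byte_addr // WORD_BYTES
    return word % NUM_BANKS, word // NUM_BANKS


# Sixteen consecutive FP32 words hit sixteen different banks:
banks = [l1_map(a)[0] for a in range(0, 16 * WORD_BYTES, WORD_BYTES)]
print(banks)  # 0 through 15
```

This is the standard reason banking reduces contention: under interleaving, unit‑stride traffic never serializes on a single bank.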

Results & Findings

| Metric | Achieved |
| --- | --- |
| Core count | 1024 RISC‑V PEs |
| Clock frequency | 910 MHz (typical) |
| Peak FP32 performance | 1.89 TFLOP/s |
| Energy efficiency | 200 GFLOP/s/W (average IPC ≈ 0.8) |
| L1 access latency | 1–11 cycles (hierarchy‑dependent) |
| Memory‑bank access energy | 9–13.5 pJ (≈ 0.74–1.1 × FMA energy) |
| HBM2E bandwidth utilization | Full native bandwidth sustained |

The results demonstrate that a shared‑L1 cluster can be scaled to a thousand cores without prohibitive area or power penalties, and that the hierarchical interconnect adds only a few cycles of latency while keeping per‑access energy on par with compute. Benchmark kernels achieve high IPC, confirming that the architecture can keep the cores fed with data.
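The "energy on par with compute" claim can be cross‑checked against the quoted ratio range. The FP32 FMA energy used below (~12.2 pJ) is inferred so as to reproduce the 0.74–1.1× figures; it is an assumption, not a value stated in the summary.

```python
# Cross-check: bank-access energy relative to an FP32 FMA.
# fma_pj is an inferred assumption (~12.2 pJ), chosen so the quoted
# 0.74-1.1x range is reproduced; the paper may use a different value.
fma_pj = 12.2
access_pj = (9.0, 13.5)   # reported bank-access energy range

ratios = [round(e / fma_pj, 2) for e in access_pj]
print(ratios)  # [0.74, 1.11]
```

The match suggests the paper's comparison point is an FMA of roughly 12 pJ, i.e. reading a word from a far bank costs about as much as computing with it.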

Practical Implications

  • AI/ML accelerators – The combination of massive parallelism, high‑bandwidth memory, and low‑energy data movement makes TeraPool a strong template for inference engines that need to process large tensors with minimal latency.
  • Edge‑to‑cloud compute modules – Because the design runs at sub‑1 GHz frequencies with excellent energy efficiency, it can be integrated into power‑constrained platforms (e.g., autonomous drones, smart cameras) that still demand high FLOP counts.
  • RISC‑V ecosystem – By demonstrating a scalable, production‑grade RISC‑V cluster, the work lowers the barrier for other vendors to build custom accelerators on the open ISA, encouraging a richer software stack and tooling support.
  • System‑level design – The hierarchical interconnect approach can be reused in other many‑core chips (e.g., CPUs, DSPs) to mitigate routing congestion, enabling higher core counts without a full crossbar.
  • Memory‑centric computing – The tight coupling of a shared L1 with an HBM2E link showcases a memory‑centric paradigm where data stays close to compute, reducing the need for costly global networks.
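One way to quantify this memory‑centric balance is the arithmetic intensity a kernel needs to stay compute‑bound, per a simple roofline argument. The HBM2E bandwidth below (~460 GB/s, a typical single‑stack figure) is an assumption for illustration; the summary does not state the link's actual width.

```python
# Minimum arithmetic intensity (FLOP/byte) to be compute-bound.
# The 460 GB/s HBM2E figure is an assumed single-stack bandwidth,
# not a number taken from the paper.
peak_flops = 1.89e12     # quoted peak FP32 performance
hbm_bw = 460e9           # assumed HBM2E bandwidth, bytes/s

min_intensity = peak_flops / hbm_bw
print(f"Kernels need >= {min_intensity:.1f} FLOP/byte to be compute-bound")
```

Under these assumptions, kernels reusing each fetched byte for roughly four or more FLOPs (e.g. blocked matrix multiply) can saturate the cores; lower‑intensity kernels are bandwidth‑bound even with the full‑rate HBM2E link.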

Limitations & Future Work

  • Scalability beyond 1024 cores – While the hierarchical network mitigates quadratic growth, further scaling may still hit routing density limits; exploring 3‑D stacking or chiplet integration could be next steps.
  • Software ecosystem – Efficiently mapping workloads onto a shared‑L1, thousand‑core fabric requires sophisticated compilers and runtime systems; the paper notes the need for tooling that can handle data placement and synchronization at this scale.
  • Process dependence – Results are tied to a 12 nm FinFET node; porting the design to newer nodes (e.g., 5 nm) could improve density and power but may also introduce new timing challenges for the hierarchical interconnect.
  • General‑purpose workloads – The evaluation focuses on floating‑point kernels; assessing performance on mixed‑precision, integer, or control‑heavy workloads would broaden the applicability of the architecture.

TL;DR: TeraPool proves that a single, physically‑realizable cluster of 1024 RISC‑V cores sharing a massive, banked L1 memory can deliver near‑gigahertz speeds, teraflop‑scale compute, and industry‑leading energy efficiency. Its hierarchical interconnect and full‑bandwidth HBM2E link open a practical path for developers building next‑gen AI accelerators and many‑core systems on the open RISC‑V platform.

Authors

  • Yichao Zhang
  • Marco Bertuletti
  • Chi Zhang
  • Samuel Riedel
  • Diyou Shen
  • Bowen Wang
  • Alessandro Vanelli-Coralli
  • Luca Benini

Paper Information

  • arXiv ID: 2603.01629v1
  • Categories: cs.DC, cs.AR
  • Published: March 2, 2026