[Paper] TeraPool: A Physical Design Aware, 1024 RISC-V Cores Shared-L1-Memory Scaled-up Cluster Design with High Bandwidth Main Memory Link

Published: March 2, 2026, 04:05 AM EST
5 min read
Source: arXiv


Overview

The paper introduces TeraPool, a novel chip architecture that packs 1024 RISC‑V cores around a shared multi‑megabyte L1 memory while keeping the interconnect physically realizable. By moving from many small clusters to a single, “scaled‑up” cluster, the design slashes data‑movement overhead and achieves near‑gigahertz operation with impressive energy efficiency—making it a strong candidate for next‑generation AI accelerators and high‑performance compute engines.

Key Contributions

  • Massive‑scale shared‑L1 cluster: 1024 floating‑point‑capable RISC‑V cores sharing an L1 memory split across more than 4000 banks, the largest such cluster reported to date.
  • Hierarchical, low‑latency interconnect: A physically implementable PE‑to‑L1 network that scales linearly (instead of quadratically) with core count, delivering 1–11 cycle access latencies.
  • Energy‑efficient memory access: 9–13.5 pJ per bank access, comparable to the energy of a single FP32 FMA operation.
  • Full‑bandwidth HBM2E link: Integrated high‑speed main‑memory interface that can stream data at the native bandwidth of HBM2E, eliminating the classic “global‑interconnect bottleneck.”
  • Silicon results: Fabricated in 12 nm FinFET, running at 910 MHz (0.80 V, 25 °C) and delivering up to 1.89 TFLOP/s peak and 200 GFLOP/s/W sustained performance on benchmark kernels.
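The headline peak figure can be sanity‑checked from core count and clock frequency alone. The sketch below assumes each core retires one FP32 fused multiply‑add (2 FLOPs) per cycle; that issue width is an assumption for illustration, not stated in the summary above.

```python
# Back-of-envelope peak-throughput check (illustrative; the per-core
# issue width of one FMA/cycle is an assumption, not from the paper).
cores = 1024
freq_hz = 910e6          # measured clock at 0.80 V, 25 degrees C
flops_per_cycle = 2      # one fused multiply-add (FMA) = 2 FLOPs

peak_tflops = cores * freq_hz * flops_per_cycle / 1e12
print(f"Peak FP32 throughput: {peak_tflops:.2f} TFLOP/s")  # prints 1.86
```

At 910 MHz this gives roughly 1.86 TFLOP/s, close to the quoted 1.89 TFLOP/s peak; the small gap suggests the peak figure was measured at a slightly higher clock.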

Methodology

  1. Architecture Design – The authors start from the observation that splitting workloads across many small clusters forces frequent data shuffling. They therefore propose a single large cluster where all cores can directly address a common L1 memory.
  2. Physical‑design‑aware Interconnect – To avoid the quadratic blow‑up of a full crossbar, they build a hierarchical network: cores are grouped into small sub‑clusters that connect to a set of memory banks via a multi‑stage router. This keeps wiring length and routing congestion low, which is critical for a 1024‑core die.
  3. Memory Banking – The L1 is divided into >4000 banks, each independently addressable. Banking spreads traffic, reduces contention, and lets the interconnect route requests in parallel.
  4. HBM2E Integration – A dedicated high‑bandwidth link (similar to a memory controller) sits at the edge of the cluster, feeding the shared L1 with data at HBM2E rates.
  5. Silicon Prototyping – The whole system is taped‑out in a 12 nm FinFET process. Post‑silicon measurements validate frequency, latency, power, and performance on a suite of compute kernels (matrix‑multiply, convolution, etc.).
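To make the banking step concrete, a word‑level interleaved address mapping can be sketched as below. The bank count, word size, and address split are illustrative assumptions (the paper only states more than 4000 banks), not the design's actual mapping.

```python
# Illustrative word-interleaved address mapping for a banked shared L1.
# NUM_BANKS and WORD_BYTES are assumed values for this sketch, not the
# paper's actual parameters.
NUM_BANKS = 4096
WORD_BYTES = 4  # one FP32 word


def l1_map(byte_addr: int) -> tuple:
    """Map a byte address to (bank, row) under word interleaving.

    Consecutive words land in consecutive banks, so streaming accesses
    spread across all banks and requests can be routed in parallel.
    """
    word = byte_addr // WORD_BYTES
    return word % NUM_BANKS, word // NUM_BANKS


# Sixteen consecutive FP32 words hit sixteen different banks:
banks = [l1_map(a)[0] for a in range(0, 16 * WORD_BYTES, WORD_BYTES)]
print(banks)  # 0 through 15
```

This is the standard reason banking reduces contention: under interleaving, unit‑stride traffic never serializes on a single bank.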

Results & Findings

| Metric | Achieved |
| --- | --- |
| Core count | 1024 RISC‑V PEs |
| Clock frequency | 910 MHz (typical) |
| Peak FP32 performance | 1.89 TFLOP/s |
| Energy efficiency | 200 GFLOP/s/W (average IPC ≈ 0.8) |
| L1 access latency | 1–11 cycles (hierarchy‑dependent) |
| Memory‑bank access energy | 9–13.5 pJ (≈ 0.74–1.1 × FMA energy) |
| HBM2E bandwidth utilization | Full native bandwidth sustained |

The results demonstrate that a shared‑L1 cluster can be scaled to a thousand cores without prohibitive area or power penalties, and that the hierarchical interconnect adds only a few cycles of latency while keeping per‑access energy on par with compute. Benchmark kernels achieve high IPC, confirming that the architecture can keep the cores fed with data.
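The "energy on par with compute" claim can be cross‑checked against the quoted ratio range. The FP32 FMA energy used below (~12.2 pJ) is inferred so as to reproduce the 0.74–1.1× figures; it is an assumption, not a value stated in the summary.

```python
# Cross-check: bank-access energy relative to an FP32 FMA.
# fma_pj is an inferred assumption (~12.2 pJ), chosen so the quoted
# 0.74-1.1x range is reproduced; the paper may use a different value.
fma_pj = 12.2
access_pj = (9.0, 13.5)   # reported bank-access energy range

ratios = [round(e / fma_pj, 2) for e in access_pj]
print(ratios)  # [0.74, 1.11]
```

The match suggests the paper's comparison point is an FMA of roughly 12 pJ, i.e. reading a word from a far bank costs about as much as computing with it.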

Practical Implications

  • AI/ML accelerators – The combination of massive parallelism, high‑bandwidth memory, and low‑energy data movement makes TeraPool a strong template for inference engines that need to process large tensors with minimal latency.
  • Edge‑to‑cloud compute modules – Because the design runs at sub‑1 GHz frequencies with excellent energy efficiency, it can be integrated into power‑constrained platforms (e.g., autonomous drones, smart cameras) that still demand high FLOP counts.
  • RISC‑V ecosystem – By demonstrating a scalable, production‑grade RISC‑V cluster, the work lowers the barrier for other vendors to build custom accelerators on the open ISA, encouraging a richer software stack and tooling support.
  • System‑level design – The hierarchical interconnect approach can be reused in other many‑core chips (e.g., CPUs, DSPs) to mitigate routing congestion, enabling higher core counts without a full crossbar.
  • Memory‑centric computing – The tight coupling of a shared L1 with an HBM2E link showcases a memory‑centric paradigm where data stays close to compute, reducing the need for costly global networks.
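One way to quantify this memory‑centric balance is the arithmetic intensity a kernel needs to stay compute‑bound, per a simple roofline argument. The HBM2E bandwidth below (~460 GB/s, a typical single‑stack figure) is an assumption for illustration; the summary does not state the link's actual width.

```python
# Minimum arithmetic intensity (FLOP/byte) to be compute-bound.
# The 460 GB/s HBM2E figure is an assumed single-stack bandwidth,
# not a number taken from the paper.
peak_flops = 1.89e12     # quoted peak FP32 performance
hbm_bw = 460e9           # assumed HBM2E bandwidth, bytes/s

min_intensity = peak_flops / hbm_bw
print(f"Kernels need >= {min_intensity:.1f} FLOP/byte to be compute-bound")
```

Under these assumptions, kernels reusing each fetched byte for roughly four or more FLOPs (e.g. blocked matrix multiply) can saturate the cores; lower‑intensity kernels are bandwidth‑bound even with the full‑rate HBM2E link.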

Limitations & Future Work

  • Scalability beyond 1024 cores – While the hierarchical network mitigates quadratic growth, further scaling may still hit routing density limits; exploring 3‑D stacking or chiplet integration could be next steps.
  • Software ecosystem – Efficiently mapping workloads onto a shared‑L1, thousand‑core fabric requires sophisticated compilers and runtime systems; the paper notes the need for tooling that can handle data placement and synchronization at this scale.
  • Process dependence – Results are tied to a 12 nm FinFET node; porting the design to newer nodes (e.g., 5 nm) could improve density and power but may also introduce new timing challenges for the hierarchical interconnect.
  • General‑purpose workloads – The evaluation focuses on floating‑point kernels; assessing performance on mixed‑precision, integer, or control‑heavy workloads would broaden the applicability of the architecture.

TL;DR: TeraPool proves that a single, physically‑realizable cluster of 1024 RISC‑V cores sharing a massive, banked L1 memory can deliver near‑gigahertz speeds, teraflop‑scale compute, and industry‑leading energy efficiency. Its hierarchical interconnect and full‑bandwidth HBM2E link open a practical path for developers building next‑gen AI accelerators and many‑core systems on the open RISC‑V platform.

Authors

  • Yichao Zhang
  • Marco Bertuletti
  • Chi Zhang
  • Samuel Riedel
  • Diyou Shen
  • Bowen Wang
  • Alessandro Vanelli-Coralli
  • Luca Benini

Paper Information

  • arXiv ID: 2603.01629v1
  • Categories: cs.DC, cs.AR
  • Published: March 2, 2026