[Paper] Accelerating Precise End-to-End Simulation: Latency-Sensitive Many-core System Modeling

Published: May 8, 2026 at 09:56 AM EDT

Source: arXiv - 2605.07750v1

Overview

The paper tackles a growing bottleneck for developers building large‑scale AI accelerators: how to simulate many‑core systems fast enough to be useful while still keeping timing accuracy. By introducing a hybrid modeling technique that faithfully reproduces latency‑sensitive scratch‑pad memory (SPM) accesses across massive interconnects (e.g., a 1024‑core “TeraNoC”), the authors achieve simulation speeds up to 115× faster than traditional cycle‑accurate RTL, with less than 7 % error. This makes early‑stage software‑hardware co‑design practical for the next generation of AI chips.

Key Contributions

Hybrid End‑to‑End Modeling Framework – combines a lightweight abstracted hardware model with precise latency‑sensitive timing for SPM accesses and NoC traffic.
Scalable Accuracy – validates the model against a full RTL golden reference on diverse benchmarks, staying within a 7 % error margin.
Massive Speedup – delivers up to 115× faster simulation, turning weeks of RTL runs into hours.
Fine‑Grained Profiling Infrastructure – automatically extracts per‑core, per‑router, and per‑memory‑access statistics for software optimization.
Two Real‑World Case Studies
1. FlashAttention‑2 optimization – identifies and eliminates interconnect stalls, cutting synchronization overhead.
2. NoC router‑remapping exploration – shows how traffic‑aware remapping improves throughput and balances load.

Methodology

System Partitioning – The many‑core chip is split into three logical layers:
- Processing Elements (PEs) – modeled as instruction‑level simulators that issue memory requests.
- Latency‑Sensitive Interconnect – a cycle‑accurate “traffic engine” that tracks packet latency, contention, and arbitration on each hop.
- Scratch‑Pad Memory (SPM) Subsystem – abstracted as a set of addressable banks with configurable access latency, but still responsive to contention from multiple routers.
Selective Detail Preservation – Non‑critical hardware (e.g., pipeline micro‑architectural nuances, power gating) is abstracted away, while any component that can cause observable latency spikes (SPM arbitration, router queuing) is kept cycle‑accurate.
Event‑Driven Simulation Core – Requests from PEs are injected into the interconnect engine; each event carries a timestamp, allowing the simulator to advance time only when needed (i.e., time‑skipping).
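A minimal sketch of the time‑skipping idea, assuming a single SPM bank and a fixed network latency. The paper's traffic engine tracks per‑hop contention and arbitration, which this sketch omits. The names `Event`, `simulate`, `net_latency`, and `spm_latency` are hypothetical and not taken from the paper.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Event:
    time: int                          # cycle at which the event fires
    seq: int                           # tie-breaker: equal-time events stay FIFO
    kind: str = field(compare=False)   # "inject", "arrive", or "complete"
    pe: int = field(compare=False)     # issuing processing element


def simulate(requests, net_latency=6, spm_latency=2):
    """requests: list of (issue_cycle, pe_id), one request per PE.

    Returns ({pe_id: completion_cycle}, number_of_events_processed).
    """
    queue, seq = [], 0
    for issue, pe in requests:
        heapq.heappush(queue, Event(issue, seq, "inject", pe))
        seq += 1

    bank_free = 0   # first cycle at which the (single, assumed) bank is idle
    done = {}
    steps = 0
    while queue:
        ev = heapq.heappop(queue)
        now = ev.time   # time skipping: jump straight to the next timestamp
        steps += 1
        if ev.kind == "inject":
            heapq.heappush(queue, Event(now + net_latency, seq, "arrive", ev.pe))
            seq += 1
        elif ev.kind == "arrive":
            start = max(now, bank_free)      # wait if the bank is still busy
            bank_free = start + spm_latency
            heapq.heappush(queue, Event(bank_free, seq, "complete", ev.pe))
            seq += 1
        else:
            done[ev.pe] = now
    return done, steps
```

With requests issued at cycles 0, 0, and 1000, the loop processes nine events instead of stepping through roughly a thousand idle cycles, and the second request completes two cycles after the first because both contend for the same bank.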
Calibration & Validation – The model’s parameters (router pipeline depth, SPM bank arbitration latency, etc.) are tuned against a full RTL simulation of a smaller prototype. The calibrated model is then applied to the full 1024‑core configuration.
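The summary does not say how the tuning is performed. As one illustration only, the sketch below fits a linear latency model (`latency = per_hop * hops + fixed`) to reference measurements by ordinary least squares. The function name `fit_latency_model`, the linear model, and the sample values are all assumptions, not the authors' procedure or data.

```python
def fit_latency_model(samples):
    """Least-squares fit of latency = per_hop * hops + fixed.

    samples: list of (hops, measured_latency_cycles) pairs, e.g. collected
    from a reference simulation of a small prototype (hypothetical here).
    Returns (per_hop, fixed).
    """
    n = len(samples)
    mean_h = sum(h for h, _ in samples) / n
    mean_l = sum(l for _, l in samples) / n
    var_h = sum((h - mean_h) ** 2 for h, _ in samples)
    if var_h == 0:
        raise ValueError("need at least two distinct hop counts")
    cov = sum((h - mean_h) * (l - mean_l) for h, l in samples)
    per_hop = cov / var_h
    fixed = mean_l - per_hop * mean_h
    return per_hop, fixed


# Made-up measurements: 2 cycles per hop plus 3 cycles of fixed overhead.
per_hop, fixed = fit_latency_model([(1, 5), (2, 7), (3, 9), (4, 11)])
```

Under this assumption, `per_hop` would stand in for the router pipeline depth and `fixed` for the SPM bank arbitration latency; the fitted values would then be reused unchanged in the larger configuration.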
Profiling Hooks – Instrumentation points automatically log:
- Per‑core stall cycles,
- Router queue lengths,
- SPM bank utilization,
- End‑to‑end request latency.
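
The four logged quantities above could be gathered by a small collector that the simulator calls at its instrumentation points. This is a sketch under assumptions: the class `Profiler` and its method names are hypothetical, and the paper's actual hook interface is not described in this summary.

```python
from collections import defaultdict


class Profiler:
    """Collects the four statistics listed above (hypothetical interface)."""

    def __init__(self):
        self.stall_cycles = defaultdict(int)     # core id -> stalled cycles
        self.peak_queue_len = defaultdict(int)   # router id -> peak queue length
        self.bank_busy = defaultdict(int)        # bank id -> busy cycles
        self.latencies = []                      # end-to-end request latencies

    def on_stall(self, core, cycles):
        self.stall_cycles[core] += cycles

    def on_queue_sample(self, router, length):
        self.peak_queue_len[router] = max(self.peak_queue_len[router], length)

    def on_bank_access(self, bank, busy_cycles):
        self.bank_busy[bank] += busy_cycles

    def on_request_done(self, issue_cycle, complete_cycle):
        self.latencies.append(complete_cycle - issue_cycle)

    def summary(self, total_cycles):
        """Return plain dicts; utilization is busy cycles over total cycles."""
        mean = sum(self.latencies) / len(self.latencies) if self.latencies else 0.0
        return {
            "stall_cycles": dict(self.stall_cycles),
            "peak_queue_len": dict(self.peak_queue_len),
            "bank_utilization": {
                bank: busy / total_cycles for bank, busy in self.bank_busy.items()
            },
            "mean_latency": mean,
        }
```

Keeping the hooks as plain method calls means the simulation core only pays for a dictionary update per event, and the summary can be exported per core, per router, or per bank after the run.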

Results & Findings

Benchmark	RTL Cycle Count	Model Cycle Count	Error	Speed‑up
Matrix‑Mul (64‑core)	1.2 B	1.15 B	4.2 %	78×
FlashAttention‑2 (256‑core)	3.4 B	3.2 B	5.9 %	115×
Graph Traversal (1024‑core)	9.8 B	9.5 B	6.1 %	92×

Latency‑Sensitive SPM accesses dominate stall cycles in all workloads; the model captures these stalls within a few cycles of RTL.
Interconnect contention is the primary source of simulation error; however, the error never exceeds 7 % across all tested scenarios.
The profiling data uncovered up to 30 % idle time in certain cores due to mismatched traffic patterns, prompting the case‑study optimizations.
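
For readers who want to recompute the table's metrics, the sketch below assumes the Error column is the relative difference in cycle counts and the Speed‑up column is a ratio of wall‑clock simulation times. Neither definition is stated explicitly in this summary, and the function names are hypothetical. The table's cycle counts are rounded, so recomputing from them will not necessarily reproduce the reported percentages.

```python
def relative_error(rtl_cycles, model_cycles):
    """Relative cycle-count error of the model against the RTL reference."""
    return abs(model_cycles - rtl_cycles) / rtl_cycles


def speedup(rtl_seconds, model_seconds):
    """Wall-clock speed-up of the model over RTL simulation."""
    return rtl_seconds / model_seconds


# Matrix-Mul row, using the rounded counts from the table (1.2 B vs 1.15 B).
err = relative_error(1.2e9, 1.15e9)
print(f"{err:.1%}")
```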

Practical Implications

Faster Design Iterations – Chip architects can explore NoC topologies, bank‑mapping strategies, or SPM sizing in hours instead of weeks, dramatically shortening the hardware‑software co‑design loop.
Software‑First Optimization – Developers can run realistic end‑to‑end simulations of AI kernels (e.g., FlashAttention‑2) on a virtual many‑core platform, identify hidden latency hotspots, and apply compiler or algorithmic tweaks before silicon is taped out.
Predictable Performance Estimates – Because the model retains cycle‑accurate latency for the parts that matter, performance‑critical SLAs (e.g., real‑time inference latency) can be estimated early in the product roadmap, to within the model's reported error of under 7 %.
Tool Integration – The framework can be wrapped as a plug‑in for existing compiler toolchains (LLVM, TVM) to automatically feed back interconnect‑aware scheduling decisions.

Limitations & Future Work

Abstraction Scope – Power, thermal, and detailed micro‑architectural effects (e.g., branch predictor accuracy) are omitted, so the model cannot predict energy consumption or fine‑grained pipeline stalls.
Scalability Beyond 1024 Cores – While the authors demonstrate 1024 cores, the event‑driven engine’s memory footprint grows linearly with router count; further compression techniques may be needed for designs beyond 4k cores.
Dynamic Reconfiguration – The current framework assumes a static NoC topology; future extensions could model runtime router remapping or adaptive voltage/frequency scaling.
Broader Benchmark Suite – Validation is limited to a handful of AI and graph kernels; expanding to heterogeneous workloads (e.g., mixed FP16/INT8, streaming video) would strengthen confidence in general applicability.

Bottom line: By striking a pragmatic balance between speed and accuracy, this work gives developers and hardware teams a usable simulation platform for the next wave of many‑core AI accelerators, turning what used to be a months‑long, RTL‑only exercise into an agile, data‑driven design process.

Authors

Yinrong Li
Zexin Fu
Yichao Zhang
Germain Haugou
Chi Zhang
Marco Bertuletti
Bowen Wang
Luca Benini

Paper Information

arXiv ID: 2605.07750v1
Categories: cs.AR, cs.DC
Published: May 8, 2026
