[Paper] Accelerating Precise End-to-End Simulation: Latency-Sensitive Many-core System Modeling
Source: arXiv - 2605.07750v1
Overview
The paper tackles a growing bottleneck for developers building large‑scale AI accelerators: simulating many‑core systems fast enough to be useful while preserving timing accuracy. The authors introduce a hybrid modeling technique that faithfully reproduces latency‑sensitive scratch‑pad memory (SPM) accesses across massive interconnects (e.g., a 1024‑core “TeraNoC”). It runs up to 115× faster than traditional cycle‑accurate RTL simulation with less than 7 % cycle‑count error, which makes early‑stage software‑hardware co‑design practical for the next generation of AI chips.
Key Contributions
- Hybrid End‑to‑End Modeling Framework – combines a lightweight abstracted hardware model with precise latency‑sensitive timing for SPM accesses and NoC traffic.
- Scalable Accuracy – validates the model against a full RTL golden reference on diverse benchmarks, staying within a 7 % error margin.
- Massive Speedup – delivers up to 115× faster simulation, turning weeks of RTL runs into hours.
- Fine‑Grained Profiling Infrastructure – automatically extracts per‑core, per‑router, and per‑memory‑access statistics for software optimization.
- Two Real‑World Case Studies
  - FlashAttention‑2 optimization – identifies and eliminates interconnect stalls, cutting synchronization overhead.
  - NoC router‑remapping exploration – shows how traffic‑aware remapping improves throughput and balances load.
Methodology
- System Partitioning – The many‑core chip is split into three logical layers:
  - Processing Elements (PEs) – modeled as instruction‑level simulators that issue memory requests.
  - Latency‑Sensitive Interconnect – a cycle‑accurate “traffic engine” that tracks packet latency, contention, and arbitration on each hop.
  - Scratch‑Pad Memory (SPM) Subsystem – abstracted as a set of addressable banks with configurable access latency, but still responsive to contention from multiple routers.
- Selective Detail Preservation – Non‑critical hardware (e.g., pipeline micro‑architectural nuances, power gating) is abstracted away, while any component that can cause observable latency spikes (SPM arbitration, router queuing) is kept cycle‑accurate.
- Event‑Driven Simulation Core – Requests from PEs are injected into the interconnect engine; each event carries a timestamp, allowing the simulator to advance time only when needed (i.e., time‑skipping).
- Calibration & Validation – The model’s parameters (router pipeline depth, SPM bank arbitration latency, etc.) are tuned against a full RTL simulation of a smaller prototype. The calibrated model is then applied to the full 1024‑core configuration.
- Profiling Hooks – Instrumentation points automatically log:
  - Per‑core stall cycles
  - Router queue lengths
  - SPM bank utilization
  - End‑to‑end request latency
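The event‑driven core and the profiling hooks listed above can be illustrated with a toy discrete‑event loop. This is a minimal sketch, not the authors' implementation: the names `simulate`, `HOP_LATENCY`, and `BANK_LATENCY` and their values are hypothetical, router contention is reduced to a fixed per‑hop delay (the paper keeps router queuing cycle‑accurate), and only SPM bank contention is modeled.

```python
import heapq
from collections import defaultdict

# Hypothetical parameters for illustration only; not values from the paper.
HOP_LATENCY = 1   # cycles per router hop (no router contention modeled here)
BANK_LATENCY = 2  # cycles an SPM bank stays busy per access


def simulate(requests):
    """Run a toy event-driven simulation.

    requests: iterable of (issue_cycle, core_id, bank_id, hops).
    Returns (stall_cycles, bank_busy, latencies, final_cycle).
    """
    # Each event is timestamped with its arrival time at the SPM bank.
    # The sequence number breaks ties deterministically.
    events = [
        (issue + hops * HOP_LATENCY, seq, issue, core, bank, hops)
        for seq, (issue, core, bank, hops) in enumerate(requests)
    ]
    heapq.heapify(events)

    bank_free_at = defaultdict(int)  # cycle at which each bank is next free
    stall_cycles = defaultdict(int)  # profiling hook: per-core stall cycles
    bank_busy = defaultdict(int)     # profiling hook: busy cycles per bank
    latencies = []                   # profiling hook: end-to-end latency
    now = 0

    while events:
        arrive, _seq, issue, core, bank, hops = heapq.heappop(events)
        now = arrive  # time-skipping: jump straight to the next event
        start = max(arrive, bank_free_at[bank])  # wait if the bank is busy
        stall_cycles[core] += start - arrive
        bank_free_at[bank] = start + BANK_LATENCY
        bank_busy[bank] += BANK_LATENCY
        done = start + BANK_LATENCY + hops * HOP_LATENCY  # response returns
        latencies.append((core, done - issue))
        now = max(now, done)

    return dict(stall_cycles), dict(bank_busy), latencies, now


if __name__ == "__main__":
    # Two cores hit bank 0 at the same time; a third accesses bank 1 much later.
    reqs = [(0, 0, 0, 2), (0, 1, 0, 2), (100, 2, 1, 1)]
    stalls, busy, lats, end = simulate(reqs)
    print("stall cycles per core:", stalls)
    print("busy cycles per bank:", busy)
    print("end-to-end latencies:", lats)
    print("simulated cycles:", end, "events processed:", len(reqs))
```

In this example the loop handles three events to cover roughly a hundred simulated cycles, which is the effect time‑skipping is meant to achieve; a cycle‑stepped loop would have to visit every cycle in between.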
Results & Findings
| Benchmark | RTL Cycle Count | Model Cycle Count | Error | Speed‑up |
|---|---|---|---|---|
| Matrix‑Mul (64‑core) | 1.2 B | 1.15 B | 4.2 % | 78× |
| FlashAttention‑2 (256‑core) | 3.4 B | 3.2 B | 5.9 % | 115× |
| Graph Traversal (1024‑core) | 9.8 B | 9.5 B | 6.1 % | 92× |
- Latency‑Sensitive SPM accesses dominate stall cycles in all workloads; the model captures these stalls within a few cycles of RTL.
- Interconnect contention is the primary source of simulation error; however, the error never exceeds 7 % across all tested scenarios.
- The profiling data uncovered up to 30 % idle time in certain cores due to mismatched traffic patterns, prompting the case‑study optimizations.
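As a worked example of how an error figure of this kind can be derived, the sketch below assumes the error is the absolute cycle‑count difference relative to the RTL reference; the summary does not state the exact definition, and the function name `relative_error_pct` is hypothetical. It is applied to the rounded Matrix‑Mul counts from the table.

```python
def relative_error_pct(rtl_cycles, model_cycles):
    """Absolute cycle-count difference as a percentage of the RTL reference.

    Assumed definition; the source does not spell out how error is computed.
    """
    return abs(model_cycles - rtl_cycles) / rtl_cycles * 100.0


if __name__ == "__main__":
    # Matrix-Mul (64-core) row: 1.2 B RTL cycles vs 1.15 B model cycles.
    err = relative_error_pct(1.2e9, 1.15e9)
    print(f"Matrix-Mul relative error: {err:.1f} %")
```

Because the table reports rounded cycle counts, recomputed values of this kind should be read as approximate.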
Practical Implications
- Faster Design Iterations – Chip architects can explore NoC topologies, bank‑mapping strategies, or SPM sizing in hours instead of weeks, dramatically shortening the hardware‑software co‑design loop.
- Software‑First Optimization – Developers can run realistic end‑to‑end simulations of AI kernels (e.g., FlashAttention‑2) on a virtual many‑core platform, identify hidden latency hotspots, and apply compiler or algorithmic tweaks before silicon is taped out.
- Predictable Performance Guarantees – Because the model retains cycle‑accurate latency for the parts that matter, performance‑critical SLAs (e.g., real‑time inference latency) can be estimated early in the product roadmap, within the model’s reported error margin of under 7 %.
- Tool Integration – The framework can be wrapped as a plug‑in for existing compiler toolchains (LLVM, TVM) to automatically feed back interconnect‑aware scheduling decisions.
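To make the "hours instead of weeks" claim concrete, the arithmetic below converts a baseline RTL run time into the corresponding accelerated run time for the speed‑ups reported in the results table. The two‑week baseline is a hypothetical figure chosen for illustration, and `accelerated_hours` is not a name from the paper.

```python
def accelerated_hours(rtl_hours, speedup):
    """Wall-clock hours for the fast model, given RTL hours and a speed-up."""
    return rtl_hours / speedup


if __name__ == "__main__":
    baseline_hours = 14 * 24  # hypothetical two-week RTL run (336 hours)
    for speedup in (78, 92, 115):  # speed-ups from the results table
        hours = accelerated_hours(baseline_hours, speedup)
        print(f"{speedup}x speed-up: {hours:.1f} hours")
```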
Limitations & Future Work
- Abstraction Scope – Power, thermal, and detailed micro‑architectural effects (e.g., branch predictor accuracy) are omitted, so the model cannot predict energy consumption or fine‑grained pipeline stalls.
- Scalability Beyond 1024 Cores – While the authors demonstrate 1024 cores, the event‑driven engine’s memory footprint grows linearly with router count; further compression techniques may be needed for designs beyond 4k cores.
- Dynamic Reconfiguration – The current framework assumes a static NoC topology; future extensions could model runtime router remapping or adaptive voltage/frequency scaling.
- Broader Benchmark Suite – Validation is limited to a handful of AI and graph kernels; expanding to heterogeneous workloads (e.g., mixed FP16/INT8, streaming video) would strengthen confidence in general applicability.
Bottom line: By striking a pragmatic balance between speed and accuracy, this work gives developers and hardware teams a usable simulation platform for the next wave of many‑core AI accelerators, turning what used to be a weeks‑long, RTL‑only exercise into an agile, data‑driven design process.
Authors
- Yinrong Li
- Zexin Fu
- Yichao Zhang
- Germain Haugou
- Chi Zhang
- Marco Bertuletti
- Bowen Wang
- Luca Benini
Paper Information
- arXiv ID: 2605.07750v1
- Categories: cs.AR, cs.DC
- Published: May 8, 2026