[Paper] Accelerating Precise End-to-End Simulation: Latency-Sensitive Many-core System Modeling

Published: May 8, 2026 at 09:56 AM EDT
Source: arXiv - 2605.07750v1

Overview

The paper tackles a growing bottleneck for developers building large‑scale AI accelerators: how to simulate many‑core systems fast enough to be useful while still keeping timing accuracy. By introducing a hybrid modeling technique that faithfully reproduces latency‑sensitive scratch‑pad memory (SPM) accesses across massive interconnects (e.g., a 1024‑core “TeraNoC”), the authors obtain simulations that run up to 115× faster than traditional cycle‑accurate RTL simulation, with less than 7 % cycle‑count error. This makes early‑stage software‑hardware co‑design practical for the next generation of AI chips.

Key Contributions

  • Hybrid End‑to‑End Modeling Framework – combines a lightweight abstracted hardware model with precise latency‑sensitive timing for SPM accesses and NoC traffic.
  • Scalable Accuracy – validates the model against a full RTL golden reference on matrix‑multiplication, FlashAttention‑2, and graph‑traversal benchmarks at 64 to 1024 cores, staying within a 7 % error margin.
  • Massive Speedup – delivers up to 115× faster simulation, turning weeks of RTL runs into hours.
  • Fine‑Grained Profiling Infrastructure – automatically extracts per‑core, per‑router, and per‑memory‑access statistics for software optimization.
  • Two Real‑World Case Studies
    1. FlashAttention‑2 optimization – identifies and eliminates interconnect stalls, cutting synchronization overhead.
    2. NoC router‑remapping exploration – shows how traffic‑aware remapping improves throughput and balances load.

Methodology

  1. System Partitioning – The many‑core chip is split into three logical layers:

    • Processing Elements (PEs) – modeled as instruction‑level simulators that issue memory requests.
    • Latency‑Sensitive Interconnect – a cycle‑accurate “traffic engine” that tracks packet latency, contention, and arbitration on each hop.
    • Scratch‑Pad Memory (SPM) Subsystem – abstracted as a set of addressable banks with configurable access latency, but still responsive to contention from multiple routers.

  2. Selective Detail Preservation – Non‑critical hardware (e.g., pipeline micro‑architectural nuances, power gating) is abstracted away, while any component that can cause observable latency spikes (SPM arbitration, router queuing) is kept cycle‑accurate.

  3. Event‑Driven Simulation Core – Requests from PEs are injected into the interconnect engine; each event carries a timestamp, allowing the simulator to advance time only when needed (i.e., time‑skipping).

  4. Calibration & Validation – The model’s parameters (router pipeline depth, SPM bank arbitration latency, etc.) are tuned against a full RTL simulation of a smaller prototype. The calibrated model is then applied to the full 1024‑core configuration.

  5. Profiling Hooks – Instrumentation points automatically log:

    • Per‑core stall cycles,
    • Router queue lengths,
    • SPM bank utilization,
    • End‑to‑end request latency.
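
The event‑driven core, the calibration step, and the profiling hooks listed above can be illustrated with a toy sketch. It is not the authors' simulator: the names `simulate` and `calibrate`, the request tuple layout, the modulo bank mapping, and every latency value are assumptions made for illustration. NoC traversal is reduced to a fixed per‑hop delay, and contention is modeled only at the SPM banks.

```python
import heapq
from collections import defaultdict


def simulate(requests, hop_latency=1, bank_latency=2, num_banks=4):
    """Toy event-driven run over SPM requests (illustrative only).

    requests: iterable of (issue_cycle, core_id, hops, address) tuples,
    a hypothetical layout. Returns (finish_cycle, stats).
    """
    events = []  # min-heap ordered by arrival time at the SPM bank
    for seq, (issue, core, hops, addr) in enumerate(requests):
        arrival = issue + hops * hop_latency  # fixed per-hop delay, no NoC contention
        bank = addr % num_banks               # assumed interleaved bank mapping
        heapq.heappush(events, (arrival, seq, core, issue, bank))

    bank_free_at = [0] * num_banks   # cycle at which each bank is next idle
    stall_cycles = defaultdict(int)  # per-core cycles spent waiting on a busy bank
    latencies = []                   # end-to-end latency per request, in service order
    finish = 0
    while events:
        # Time-skipping: the clock jumps straight to the next event's timestamp.
        arrival, _, core, issue, bank = heapq.heappop(events)
        start = max(arrival, bank_free_at[bank])  # wait if the bank is still busy
        stall_cycles[core] += start - arrival
        done = start + bank_latency
        bank_free_at[bank] = done
        latencies.append(done - issue)
        finish = max(finish, done)
    stats = {"stall_cycles": dict(stall_cycles), "latencies": latencies}
    return finish, stats


def calibrate(requests, reference_cycles, candidates):
    """Pick the bank latency whose simulated finish cycle is closest to a reference."""
    return min(
        candidates,
        key=lambda lat: abs(simulate(requests, bank_latency=lat)[0] - reference_cycles),
    )


# Hypothetical workload: cores 0 and 1 collide on bank 0 at cycle 2.
requests = [(0, 0, 2, 0), (0, 1, 2, 4), (1, 2, 3, 1), (10, 0, 1, 0)]
finish, stats = simulate(requests)
print(finish, stats["stall_cycles"], stats["latencies"])

# The reference value stands in for a cycle count taken from an RTL run.
best = calibrate(requests, reference_cycles=15, candidates=range(1, 6))
print("calibrated bank latency:", best)
```

Keying the queue on timestamps means idle cycles cost nothing to simulate, which is the intuition behind time‑skipping. A fuller model would also generate new events during the run (router hops, responses) instead of scheduling everything up front.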

Results & Findings

| Benchmark | RTL Cycle Count | Model Cycle Count | Error | Speed‑up |
|---|---|---|---|---|
| Matrix‑Mul (64‑core) | 1.2 B | 1.15 B | 4.2 % | 78× |
| FlashAttention‑2 (256‑core) | 3.4 B | 3.2 B | 5.9 % | 115× |
| Graph Traversal (1024‑core) | 9.8 B | 9.5 B | 6.1 % | 92× |

  • Latency‑sensitive SPM accesses dominate stall cycles in all workloads; the model captures these stalls within a few cycles of RTL.
  • Interconnect contention is the primary source of simulation error; however, the error never exceeds 7 % across all tested scenarios.
  • The profiling data uncovered up to 30 % idle time in certain cores due to mismatched traffic patterns, prompting the case‑study optimizations.
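
As a quick check on how the reported figures relate to each other, the sketch below recomputes a relative cycle‑count error from the table's rounded counts and converts a speed‑up into wall‑clock time. The error definition (absolute difference divided by the RTL count), the helper names, and the two‑week RTL run are assumptions; the summary does not state how the Error column was computed.

```python
def relative_error(rtl_cycles, model_cycles):
    """Assumed definition: |model - RTL| / RTL."""
    return abs(model_cycles - rtl_cycles) / rtl_cycles


def accelerated_hours(rtl_hours, speedup):
    """Wall-clock hours for the fast model, given RTL hours and a speed-up factor."""
    return rtl_hours / speedup


# (RTL cycles, model cycles), taken from the rounded values in the table.
table = {
    "Matrix-Mul (64-core)": (1.2e9, 1.15e9),
    "FlashAttention-2 (256-core)": (3.4e9, 3.2e9),
    "Graph Traversal (1024-core)": (9.8e9, 9.5e9),
}
errors = {
    name: round(100 * relative_error(rtl, model), 1)
    for name, (rtl, model) in table.items()
}
print(errors)

# Hypothetical two-week RTL run at the best reported speed-up of 115x.
print(round(accelerated_hours(14 * 24, 115), 1), "hours")
```

Under this definition the first two rows reproduce the reported 4.2 % and 5.9 %. The Graph Traversal row evaluates to about 3.1 % instead of the reported 6.1 %, so that figure may rest on unrounded counts or a different error definition. The second calculation shows how a 115× speed‑up brings a two‑week run down to roughly three hours.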

Practical Implications

  • Faster Design Iterations – Chip architects can explore NoC topologies, bank‑mapping strategies, or SPM sizing in hours instead of weeks, dramatically shortening the hardware‑software co‑design loop.
  • Software‑First Optimization – Developers can run realistic end‑to‑end simulations of AI kernels (e.g., FlashAttention‑2) on a virtual many‑core platform, identify hidden latency hotspots, and apply compiler or algorithmic tweaks before silicon is taped out.
  • Predictable Performance Estimates – Because the model retains cycle‑accurate latency for the parts that matter, performance‑critical targets (e.g., real‑time inference latency) can be estimated early in the product roadmap, to within the model's error margin (under 7 % in the reported benchmarks).
  • Tool Integration – The framework can be wrapped as a plug‑in for existing compiler toolchains (LLVM, TVM) to automatically feed back interconnect‑aware scheduling decisions.

Limitations & Future Work

  • Abstraction Scope – Power, thermal, and detailed micro‑architectural effects (e.g., branch predictor accuracy) are omitted, so the model cannot predict energy consumption or fine‑grained pipeline stalls.
  • Scalability to > 1k Cores – While the authors demonstrate 1024 cores, the event‑driven engine’s memory footprint grows linearly with router count; further compression techniques may be needed for > 4k‑core designs.
  • Dynamic Reconfiguration – The current framework assumes a static NoC topology; future extensions could model runtime router remapping or adaptive voltage/frequency scaling.
  • Broader Benchmark Suite – Validation is limited to a handful of AI and graph kernels; expanding to heterogeneous workloads (e.g., mixed FP16/INT8, streaming video) would strengthen confidence in general applicability.

Bottom line: By striking a pragmatic balance between speed and accuracy, this work gives developers and hardware teams a usable simulation platform for the next wave of many‑core AI accelerators, turning what used to be a weeks‑long, RTL‑only exercise into an agile, data‑driven design process.

Authors

  • Yinrong Li
  • Zexin Fu
  • Yichao Zhang
  • Germain Haugou
  • Chi Zhang
  • Marco Bertuletti
  • Bowen Wang
  • Luca Benini

Paper Information

  • arXiv ID: 2605.07750v1
  • Categories: cs.AR, cs.DC
  • Published: May 8, 2026