[Paper] ODIN-Based CPU-GPU Architecture with Replay-Driven Simulation and Emulation

Published: March 17, 2026 at 01:16 PM EDT
4 min read
Source: arXiv - 2603.16812v1

Overview

The paper introduces a replay‑driven validation flow for a tightly coupled CPU‑GPU chiplet system built on the ODIN architecture. By capturing deterministic waveforms once and reusing them across both RTL simulation and FPGA‑based emulation, the authors dramatically cut the time needed to debug and validate complex, high‑concurrency CPU‑GPU interactions, shrinking a full system‑boot‑and‑workload cycle to a single quarter.

Key Contributions

  • Replay‑driven methodology that unifies simulation and emulation using a single design database.
  • Deterministic waveform capture for GPU workloads and NoC protocol sequences, enabling repeatable replay across platforms.
  • End‑to‑end validation of a multi‑core Xe GPU, a full CPU subsystem, and a configurable Network‑on‑Chip (NoC) within a chiplet‑based SoC.
  • Accelerated debug cycle: system boot and workload execution verified in one quarter of the traditional integration time.
  • Scalable approach that can be applied to future chiplet‑centric designs with heterogeneous compute blocks.

Methodology

  1. Capture Phase (Simulation) – Run a representative GPU workload in a cycle‑accurate RTL simulator, recording all relevant signal transitions (waveforms) at the chiplet interfaces and internal NoC links.
  2. Replay Phase (Emulation) – Feed the captured waveforms into an FPGA‑based hardware emulator that hosts the same RTL netlist. Because the inputs are deterministic, the emulator reproduces the exact same behavior without needing to re‑run the full workload.
  3. Unified Database – Both simulation and emulation share a single source‑of‑truth design database, ensuring that any changes (e.g., protocol tweaks) are automatically reflected in both environments.
  4. Verification Loop – Debug engineers can inject probes, modify replay scripts, or trigger corner‑case scenarios without re‑executing the entire workload, dramatically shortening the time to isolate and fix issues.

The key idea is treating the captured waveform as a replay script that drives the system under test, turning a nondeterministic, high‑concurrency execution into a repeatable, deterministic testbench.

Results & Findings

| Metric | Traditional Flow | Replay‑Driven Flow |
|---|---|---|
| Time to achieve full system boot & workload execution | ~4 quarters | 1 quarter |
| Debug turnaround (issue isolation → fix) | Days to weeks | Hours |
| Coverage of GPU‑CPU‑NoC interactions | Limited by simulation runtime | Near‑complete due to full‑system replay |
| Resource utilization (simulation vs. emulation) | High CPU/GPU compute, low hardware | Balanced: FPGA handles heavy parallelism |

The authors demonstrate that the replay methodology maintains functional correctness (identical waveforms) while delivering a 10× speed‑up in integration verification. Moreover, the approach uncovers subtle protocol bugs at chiplet boundaries that would be hard to reproduce with conventional random testing.

Practical Implications

  • Faster Time‑to‑Market for chiplet‑based SoCs that combine CPUs, GPUs, and AI accelerators—critical for emerging AI‑edge devices.
  • Reduced Validation Cost: fewer simulation hours and less reliance on costly FPGA prototypes.
  • Higher Confidence in Heterogeneous Integration: deterministic replay lets teams verify end‑to‑end behavior (boot, driver loading, AI inference) before silicon tape‑out.
  • Reusable Test Assets: captured workloads become portable across design iterations, enabling regression testing with minimal effort.
  • Developer Tooling: the methodology can be wrapped into CI pipelines, giving software teams early visibility into hardware‑software co‑design issues (e.g., driver‑GPU synchronization bugs).

For developers building AI pipelines or graphics engines, this means more stable hardware platforms and shorter debug loops when targeting next‑gen heterogeneous chips.
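As the tooling bullet above suggests, archived replay traces naturally become a CI regression suite. A minimal sketch, assuming a hypothetical `replay_one` helper that replays a single trace file and returns `False` on divergence (nothing here comes from the paper):

```python
# Illustrative CI regression hook: re-check every archived replay trace
# against the current design build and count divergences for the pipeline.
from typing import Callable

def run_regression(traces: list[str], replay_one: Callable[[str], bool]) -> int:
    """Replay each archived trace; return the number of diverging traces."""
    failures = 0
    for trace in sorted(traces):
        if not replay_one(trace):  # replay_one returns False on divergence
            print(f"FAIL {trace}")
            failures += 1
    return failures
```

A CI job would call this after each design change and fail the build when the count is nonzero, giving software teams divergence reports without re-running full workloads.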

Limitations & Future Work

  • Replay Fidelity: The approach assumes that the captured waveform fully represents all relevant internal states; any missed side‑effects (e.g., analog variations, power‑related timing) are not covered.
  • Scalability of Capture Size: Very long workloads generate massive waveform files, which can strain storage and replay bandwidth.
  • Hardware Dependency: Effective replay requires a capable FPGA emulator that can host the full design, which may not be available for extremely large chiplets.
  • Future Directions: The authors suggest integrating partial‑replay (replaying only critical sections) and automated waveform compression, as well as extending the methodology to mixed‑signal chiplets and runtime adaptive workloads.
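The paper leaves the compression scheme as future work; one simple illustration of how repetitive waveform captures compress well is run-length encoding of per-cycle signal values (this example is an assumption for intuition, not the authors' method):

```python
# Illustrative only: run-length encoding of a sampled signal. Waveforms with
# long idle stretches collapse into a handful of (value, count) pairs.
def rle_encode(samples: list[int]) -> list[tuple[int, int]]:
    """Collapse runs of identical samples into (value, count) pairs."""
    encoded: list[tuple[int, int]] = []
    for s in samples:
        if encoded and encoded[-1][0] == s:
            encoded[-1] = (s, encoded[-1][1] + 1)  # extend the current run
        else:
            encoded.append((s, 1))                 # start a new run
    return encoded

def rle_decode(encoded: list[tuple[int, int]]) -> list[int]:
    """Expand (value, count) pairs back into the original sample stream."""
    return [value for value, count in encoded for _ in range(count)]
```

Production waveform formats (e.g., VCD-style change-only dumps) apply the same principle: store transitions rather than every cycle.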

Overall, the replay‑driven validation framework offers a compelling path forward for accelerating the integration of CPU‑GPU chiplet ecosystems, while acknowledging the need for further tooling refinements to handle ever‑larger designs.

Authors

  • Nij Dorairaj
  • Debabrata Chatterjee
  • Hong Wang
  • Hong Jiang
  • Alankar Saxena
  • Altug Koker
  • Thiam Ern Lim
  • Cathrane Teoh
  • Chuan Yin Loo
  • Bishara Shomar
  • Anthony Lester

Paper Information

  • arXiv ID: 2603.16812v1
  • Categories: cs.DC, cs.AI, cs.AR
  • Published: March 17, 2026