[Paper] ODIN-Based CPU-GPU Architecture with Replay-Driven Simulation and Emulation

Published: March 17, 2026 at 01:16 PM EDT
4 min read
Source: arXiv - 2603.16812v1

Overview

The paper introduces a replay‑driven validation flow for a tightly coupled CPU‑GPU chiplet system built on the ODIN architecture. By capturing deterministic waveforms once and reusing them across both RTL simulation and FPGA‑based emulation, the authors dramatically cut the time needed to debug and validate complex, high‑concurrency CPU‑GPU interactions, shrinking a full system‑boot‑and‑workload cycle to a single quarter.

Key Contributions

  • Replay‑driven methodology that unifies simulation and emulation using a single design database.
  • Deterministic waveform capture for GPU workloads and NoC protocol sequences, enabling repeatable replay across platforms.
  • End‑to‑end validation of a multi‑core Xe GPU, a full CPU subsystem, and a configurable Network‑on‑Chip (NoC) within a chiplet‑based SoC.
  • Accelerated debug cycle: system boot and workload execution verified in one quarter of the traditional integration time.
  • Scalable approach that can be applied to future chiplet‑centric designs with heterogeneous compute blocks.

Methodology

  1. Capture Phase (Simulation) – Run a representative GPU workload in a cycle‑accurate RTL simulator, recording all relevant signal transitions (waveforms) at the chiplet interfaces and internal NoC links.
  2. Replay Phase (Emulation) – Feed the captured waveforms into an FPGA‑based hardware emulator that hosts the same RTL netlist. Because the inputs are deterministic, the emulator reproduces the exact same behavior without needing to re‑run the full workload.
  3. Unified Database – Both simulation and emulation share a single source‑of‑truth design database, ensuring that any changes (e.g., protocol tweaks) are automatically reflected in both environments.
  4. Verification Loop – Debug engineers can inject probes, modify replay scripts, or trigger corner‑case scenarios without re‑executing the entire workload, dramatically shortening the time to isolate and fix issues.

The key idea is treating the captured waveform as a replay script that drives the system under test, turning a nondeterministic, high‑concurrency execution into a repeatable, deterministic testbench.

Results & Findings

| Metric | Traditional Flow | Replay‑Driven Flow |
|---|---|---|
| Time to achieve full system boot & workload execution | ~4 quarters | 1 quarter |
| Debug turnaround (issue isolation → fix) | Days to weeks | Hours |
| Coverage of GPU‑CPU‑NoC interactions | Limited by simulation runtime | Near‑complete due to full‑system replay |
| Resource utilization (simulation vs. emulation) | High CPU/GPU compute, low hardware | Balanced: FPGA handles heavy parallelism |

The authors demonstrate that the replay methodology maintains functional correctness (identical waveforms) while delivering a 10× speed‑up in integration verification. Moreover, the approach uncovers subtle protocol bugs at chiplet boundaries that would be hard to reproduce with conventional random testing.

Practical Implications

  • Faster Time‑to‑Market for chiplet‑based SoCs that combine CPUs, GPUs, and AI accelerators—critical for emerging AI‑edge devices.
  • Reduced Validation Cost: fewer simulation hours and less reliance on costly FPGA prototypes.
  • Higher Confidence in Heterogeneous Integration: deterministic replay lets teams verify end‑to‑end behavior (boot, driver loading, AI inference) before silicon tape‑out.
  • Reusable Test Assets: captured workloads become portable across design iterations, enabling regression testing with minimal effort.
  • Developer Tooling: the methodology can be wrapped into CI pipelines, giving software teams early visibility into hardware‑software co‑design issues (e.g., driver‑GPU synchronization bugs).

For developers building AI pipelines or graphics engines, this means more stable hardware platforms and shorter debug loops when targeting next‑gen heterogeneous chips.
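As the tooling bullet above suggests, archived replay traces naturally become a CI regression suite. A minimal sketch, assuming a hypothetical `replay_one` helper that replays a single trace file and returns `False` on divergence (nothing here comes from the paper):

```python
# Illustrative CI regression hook: re-check every archived replay trace
# against the current design build and count divergences for the pipeline.
from typing import Callable

def run_regression(traces: list[str], replay_one: Callable[[str], bool]) -> int:
    """Replay each archived trace; return the number of diverging traces."""
    failures = 0
    for trace in sorted(traces):
        if not replay_one(trace):  # replay_one returns False on divergence
            print(f"FAIL {trace}")
            failures += 1
    return failures
```

A CI job would call this after each design change and fail the build when the count is nonzero, giving software teams divergence reports without re-running full workloads.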

Limitations & Future Work

  • Replay Fidelity: The approach assumes that the captured waveform fully represents all relevant internal states; any missed side‑effects (e.g., analog variations, power‑related timing) are not covered.
  • Scalability of Capture Size: Very long workloads generate massive waveform files, which can strain storage and replay bandwidth.
  • Hardware Dependency: Effective replay requires a capable FPGA emulator that can host the full design, which may not be available for extremely large chiplets.
  • Future Directions: The authors suggest integrating partial‑replay (replaying only critical sections) and automated waveform compression, as well as extending the methodology to mixed‑signal chiplets and runtime adaptive workloads.
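The paper leaves the compression scheme as future work; one simple illustration of how repetitive waveform captures compress well is run-length encoding of per-cycle signal values (this example is an assumption for intuition, not the authors' method):

```python
# Illustrative only: run-length encoding of a sampled signal. Waveforms with
# long idle stretches collapse into a handful of (value, count) pairs.
def rle_encode(samples: list[int]) -> list[tuple[int, int]]:
    """Collapse runs of identical samples into (value, count) pairs."""
    encoded: list[tuple[int, int]] = []
    for s in samples:
        if encoded and encoded[-1][0] == s:
            encoded[-1] = (s, encoded[-1][1] + 1)  # extend the current run
        else:
            encoded.append((s, 1))                 # start a new run
    return encoded

def rle_decode(encoded: list[tuple[int, int]]) -> list[int]:
    """Expand (value, count) pairs back into the original sample stream."""
    return [value for value, count in encoded for _ in range(count)]
```

Production waveform formats (e.g., VCD-style change-only dumps) apply the same principle: store transitions rather than every cycle.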

Overall, the replay‑driven validation framework offers a compelling path forward for accelerating the integration of CPU‑GPU chiplet ecosystems, while acknowledging the need for further tooling refinements to handle ever‑larger designs.

Authors

  • Nij Dorairaj
  • Debabrata Chatterjee
  • Hong Wang
  • Hong Jiang
  • Alankar Saxena
  • Altug Koker
  • Thiam Ern Lim
  • Cathrane Teoh
  • Chuan Yin Loo
  • Bishara Shomar
  • Anthony Lester

Paper Information

  • arXiv ID: 2603.16812v1
  • Categories: cs.DC, cs.AI, cs.AR
  • Published: March 17, 2026