[Paper] Revati: Transparent GPU-Free Time-Warp Emulation for LLM Serving

Published: January 1, 2026 at 12:19 PM EST

Source: arXiv - 2601.00397v1

Overview

Deploying large language models (LLMs) at scale is a costly and time‑consuming exercise because each serving configuration (batch size, tensor parallelism, request routing, etc.) must be benchmarked on real GPU hardware. Revati tackles this bottleneck by introducing a GPU‑free time‑warp emulator that runs the actual serving code (e.g., vLLM, SGLang) at simulation speed. By intercepting CUDA calls and “fast‑forwarding” virtual time instead of launching real kernels, Revati delivers accurate performance predictions while cutting evaluation time by an order of magnitude.

Key Contributions

  • Transparent GPU virtualization: Intercepts CUDA API calls and emulates device management without requiring any physical GPU.
  • Time‑warp kernel emulation: Predicts kernel execution time and advances virtual time instantly, preserving the original control flow of the serving framework.
  • Causality‑preserving coordination protocol: Synchronizes time jumps across distributed processes, ensuring correct ordering of events in multi‑node serving setups.
  • High fidelity: Achieves < 5 % prediction error across a variety of LLMs (e.g., LLaMA‑7B, 13B) and parallelism strategies.
  • Significant speedup: Runs 5–17× faster than real‑GPU execution, dramatically reducing the cost of configuration search.

Methodology

  1. CUDA Interception Layer – Revati injects a thin wrapper around the CUDA runtime. Every call that would normally allocate memory, launch a kernel, or query device status is captured (the first sketch after this list illustrates the interception‑plus‑time‑warp flow).
  2. Kernel Duration Modeling – For each distinct kernel (identified by its launch parameters), Revati maintains a lightweight statistical model (e.g., linear regression on input size) that predicts its runtime on the target GPU; the second sketch below pictures this calibration step.
  3. Time‑warp Execution – Instead of dispatching the kernel to a GPU, Revati instantly increments a virtual clock by the predicted duration. The serving code sees the same API responses it would on real hardware, but the underlying computation is skipped.
  4. Distributed Coordination – In multi‑node serving, processes exchange time‑warp messages that announce upcoming jumps. A simple two‑phase commit ensures that all nodes agree on the new virtual time before proceeding, preventing causality violations (third sketch below).
  5. Validation Loop – The authors calibrated the kernel models using a small set of real‑GPU runs, then evaluated Revati on full serving stacks (vLLM, SGLang) across multiple models and parallelism configurations.
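
To make steps 1 and 3 concrete, here is a minimal Python sketch of the interception‑plus‑virtual‑clock flow. This is not Revati's implementation (Revati hooks the real CUDA runtime in process); the class names, the `launch_kernel`/`synchronize` surface, and the per‑kernel costs are all illustrative assumptions.

```python
# Illustrative sketch, not Revati's code: kernel "launches" become instant
# jumps of a virtual clock, so the serving code's control flow is preserved
# while the underlying computation is skipped.


class VirtualClock:
    """Tracks emulated device time, advanced instantly instead of waited on."""

    def __init__(self) -> None:
        self.now_us = 0.0  # virtual time in microseconds

    def advance(self, duration_us: float) -> None:
        self.now_us += duration_us  # instant jump: no real work is done


class TimeWarpRuntime:
    """Stands in for the CUDA runtime: launches become clock increments."""

    def __init__(self, predict_duration_us) -> None:
        self.clock = VirtualClock()
        self.predict = predict_duration_us  # (kernel, params) -> predicted us

    def launch_kernel(self, kernel_name: str, **launch_params) -> None:
        # A real launch would enqueue work on the GPU; here we just
        # fast-forward virtual time by the model's predicted duration.
        self.clock.advance(self.predict(kernel_name, launch_params))

    def synchronize(self) -> float:
        # The caller sees the same API surface; it observes elapsed
        # virtual time rather than waiting on real kernels.
        return self.clock.now_us


if __name__ == "__main__":
    rt = TimeWarpRuntime(lambda name, params: 120.0 if name == "gemm" else 15.0)
    for _ in range(8):
        rt.launch_kernel("gemm", m=4096, n=4096, k=4096)
        rt.launch_kernel("softmax", rows=4096)
    print(f"virtual elapsed: {rt.synchronize():.1f} us")  # 1080.0 us, instantly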
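Step 2 can be pictured as one tiny regression per kernel, calibrated from a few profiled real‑GPU runs. The sketch below is an assumption‑laden stand‑in for the paper's "lightweight statistical model"; the feature choice (raw input size) and all numbers are invented for illustration.

```python
# Illustrative per-kernel duration model: fit one linear model
# (runtime vs. input size) per kernel from a handful of profiled samples.

import numpy as np


class KernelDurationModel:
    def __init__(self) -> None:
        self.coeffs: dict[str, tuple[float, float]] = {}  # kernel -> (slope, intercept)

    def calibrate(self, kernel: str, sizes: list[float], runtimes_us: list[float]) -> None:
        # Least-squares fit of runtime against input size.
        slope, intercept = np.polyfit(sizes, runtimes_us, deg=1)
        self.coeffs[kernel] = (slope, intercept)

    def predict_us(self, kernel: str, size: float) -> float:
        slope, intercept = self.coeffs[kernel]
        return slope * size + intercept


if __name__ == "__main__":
    model = KernelDurationModel()
    # Calibration samples from a few real-GPU runs (made-up numbers).
    model.calibrate("gemm", sizes=[1e6, 4e6, 16e6], runtimes_us=[55.0, 190.0, 740.0])
    print(f"predicted gemm runtime: {model.predict_us('gemm', 8e6):.1f} us")
```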
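Step 4's two‑phase agreement on a time jump can be simulated in a single process. The sketch below only illustrates the causality check (no worker may be asked to move its clock backwards); the real protocol runs over the network between serving processes, and every name here is hypothetical.

```python
# Illustrative two-phase commit on a virtual-time jump: all workers must
# vote yes (phase 1) before any of them applies the jump (phase 2), so no
# worker ever observes an event "from the future".


class Worker:
    def __init__(self, name: str) -> None:
        self.name = name
        self.virtual_us = 0.0

    def prepare(self, target_us: float) -> bool:
        # Phase 1: vote yes only for forward jumps (causality check).
        return target_us >= self.virtual_us

    def commit(self, target_us: float) -> None:
        # Phase 2: apply the agreed jump.
        self.virtual_us = target_us


def coordinate_jump(workers: list[Worker], target_us: float) -> bool:
    # Abort the whole jump if any worker would have to move backwards.
    if not all(w.prepare(target_us) for w in workers):
        return False
    for w in workers:
        w.commit(target_us)
    return True


if __name__ == "__main__":
    cluster = [Worker(f"rank{i}") for i in range(4)]
    assert coordinate_jump(cluster, 500.0)      # all ranks advance to 500 us
    assert not coordinate_jump(cluster, 100.0)  # backward jump rejected
    print([w.virtual_us for w in cluster])      # [500.0, 500.0, 500.0, 500.0]
```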

Results & Findings

Scenario                                          Prediction Error   Speedup vs. Real GPU
vLLM, LLaMA‑7B, 8‑way tensor parallelism          3.8 %              12×
SGLang, LLaMA‑13B, 4‑way pipeline parallelism     4.5 %
Mixed batch sizes, varying request rates          ≤ 5 %              5–17×

  • Accuracy: Across all tested setups, Revati’s latency and throughput estimates stayed within 5 % of the ground‑truth measurements.
  • Scalability: The coordination protocol added negligible overhead (< 1 % of total runtime) even when emulating 64 distributed workers.
  • Robustness: The emulator handled dynamic workload changes (e.g., sudden spikes in request arrival) without breaking causality.

Practical Implications

  • Rapid configuration search: Teams can explore hundreds of batch‑size / parallelism combos in minutes rather than hours, dramatically shortening the “performance tuning” cycle.
  • Cost reduction: Eliminating the need for large GPU clusters during the testing phase saves thousands of dollars per model iteration.
  • CI/CD integration: Revati can be plugged into continuous integration pipelines to automatically validate that a new serving code change does not degrade latency or throughput (a toy version of such a gate is sketched after this list).
  • Hardware‑agnostic profiling: Since the emulator predicts runtime based on a model of the target GPU, developers can evaluate how a serving stack would behave on future hardware generations without waiting for physical access.
  • Educational tool: New engineers can experiment with low‑level serving internals (memory allocation, kernel launch patterns) without needing expensive GPUs.
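
As a toy illustration of the CI/CD point above, the pytest‑style check below replays a fixed workload through a trivial emulator stub and fails the build if predicted latency exceeds a budget. Every name and number here is invented; Revati's actual entry points are not shown in the summary.

```python
# Hypothetical CI gate: fail the build if the emulated serving run predicts
# a latency regression, before any GPU is touched. All names are invented.

LATENCY_BUDGET_US = 2_000.0


class EmulatorStub:
    """Minimal stand-in: launches advance a virtual clock, nothing runs."""

    def __init__(self) -> None:
        self.virtual_us = 0.0

    def launch(self, predicted_us: float) -> None:
        self.virtual_us += predicted_us


def predicted_decode_latency_us(num_tokens: int) -> float:
    emu = EmulatorStub()
    for _ in range(num_tokens):
        emu.launch(120.0)  # attention/GEMM kernel (made-up cost)
        emu.launch(15.0)   # softmax kernel (made-up cost)
    return emu.virtual_us


def test_latency_budget() -> None:
    # A serving-code change that inflates predicted latency past the
    # budget fails CI without requiring GPU hardware.
    assert predicted_decode_latency_us(num_tokens=8) <= LATENCY_BUDGET_US


if __name__ == "__main__":
    test_latency_budget()
    print("latency budget check passed")
```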

Limitations & Future Work

  • Model‑driven kernel timing: Accuracy hinges on the quality of the kernel duration models; exotic kernels or new GPU architectures may require re‑training.
  • No memory bandwidth effects: Revati abstracts away actual data movement, so it cannot capture contention or out‑of‑memory scenarios that would arise on real hardware.
  • Limited to CUDA: The current prototype works only with NVIDIA’s CUDA stack; extending to AMD or Intel GPUs would need additional interception layers.
  • Future directions: The authors plan to incorporate memory‑traffic modeling, support for mixed‑precision kernels, and a plug‑in system for custom hardware simulators (e.g., TPUs).

Revati demonstrates that you don't need a full‑blown GPU farm to get trustworthy performance numbers for LLM serving. By marrying transparent CUDA interception with a lightweight time‑warp engine, it opens the door to faster, cheaper, and more iterative deployment pipelines, something every AI‑focused development team can benefit from.

Authors

  • Amey Agrawal
  • Mayank Yadav
  • Sukrit Kumar
  • Anirudha Agrawal
  • Garv Ghai
  • Souradeep Bera
  • Elton Pinto
  • Sirish Gambhira
  • Mohammad Adain
  • Kasra Sohrab
  • Chus Antonanzas
  • Alexey Tumanov

Paper Information

  • arXiv ID: 2601.00397v1
  • Categories: cs.DC, cs.LG
  • Published: January 1, 2026