[Paper] Efficient Remote Prefix Fetching with GPU-native Media ASICs
Source: arXiv - 2602.09725v1
Overview
Large language model (LLM) inference often reuses previously computed key-value (KV) caches to avoid redundant prefill work. While remote KV-cache reuse works well on fast networks, it stalls when bandwidth is limited. This paper introduces KVFetcher, a system that compresses KV caches with GPU-native video codecs, enabling fast, lossless transmission even over modest links.
Key Contributions
- Codec-friendly tensor layout – restructures KV tensors into a frame-like format that hardware video encoders can compress efficiently.
- Pipelined KV fetcher – orchestrates network transfer, GPU‑accelerated decoding, and cache restoration without resource contention, minimizing time‑to‑first‑token (TTFT).
- Portable implementation – the prototype runs on a range of NVIDIA GPUs (from data-center A100s to a consumer-grade RTX 3060) without requiring custom hardware.
- Empirical validation – demonstrates up to 3.51× TTFT reduction versus state‑of‑the‑art remote KV‑cache reuse methods while preserving exact model outputs.
Methodology
- Tensor Re-layout – KV caches (normally stored as separate key and value matrices per attention head) are interleaved and padded so that consecutive memory rows map to video macroblocks. This layout matches the expectations of the GPU's hardware video engines (NVENC for encoding, NVDEC for decoding); a minimal illustrative sketch follows after this list.
- GPU-Native Video Encoding – The re-laid-out tensor is fed directly into the GPU's built-in video codec, producing a compact bitstream (often less than a third of the original size) with virtually no CPU overhead.
- Network Transfer – The compressed stream is sent over the existing inter-GPU or Ethernet link. Because the payload is several times smaller, transfer latency and sensitivity to jitter are substantially reduced.
- Pipelined Decoding & Restoration – While the network is still delivering packets, the GPU simultaneously decodes chunks of the stream and reconstructs the original KV tensors in a double‑buffered pipeline. This hides decoding latency and avoids stalls caused by network variability.
- Lossless Verification – The authors verify that the decoded KV caches are bit‑identical to the original, guaranteeing that model inference results are unchanged.
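To make the re-layout and lossless round trip concrete, here is a minimal, illustrative sketch. It assumes per-layer KV tensors of shape (num_heads, seq_len, head_dim) in fp16, reinterprets the raw bytes as a grayscale "frame", and pads rows to a multiple of a 16-pixel macroblock; the frame width, padding scheme, and helper names (kv_to_frame, frame_to_kv) are assumptions made for this sketch rather than the paper's exact layout, and the encode/decode step through NVENC/NVDEC is omitted.

```python
import torch

MACROBLOCK = 16          # typical macroblock edge length (assumption)
FRAME_WIDTH = 1024       # arbitrary frame width chosen for this sketch

def kv_to_frame(kv: torch.Tensor):
    """Reinterpret an fp16 KV tensor as raw bytes and pad it into a 2D
    'frame' whose height is a multiple of the macroblock size."""
    raw = kv.contiguous().view(torch.uint8).flatten()      # fp16 -> bytes
    rows = -(-raw.numel() // FRAME_WIDTH)                   # ceil division
    rows += (-rows) % MACROBLOCK                            # pad to macroblocks
    frame = torch.zeros(rows * FRAME_WIDTH, dtype=torch.uint8)
    frame[: raw.numel()] = raw
    return frame.view(rows, FRAME_WIDTH), (kv.shape, raw.numel())

def frame_to_kv(frame: torch.Tensor, meta):
    """Invert kv_to_frame: drop the padding and reinterpret bytes as fp16."""
    shape, nbytes = meta
    raw = frame.flatten()[:nbytes]
    return raw.view(torch.float16).reshape(shape)

if __name__ == "__main__":
    kv = torch.randn(32, 1024, 128, dtype=torch.float16)    # (heads, seq, dim)
    frame, meta = kv_to_frame(kv)           # would be handed to the encoder
    restored = frame_to_kv(frame, meta)     # after decode on the receiver
    assert torch.equal(kv, restored)        # bit-identical round trip
```

In practice the padded frame would be handed to the GPU's hardware encoder; the assert here only demonstrates that the layout itself is invertible, which is the property the lossless-verification step relies on.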
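The double-buffered fetch-decode overlap can likewise be sketched as a small producer/consumer pipeline. Everything here is schematic: recv_chunk, decode_chunk, and restore_into_cache are hypothetical stand-ins for the network receive, hardware video decode, and cache write-back stages, not KVFetcher's actual interfaces.

```python
import queue
import threading

def fetch_chunks(link, chunk_queue):
    """Producer: read compressed chunks off the wire and hand them to the
    decode stage; None signals end of stream (assumed convention)."""
    while True:
        chunk = link.recv_chunk()
        chunk_queue.put(chunk)
        if chunk is None:
            break

def restore_kv(link, decode_chunk, restore_into_cache):
    """Consumer: decode chunk i while chunk i+1 is still in flight."""
    chunk_queue = queue.Queue(maxsize=2)          # double buffering
    producer = threading.Thread(target=fetch_chunks, args=(link, chunk_queue))
    producer.start()
    while True:
        chunk = chunk_queue.get()
        if chunk is None:
            break
        tensor_slice = decode_chunk(chunk)        # stand-in for NVDEC decode
        restore_into_cache(tensor_slice)          # write back into the KV cache
    producer.join()

if __name__ == "__main__":
    class FakeLink:
        """Simulates a link delivering four compressed chunks, then EOF."""
        def __init__(self):
            self.chunks = [b"chunk%d" % i for i in range(4)] + [None]
        def recv_chunk(self):
            return self.chunks.pop(0)

    restore_kv(FakeLink(),
               decode_chunk=lambda c: c.upper(),
               restore_into_cache=lambda t: print("restored", t))
```

Because the queue holds at most two chunks, the receiver rarely waits on decode and decode rarely waits on the network, which is the kind of overlap the paper's pipeline relies on to hide decoding latency.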
Results & Findings
| Scenario | Baseline TTFT (no compression) | Prior-SOTA TTFT | KVFetcher TTFT |
|---|---|---|---|
| 100 Gbps intra-rack | 12 ms | 8 ms | ≈3.4 ms (3.5× faster) |
| 10 Gbps Ethernet | 45 ms | 28 ms | ≈14 ms (3.1× faster) |
| 1 Gbps, low-end RTX 3060 | 78 ms | 55 ms | ≈27 ms (2.9× faster) |
- Compression ratios ranged from 3.2× to 5.6× depending on model size and sequence length.
- Decoding overhead stayed under 0.5 ms thanks to hardware acceleration.
- Accuracy remained identical (no measurable perplexity change) because the process is lossless.
These numbers confirm that KVFetcher can reclaim the benefits of remote KV reuse even when network bandwidth is a bottleneck.
Practical Implications
- LLM serving platforms (e.g., inference APIs, chatbots) can now shard KV caches across machines without fearing bandwidth penalties, enabling better load balancing and higher overall throughput.
- Edge deployments with limited uplink speeds can still participate in collaborative inference pipelines, off‑loading heavy computation while receiving compressed KV states quickly.
- Cost savings – By reducing the need for high‑end networking hardware, cloud providers can lower infrastructure expenses while maintaining low latency.
- Developer ergonomics – KVFetcher integrates with existing PyTorch/Transformers pipelines via a thin wrapper; beyond enabling the "remote cache" option, no application code changes are required (a purely hypothetical usage sketch follows below).
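As a purely hypothetical illustration of what such a thin wrapper might look like (the kvfetcher module, its wrap() call, and the remote_cache argument are invented for this sketch; the summary does not describe KVFetcher's actual API), the only visible change to a standard Transformers pipeline would be a single wrapping step:

```python
# Hypothetical integration sketch; kvfetcher.wrap() and remote_cache are
# placeholder names, not a published API.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"   # any causal LM serves as an example
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The imagined wrapper would intercept past_key_values, fetching any
# remotely cached prefix as a compressed bitstream before prefill runs:
# import kvfetcher
# model = kvfetcher.wrap(model, remote_cache="tcp://cache-node:9000")

prompt = "Explain why compressed KV-cache transfer lowers TTFT."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```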
Limitations & Future Work
- Hardware dependence – The approach relies on GPUs that expose hardware video encoders/decoders; older or non‑NVIDIA GPUs may not benefit.
- Fixed codec parameters – The current implementation uses a single preset for all model sizes; adaptive bitrate tuning could further improve performance under highly variable network conditions.
- Scalability to multi‑node clusters – Experiments focused on pairwise remote fetches; extending the pipeline to many‑to‑many cache sharing scenarios remains an open challenge.
The authors suggest exploring cross‑vendor codec standards and integrating dynamic compression policies as next steps.
Authors
- Liang Mi
- Weijun Wang
- Jinghan Chen
- Ting Cao
- Haipeng Dai
- Yunxin Liu
Paper Information
- arXiv ID: 2602.09725v1
- Categories: cs.DC
- Published: February 10, 2026