[Paper] DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

Published: April 29, 2026 at 07:44 AM EDT
5 min read

Source: arXiv - 2604.26557v1

Overview

Edge devices are increasingly asked to run large-language-model (LLM) inference, but their limited RAM makes the KV cache (the per-token memory structure that enables fast generation) a major bottleneck. DUAL-BLADE introduces a dual-path system that keeps KV tensors either in the OS page cache or directly on an NVMe SSD, switching on the fly according to how much memory is free. By bypassing the filesystem on the "direct" path and overlapping storage I/O with GPU DMA transfers, the framework cuts latency and boosts SSD utilization, making edge LLM deployment far more practical.

Key Contributions

  • Dual‑path KV residency: Dynamically routes KV tensors to either a traditional page‑cache path or an NVMe‑direct path, eliminating the one‑size‑fits‑all limitation of existing offloaders.
  • Filesystem-bypass design: Maps KV tensors to contiguous logical block addresses (LBAs) and accesses them with O_DIRECT-style I/O, removing kernel page-cache thrashing and reducing software overhead (a minimal sketch of this style of I/O follows this list).
  • Adaptive pipeline parallelism: Overlaps storage reads/writes with GPU DMA transfers, keeping the accelerator busy while the SSD works in the background.
  • Comprehensive evaluation: Demonstrates up to 33 % lower prefill latency, 42 % lower decode latency, and 2.2× higher SSD utilization across a range of host-memory budgets (4 GB to 16 GB).
  • Open‑source prototype: Provides a reference implementation that can be integrated with popular inference runtimes (e.g., vLLM, TensorRT‑LLM).
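
The summary above contains no code, but the filesystem-bypass idea can be illustrated with a minimal C sketch: open a preallocated region with O_DIRECT, keep the buffer, length, and offset aligned to the logical block size, and write one KV block at a block-indexed offset with pwritev2. The function name kv_direct_write, the 4 KiB alignment constant, and the region path are illustrative assumptions, not the authors' interface.

```c
/* Minimal sketch of an O_DIRECT-style write of one KV block into a
 * preallocated, contiguous on-SSD region. The function name, the 4 KiB
 * alignment, and the region path are illustrative assumptions. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define LBA_SIZE 4096            /* assumed logical block size */

/* Write `len` bytes of KV data at block index `lba` of the region. */
static ssize_t kv_direct_write(int fd, const void *kv, size_t len, off_t lba)
{
    size_t padded = (len + LBA_SIZE - 1) & ~(size_t)(LBA_SIZE - 1);
    void *buf = NULL;
    ssize_t n;

    /* O_DIRECT requires buffer, length, and file offset to be block-aligned. */
    if (posix_memalign(&buf, LBA_SIZE, padded) != 0)
        return -1;
    memset(buf, 0, padded);
    memcpy(buf, kv, len);

    struct iovec iov = { .iov_base = buf, .iov_len = padded };
    /* RWF_DSYNC makes the write durable without a separate fsync(). */
    n = pwritev2(fd, &iov, 1, lba * LBA_SIZE, RWF_DSYNC);

    free(buf);
    return n;
}

int main(void)
{
    /* Hypothetical preallocated file (or raw namespace) used as the region. */
    int fd = open("/var/lib/kvcache/region0", O_WRONLY | O_DIRECT);
    char kv[8192] = { 0 };       /* stand-in for one KV tensor block */

    if (fd < 0)
        return 1;
    kv_direct_write(fd, kv, sizeof kv, 42);
    close(fd);
    return 0;
}
```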

Methodology

  1. Runtime memory monitor – A lightweight daemon tracks free GPU and host memory. When appending a new token's KV entries would push memory use past a configurable threshold, the system decides to offload that tensor.
  2. Path selection logic
    • Page‑cache path: If enough RAM is available, KV tensors are stored as regular memory‑mapped files, benefiting from OS caching and low‑latency access.
    • NVMe-direct path: When memory is scarce, the tensor is written to a pre-allocated contiguous region on the SSD using pwritev2 with O_DIRECT. The region's LBA is recorded in a small in-memory index (a sketch of this placement decision appears after this list).
  3. Direct-access engine – A custom I/O scheduler issues asynchronous reads/writes directly to the NVMe driver, avoiding the VFS layer. The engine batches requests to match the SSD's optimal I/O size (typically 128 KB to 1 MB); a batched-read sketch follows this list.
  4. GPU-DMA overlap – While the GPU is decoding the current token, the engine pre-fetches the next KV block from the SSD into a pinned host buffer, then streams it to the GPU via DMA. This pipelining hides most of the storage latency (see the overlap sketch after this list).
  5. Evaluation setup – Experiments were run on an NVIDIA H100 GPU paired with a 2 TB NVMe PCIe 4.0 SSD, using popular LLMs (Llama‑2‑7B, Falcon‑40B) and varying host memory caps (4 GB, 8 GB, 12 GB, 16 GB). Baselines included the standard file‑based offload (vLLM) and a pure‑in‑memory cache.
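
As a rough illustration of steps 1 and 2, the sketch below routes a KV block to the page-cache path while host RAM is plentiful and to the NVMe-direct path otherwise, recording the choice in a small in-memory index. The MemAvailable heuristic, the threshold, and the index layout are assumptions made for the example; the paper's actual policy and data structures may differ.

```c
/* Sketch of the dual-path placement decision (steps 1-2). The MemAvailable
 * heuristic, the threshold, and the index layout are illustrative
 * assumptions, not the paper's actual policy or data structures. */
#include <stdint.h>
#include <stdio.h>

enum kv_path { KV_PAGE_CACHE, KV_NVME_DIRECT };

struct kv_index_entry {          /* one entry per offloaded KV block */
    uint64_t block_id;
    enum kv_path path;
    uint64_t lba;                /* meaningful only for the direct path */
};

/* Cheap free-memory signal: MemAvailable (in kB) from /proc/meminfo. */
static long free_host_kb(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[128];
    long kb = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "MemAvailable: %ld kB", &kb) == 1)
            break;
    }
    fclose(f);
    return kb;
}

/* Route a KV block: page-cache path while RAM is plentiful, NVMe-direct
 * otherwise, recording the placement in the in-memory index entry. */
static struct kv_index_entry place_kv_block(uint64_t block_id,
                                            uint64_t next_free_lba,
                                            long threshold_kb)
{
    struct kv_index_entry e = { .block_id = block_id };

    if (free_host_kb() > threshold_kb) {
        e.path = KV_PAGE_CACHE;  /* backed by an mmap'd file; OS caches it */
    } else {
        e.path = KV_NVME_DIRECT; /* written with O_DIRECT at a contiguous LBA */
        e.lba  = next_free_lba;
    }
    return e;
}

int main(void)
{
    /* Hypothetical: require 2 GB of free host RAM for the page-cache path. */
    struct kv_index_entry e = place_kv_block(7, 1024, 2L * 1024 * 1024);
    printf("block %llu -> %s\n", (unsigned long long)e.block_id,
           e.path == KV_NVME_DIRECT ? "nvme-direct" : "page-cache");
    return 0;
}
```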
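
Step 3's engine issues asynchronous, batched requests beneath the VFS. One plausible way to sketch that behavior on stock Linux is with liburing, shown below; the authors' engine may drive the NVMe driver differently, and the region path, batch size, and 128 KB request size are placeholders.

```c
/* Sketch of batched asynchronous reads in the spirit of step 3, using
 * liburing as one plausible mechanism. Names and sizes are illustrative. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdlib.h>
#include <unistd.h>

#define LBA_SIZE   4096
#define BATCH      8
#define READ_BYTES (128 * 1024)      /* assumed near-optimal SSD I/O size */

int main(void)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;
    void *bufs[BATCH];
    int fd = open("/var/lib/kvcache/region0", O_RDONLY | O_DIRECT);

    if (fd < 0 || io_uring_queue_init(64, &ring, 0) < 0)
        return 1;

    /* Queue one batch of aligned reads covering consecutive KV blocks. */
    for (int i = 0; i < BATCH; i++) {
        if (posix_memalign(&bufs[i], LBA_SIZE, READ_BYTES) != 0)
            return 1;
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i], READ_BYTES,
                           (__u64)i * READ_BYTES);
    }
    io_uring_submit(&ring);          /* one syscall submits the whole batch */

    /* Reap completions; each buffer now holds one KV block's bytes. */
    for (int i = 0; i < BATCH; i++) {
        if (io_uring_wait_cqe(&ring, &cqe) == 0)
            io_uring_cqe_seen(&ring, cqe);
    }

    for (int i = 0; i < BATCH; i++)
        free(bufs[i]);
    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```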
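
Finally, a sketch of the step-4 overlap, assuming a CUDA device: two streams and double-buffered pinned host memory let the SSD read of the next KV block proceed while the GPU decodes with the current one. read_kv_block and decode_with_kv are hypothetical stand-ins, and a production engine would likely use CUDA events instead of the coarse stream synchronizations used here.

```c
/* Sketch of double-buffered prefetching (step 4): the SSD read of block
 * t+1 overlaps with the GPU decode that uses block t. Helper functions,
 * buffer sizes, and the synchronization scheme are illustrative. */
#include <cuda_runtime_api.h>
#include <stdint.h>
#include <string.h>

#define KV_BLOCK_BYTES (1 << 20)     /* assumed 1 MiB per KV block */

/* Stand-in for the O_DIRECT read path sketched earlier. */
static void read_kv_block(uint64_t lba, void *pinned_dst, size_t bytes)
{
    (void)lba;
    memset(pinned_dst, 0, bytes);    /* pretend the SSD filled the buffer */
}

/* Stand-in for enqueueing the decode kernel(s) on the compute stream. */
static void decode_with_kv(const void *dev_kv, cudaStream_t compute)
{
    (void)dev_kv;
    (void)compute;
}

static void decode_loop(const uint64_t *lbas, size_t nblocks)
{
    void *pinned[2], *dev_kv[2];
    cudaStream_t copy, compute;

    cudaStreamCreate(&copy);
    cudaStreamCreate(&compute);
    for (int i = 0; i < 2; i++) {
        /* Pinned host memory is required for truly asynchronous DMA. */
        cudaHostAlloc(&pinned[i], KV_BLOCK_BYTES, cudaHostAllocDefault);
        cudaMalloc(&dev_kv[i], KV_BLOCK_BYTES);
    }

    /* Prime the pipeline with block 0. */
    read_kv_block(lbas[0], pinned[0], KV_BLOCK_BYTES);
    cudaMemcpyAsync(dev_kv[0], pinned[0], KV_BLOCK_BYTES,
                    cudaMemcpyHostToDevice, copy);

    for (size_t t = 0; t < nblocks; t++) {
        int cur = t & 1, nxt = (t + 1) & 1;

        /* Wait for block t's upload, then decode with it (asynchronously). */
        cudaStreamSynchronize(copy);
        decode_with_kv(dev_kv[cur], compute);

        if (t + 1 < nblocks) {
            /* This blocking SSD read overlaps with the decode of block t. */
            read_kv_block(lbas[t + 1], pinned[nxt], KV_BLOCK_BYTES);

            /* dev_kv[nxt] was last read by the decode of block t-1; make
             * sure the GPU is done with it before overwriting. */
            cudaStreamSynchronize(compute);
            cudaMemcpyAsync(dev_kv[nxt], pinned[nxt], KV_BLOCK_BYTES,
                            cudaMemcpyHostToDevice, copy);
        }
    }
    cudaStreamSynchronize(compute);
}

int main(void)
{
    uint64_t lbas[4] = { 0, 256, 512, 768 };   /* hypothetical block addresses */
    decode_loop(lbas, 4);
    return 0;
}
```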

Results & Findings

| Metric | In-memory (baseline) | File-based offload | DUAL-BLADE |
| --- | --- | --- | --- |
| Prefill latency (7B, 8 GB) | 120 ms | 158 ms | 80 ms (-33 %) |
| Decode latency per token (7B, 8 GB) | 18 ms | 31 ms | 17 ms (-42 %) |
| SSD utilization (IOPS) | N/A | 1.1 k | 2.4 k (×2.2) |
| CPU overhead (system calls / s) | 0.8 | 3.5 | 1.2 |

What the numbers mean

  • Latency cuts stem from eliminating page‑cache thrashing; the direct path provides predictable, low‑overhead reads.
  • Higher SSD utilization shows that pipeline parallelism keeps the storage device busy instead of leaving it idle while the GPU computes.
  • Reduced CPU overhead indicates fewer context switches and system‑call traffic compared with the file‑based approach.

Practical Implications

  • Edge AI deployments (e.g., autonomous drones, AR glasses, on-premise assistants) can now run 7B to 40B parameter LLMs on devices with as little as 8 GB of host RAM without sacrificing responsiveness.
  • Cost‑effective scaling – Operators can provision cheaper servers with smaller memory footprints and rely on commodity NVMe SSDs for KV storage, lowering total cost of ownership.
  • Framework integration – The dual‑path logic can be wrapped as a plug‑in for existing inference servers (vLLM, DeepSpeed‑Inference). Developers get a drop‑in performance boost without rewriting model code.
  • Predictable SLAs – By decoupling KV residency from the OS page cache, latency becomes far less jittery, which is crucial for real‑time conversational agents.
  • Future hardware synergy – As NVMe‑direct APIs (e.g., SPDK, liburing) mature, DUAL‑BLADE’s design can exploit even lower latency paths, further narrowing the gap between memory and storage.

Limitations & Future Work

  • SSD wear – Frequent KV writes could accelerate flash wear; the current prototype does not implement wear‑leveling or write‑amplification mitigation.
  • Hardware specificity – Performance gains were measured on a high‑end PCIe 4.0 SSD; lower‑end or SATA drives may not see the same throughput improvements.
  • Scalability to multi‑GPU – The current design assumes a single GPU; extending the residency manager to coordinate KV placement across several GPUs and shared storage is an open challenge.
  • Security considerations – Direct‑access paths bypass filesystem permissions; integrating encryption or secure enclaves would be needed for privacy‑sensitive workloads.

The authors plan to explore adaptive wear‑aware policies, broader hardware support (including NVMe‑oF), and tighter integration with emerging inference runtimes.

Authors

  • Bodon Jeong
  • Hongsu Byun
  • Youngjae Kim
  • Weikuan Yu
  • Kyungkeun Lee
  • Jihoon Yang
  • Sungyong Park

Paper Information

  • arXiv ID: 2604.26557v1
  • Categories: cs.DC, cs.AI, cs.PF
  • Published: April 29, 2026