[Paper] PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression
Source: arXiv - 2512.24449v1
Overview
Large language models (LLMs) excel at generating long passages of text, but the key‑value (KV) cache they maintain during inference can quickly balloon to several gigabytes, choking GPU memory and limiting context length. The paper PackKV proposes a generic, LLM‑aware lossy‑compression framework that slashes the KV cache footprint while actually speeding up the underlying matrix‑vector operations.
Key Contributions
- LLM‑specific lossy compression for KV caches that exploits the statistical properties of transformer activations.
- Co‑designed compression/decompression kernels that integrate tightly with GPU matrix‑vector multiplication, eliminating extra memory traffic.
- Dynamic‑cache support: the scheme works as the KV cache grows token‑by‑token during generation.
- Empirical gains: ~153 % (K cache) and ~180 % (V cache) higher memory reduction than state‑of‑the‑art quantization, and 1.76×–2.72× GEMV throughput over cuBLAS on NVIDIA A100 and RTX Pro 6000 GPUs.
- Open‑source implementation (GitHub) for easy adoption.
Methodology
- Data‑driven analysis – The authors first profile KV tensors (keys K and values V) across popular LLMs to identify redundancy patterns (e.g., low‑variance dimensions, correlated rows).
- Lossy compression design – Two complementary schemes are devised (a toy sketch of both follows this list):
- Sparse quantization: aggressively quantize less‑important dimensions to fewer bits while preserving high‑variance components.
- Block‑wise low‑rank approximation: split the KV matrix into small blocks and approximate each with a low‑rank factorization, dramatically reducing storage.
- System integration – Custom CUDA kernels fuse the decompression step directly into the GEMV (matrix‑vector) compute, so the GPU never materializes the full uncompressed KV tensor. This “compute‑in‑place” approach sidesteps extra memory copies and bandwidth usage.
- Dynamic handling – As new tokens are generated, the framework incrementally compresses the newly appended KV entries without needing a full recompression pass; the second sketch after this list illustrates both the factored compute and the incremental append.
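The bullets above describe the two compression schemes only at a high level. The minimal numpy sketch below shows one way such schemes can be realized: per‑dimension quantization that gives high‑variance dimensions more bits, and block‑wise rank‑r SVD approximation. All function names, bit widths, block sizes, and ranks here are illustrative assumptions, not PackKV's actual implementation; the reconstruction error in practice depends on how much structure the real KV tensors contain.

```python
# Toy illustration of the two compression ideas summarized above.
# NOTE: names, bit widths, block sizes, and ranks are assumptions for
# illustration only, not PackKV's actual design.
import numpy as np

def sparse_quantize(K, high_bits=8, low_bits=2, keep_frac=0.25):
    """Quantize per dimension: high-variance dimensions get more bits."""
    var = K.var(axis=0)
    n_keep = max(1, int(keep_frac * K.shape[1]))
    important = np.argsort(var)[::-1][:n_keep]          # high-variance dims
    bits = np.full(K.shape[1], low_bits)
    bits[important] = high_bits
    # Uniform per-dimension quantization to the assigned bit width.
    lo, hi = K.min(axis=0), K.max(axis=0)
    levels = (2.0 ** bits) - 1
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.round((K - lo) / scale).astype(np.int32)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes * scale + lo

def blockwise_lowrank(V, block_rows=64, rank=8):
    """Approximate each row-block of V with a rank-`rank` factorization."""
    factors = []
    for start in range(0, V.shape[0], block_rows):
        block = V[start:start + block_rows]
        U, s, Vt = np.linalg.svd(block, full_matrices=False)
        r = min(rank, len(s))
        factors.append((U[:, :r] * s[:r], Vt[:r]))      # block ≈ A @ B
    return factors

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = rng.standard_normal((512, 128)).astype(np.float32)  # toy cache: tokens x head_dim
    V = rng.standard_normal((512, 128)).astype(np.float32)
    codes, lo, scale = sparse_quantize(K)
    print("K quantization error:", np.abs(K - dequantize(codes, lo, scale)).mean())
    factors = blockwise_lowrank(V)
    V_hat = np.vstack([A @ B for A, B in factors])
    print("V low-rank error:", np.abs(V - V_hat).mean())
```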
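The fused‑kernel and dynamic‑cache bullets can likewise be illustrated in plain numpy (the real system does this inside custom CUDA GEMV kernels). The toy cache below never reconstructs a decompressed key block: for a block stored as factors A, B with block ≈ A·B, the scores (A·B)·q are computed as A·(B·q), and newly generated tokens are buffered uncompressed until a full block can be compressed. Class, method, and parameter names are hypothetical, not PackKV's API.

```python
# Toy sketch of "compute-in-place" scoring plus incremental (dynamic) compression.
# Assumed names and parameters; block size and rank are illustrative only.
import numpy as np

class ToyCompressedKCache:
    def __init__(self, head_dim, block_rows=64, rank=8):
        self.head_dim, self.block_rows, self.rank = head_dim, block_rows, rank
        self.blocks = []                                         # list of (A, B) with block ≈ A @ B
        self.tail = np.empty((0, head_dim), dtype=np.float32)    # recent, still-uncompressed rows

    def append(self, k_new):
        """Add one new key row; compress the tail once it reaches block_rows."""
        self.tail = np.vstack([self.tail, k_new[None, :]])
        if len(self.tail) == self.block_rows:
            U, s, Vt = np.linalg.svd(self.tail, full_matrices=False)
            r = min(self.rank, len(s))
            self.blocks.append((U[:, :r] * s[:r], Vt[:r]))
            self.tail = np.empty((0, self.head_dim), dtype=np.float32)

    def scores(self, q):
        """Compute K @ q without materializing any decompressed block:
        (A @ B) @ q is evaluated as A @ (B @ q)."""
        parts = [A @ (B @ q) for A, B in self.blocks]
        parts.append(self.tail @ q)
        return np.concatenate(parts)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    cache = ToyCompressedKCache(head_dim=128)
    # Synthetic keys with low intrinsic rank, mimicking the redundancy the paper exploits.
    basis = rng.standard_normal((8, 128)).astype(np.float32)
    K_full = []
    for _ in range(200):                                         # simulate token-by-token generation
        k = rng.standard_normal(8).astype(np.float32) @ basis
        k += 0.01 * rng.standard_normal(128).astype(np.float32)
        cache.append(k)
        K_full.append(k)
    q = rng.standard_normal(128).astype(np.float32)
    exact = np.stack(K_full) @ q
    print("mean |error| in attention scores:", np.abs(exact - cache.scores(q)).mean())
```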
Results & Findings
| Metric | Baseline (no compression) | State‑of‑the‑art quantization | PackKV |
|---|---|---|---|
| K‑cache memory reduction | 0 % | ~70 % | ~153 % higher reduction than SOTA quantization |
| V‑cache memory reduction | 0 % | ~80 % | ~180 % higher reduction than SOTA quantization |
| Throughput (K‑cache GEMV) | 1× (cuBLAS) | ~1.2× | 1.76× |
| Throughput (V‑cache GEMV) | 1× (cuBLAS) | ~1.3× | 2.72× |
| Accuracy drop | – | ≤ 0.5 % (typical) | ≤ 0.5 % (matched) |
Key takeaway: PackKV matches the small accuracy loss of existing quantization methods while delivering more than double their memory savings and substantial speedups, because decompression is essentially free: its cost is absorbed into the fused GEMV kernel.
Practical Implications
- Longer context windows – Developers can push LLMs to handle thousands of tokens without hitting GPU memory limits, enabling richer document summarization, code generation, or chat histories.
- Higher batch throughput – With a smaller KV footprint, more concurrent requests fit on a single GPU, improving service latency and reducing hardware costs.
- Cost‑effective scaling – The memory bandwidth savings mean existing GPU clusters can serve larger workloads without upgrading to higher‑memory GPUs.
- Plug‑and‑play – PackKV works as a drop‑in replacement for KV cache handling in popular transformer libraries (e.g., Hugging Face Transformers), requiring minimal code changes.
- Edge‑AI possibilities – The reduced memory demand opens the door to running LLM inference on lower‑tier GPUs or even on‑device accelerators where memory is at a premium.
Limitations & Future Work
- Lossy nature – Although the accuracy impact is negligible for the evaluated benchmarks, safety‑critical or highly sensitive applications may still be wary of any degradation.
- Model‑specific tuning – Compression hyper‑parameters (e.g., block size, rank) were tuned per model; a fully auto‑tuned version would ease adoption across the rapidly expanding model zoo.
- Hardware diversity – Experiments focused on NVIDIA A100 and RTX Pro 6000; extending and benchmarking on AMD GPUs, TPUs, or upcoming inference‑focused ASICs remains open.
- Beyond KV – The authors suggest exploring similar compression for other intermediate activations (e.g., attention scores) to further shrink the inference memory budget.
PackKV demonstrates that smart, model‑aware compression can turn a memory bottleneck into a performance win, paving the way for more scalable, cost‑effective LLM deployments.
Authors
- Bo Jiang
- Taolue Yang
- Youyuan Liu
- Xubin He
- Sheng Di
- Sian Jin
Paper Information
- arXiv ID: 2512.24449v1
- Categories: cs.DC, cs.AI
- Published: December 30, 2025
- PDF: https://arxiv.org/pdf/2512.24449v1