[Paper] PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression
Source: arXiv - 2512.24449v1
Overview
Large language models (LLMs) excel at generating long passages of text, but the key‑value (KV) cache they maintain during inference can quickly balloon to several gigabytes, choking GPU memory and limiting context length. The paper PackKV proposes a generic, LLM‑aware lossy‑compression framework that slashes the KV cache footprint while actually speeding up the underlying matrix‑vector operations.
Key Contributions
- LLM‑specific lossy compression for KV caches that exploits the statistical properties of transformer activations.
- Co‑designed compression/decompression kernels that integrate tightly with GPU matrix‑vector multiplication, eliminating extra memory traffic.
- Dynamic‑cache support: the scheme works as the KV cache grows token‑by‑token during generation.
- Empirical gains: ~153 % (K cache) and ~180 % (V cache) higher memory reduction than state‑of‑the‑art quantization, and 1.76×–2.72× GEMV throughput over cuBLAS on NVIDIA A100 and RTX Pro 6000 GPUs.
- Open‑source implementation (GitHub) for easy adoption.
Methodology
- Data‑driven analysis – The authors first profile KV tensors (keys K and values V) across popular LLMs to identify redundancy patterns (e.g., low‑variance dimensions, correlated rows).
- Lossy compression design – Two complementary schemes are devised (a toy sketch of both follows this list):
- Sparse quantization: aggressively quantize less‑important dimensions to fewer bits while preserving high‑variance components.
- Block‑wise low‑rank approximation: split the KV matrix into small blocks and approximate each with a low‑rank factorization, dramatically reducing storage.
- System integration – Custom CUDA kernels fuse the decompression step directly into the GEMV (matrix‑vector) compute, so the GPU never materializes the full uncompressed KV tensor. This “compute‑in‑place” approach sidesteps extra memory copies and bandwidth usage.
- Dynamic handling – As new tokens are generated, the framework incrementally compresses the newly appended KV entries without needing a full recompression pass; the second sketch after this list illustrates both the factored compute and the incremental append.
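The bullets above describe the two compression schemes only at a high level. The minimal numpy sketch below shows one way such schemes can be realized: per‑dimension quantization that gives high‑variance dimensions more bits, and block‑wise rank‑r SVD approximation. All function names, bit widths, block sizes, and ranks here are illustrative assumptions, not PackKV's actual implementation; the reconstruction error in practice depends on how much structure the real KV tensors contain.

```python
# Toy illustration of the two compression ideas summarized above.
# NOTE: names, bit widths, block sizes, and ranks are assumptions for
# illustration only, not PackKV's actual design.
import numpy as np

def sparse_quantize(K, high_bits=8, low_bits=2, keep_frac=0.25):
    """Quantize per dimension: high-variance dimensions get more bits."""
    var = K.var(axis=0)
    n_keep = max(1, int(keep_frac * K.shape[1]))
    important = np.argsort(var)[::-1][:n_keep]          # high-variance dims
    bits = np.full(K.shape[1], low_bits)
    bits[important] = high_bits
    # Uniform per-dimension quantization to the assigned bit width.
    lo, hi = K.min(axis=0), K.max(axis=0)
    levels = (2.0 ** bits) - 1
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.round((K - lo) / scale).astype(np.int32)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes * scale + lo

def blockwise_lowrank(V, block_rows=64, rank=8):
    """Approximate each row-block of V with a rank-`rank` factorization."""
    factors = []
    for start in range(0, V.shape[0], block_rows):
        block = V[start:start + block_rows]
        U, s, Vt = np.linalg.svd(block, full_matrices=False)
        r = min(rank, len(s))
        factors.append((U[:, :r] * s[:r], Vt[:r]))      # block ≈ A @ B
    return factors

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = rng.standard_normal((512, 128)).astype(np.float32)  # toy cache: tokens x head_dim
    V = rng.standard_normal((512, 128)).astype(np.float32)
    codes, lo, scale = sparse_quantize(K)
    print("K quantization error:", np.abs(K - dequantize(codes, lo, scale)).mean())
    factors = blockwise_lowrank(V)
    V_hat = np.vstack([A @ B for A, B in factors])
    print("V low-rank error:", np.abs(V - V_hat).mean())
```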
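The fused‑kernel and dynamic‑cache bullets can likewise be illustrated in plain numpy (the real system does this inside custom CUDA GEMV kernels). The toy cache below never reconstructs a decompressed key block: for a block stored as factors A, B with block ≈ A·B, the scores (A·B)·q are computed as A·(B·q), and newly generated tokens are buffered uncompressed until a full block can be compressed. Class, method, and parameter names are hypothetical, not PackKV's API.

```python
# Toy sketch of "compute-in-place" scoring plus incremental (dynamic) compression.
# Assumed names and parameters; block size and rank are illustrative only.
import numpy as np

class ToyCompressedKCache:
    def __init__(self, head_dim, block_rows=64, rank=8):
        self.head_dim, self.block_rows, self.rank = head_dim, block_rows, rank
        self.blocks = []                                         # list of (A, B) with block ≈ A @ B
        self.tail = np.empty((0, head_dim), dtype=np.float32)    # recent, still-uncompressed rows

    def append(self, k_new):
        """Add one new key row; compress the tail once it reaches block_rows."""
        self.tail = np.vstack([self.tail, k_new[None, :]])
        if len(self.tail) == self.block_rows:
            U, s, Vt = np.linalg.svd(self.tail, full_matrices=False)
            r = min(self.rank, len(s))
            self.blocks.append((U[:, :r] * s[:r], Vt[:r]))
            self.tail = np.empty((0, self.head_dim), dtype=np.float32)

    def scores(self, q):
        """Compute K @ q without materializing any decompressed block:
        (A @ B) @ q is evaluated as A @ (B @ q)."""
        parts = [A @ (B @ q) for A, B in self.blocks]
        parts.append(self.tail @ q)
        return np.concatenate(parts)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    cache = ToyCompressedKCache(head_dim=128)
    # Synthetic keys with low intrinsic rank, mimicking the redundancy the paper exploits.
    basis = rng.standard_normal((8, 128)).astype(np.float32)
    K_full = []
    for _ in range(200):                                         # simulate token-by-token generation
        k = rng.standard_normal(8).astype(np.float32) @ basis
        k += 0.01 * rng.standard_normal(128).astype(np.float32)
        cache.append(k)
        K_full.append(k)
    q = rng.standard_normal(128).astype(np.float32)
    exact = np.stack(K_full) @ q
    print("mean |error| in attention scores:", np.abs(exact - cache.scores(q)).mean())
```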
Results & Findings
| Metric | Baseline (no compression) | State‑of‑the‑art quantization | PackKV |
|---|---|---|---|
| K‑cache memory reduction | 0 % | ~70 % | ~153 % higher reduction than SOTA quantization |
| V‑cache memory reduction | 0 % | ~80 % | ~180 % higher reduction than SOTA quantization |
| Throughput (K‑cache GEMV) | 1× (cuBLAS) | ~1.2× | 1.76× |
| Throughput (V‑cache GEMV) | 1× (cuBLAS) | ~1.3× | 2.72× |
| Accuracy drop | – | ≤ 0.5 % (typical) | ≤ 0.5 % (matched) |
Key takeaway: PackKV matches the small accuracy loss of existing quantization methods while delivering more than double their memory savings and substantial speedups, because decompression is essentially free: its cost is absorbed into the fused GEMV kernel.
Practical Implications
- Longer context windows – Developers can push LLMs to handle thousands of tokens without hitting GPU memory limits, enabling richer document summarization, code generation, or chat histories.
- Higher batch throughput – With a smaller KV footprint, more concurrent requests fit on a single GPU, improving service latency and reducing hardware costs.
- Cost‑effective scaling – The memory bandwidth savings mean existing GPU clusters can serve larger workloads without upgrading to higher‑memory GPUs.
- Plug‑and‑play – PackKV works as a drop‑in replacement for KV cache handling in popular transformer libraries (e.g., Hugging Face Transformers), requiring minimal code changes.
- Edge‑AI possibilities – The reduced memory demand opens the door to running LLM inference on lower‑tier GPUs or even on‑device accelerators where memory is at a premium.
Limitations & Future Work
- Lossy nature – Although the accuracy impact is negligible for the evaluated benchmarks, safety‑critical or highly sensitive applications may still be wary of any degradation.
- Model‑specific tuning – Compression hyper‑parameters (e.g., block size, rank) were tuned per model; a fully auto‑tuned version would ease adoption across the rapidly expanding model zoo.
- Hardware diversity – Experiments focused on NVIDIA A100 and RTX Pro 6000; extending and benchmarking on AMD GPUs, TPUs, or upcoming inference‑focused ASICs remains open.
- Beyond KV – The authors suggest exploring similar compression for other intermediate activations (e.g., attention scores) to further shrink the inference memory budget.
PackKV demonstrates that smart, model‑aware compression can turn a memory bottleneck into a performance win, paving the way for more scalable, cost‑effective LLM deployments.
Authors
- Bo Jiang
- Taolue Yang
- Youyuan Liu
- Xubin He
- Sheng Di
- Sian Jin
Paper Information
- arXiv ID: 2512.24449v1
- Categories: cs.DC, cs.AI
- Published: December 30, 2025
- PDF: https://arxiv.org/pdf/2512.24449v1