[Paper] InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
Source: arXiv - 2602.23200v1
Overview
Large language models (LLMs) keep a key‑value (KV) cache while generating text, and the cache size grows linearly with the generated sequence length. This quickly becomes the dominant memory consumer during inference, especially on GPUs where bandwidth is at a premium. InnerQ proposes a hardware‑aware, tuning‑free quantization method that compresses the KV cache, cuts memory traffic, and speeds up decoding—all without hurting the model’s answer quality.
Key Contributions
- Inner‑dimension group‑wise quantization – groups cache entries along the inner (hidden) dimension, aligning dequantization with the subsequent vector‑matrix multiply.
- Scale‑factor reuse across GPU compute units – reduces the number of memory reads needed for dequantization, yielding up to 22 % faster inference than prior KV‑cache quantizers.
- Hybrid quantization per group – automatically picks symmetric or asymmetric quantization based on local statistics, preserving numerical fidelity under aggressive compression.
- High‑precision windows – keeps the most recent tokens and “attention‑sink” tokens in higher precision to prevent outlier leakage.
- One‑time per‑channel key normalization – computed during the pre‑fill phase and folded into the query, eliminating extra runtime overhead.
- Empirical validation on LLaMA models – demonstrates near‑identical few‑shot GSM8K scores compared to full‑precision caches and outperforms existing KV‑cache quantization baselines.
Methodology
1. Cache Layout & Grouping
- The KV cache consists of two matrices: keys (K) and values (V).
- Instead of grouping rows (outer dimension) as earlier works did, InnerQ groups columns (the hidden‑size dimension). Each group contains a small block of contiguous hidden units (e.g., 64‑dim).
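The inner-dimension layout can be sketched in a few lines of NumPy. This is an illustrative reshape, not the paper's code; the cache sizes (`128` tokens, `256` hidden units) and the group size of 64 follow the example in the text.

```python
import numpy as np

GROUP = 64  # group size along the inner (hidden) dimension, per the example above

def group_inner_dim(cache: np.ndarray) -> np.ndarray:
    """Reshape a (seq_len, hidden) cache into (seq_len, n_groups, GROUP).

    Each 64-wide block of contiguous hidden units forms one quantization
    group and will share a single scale factor.
    """
    seq_len, hidden = cache.shape
    assert hidden % GROUP == 0, "hidden size must be divisible by the group size"
    return cache.reshape(seq_len, hidden // GROUP, GROUP)

K = np.random.randn(128, 256).astype(np.float32)  # toy key cache
groups = group_inner_dim(K)
print(groups.shape)  # one scale factor per (token, 64-wide block)
```

Grouping along columns rather than rows is what lets dequantization later line up with the query-times-key multiply, since that product also reduces over the hidden dimension.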
2. Quantization Scheme
- For each group, compute basic statistics (min, max, mean, variance).
- Choose symmetric quantization when the distribution is centered around zero; otherwise pick asymmetric quantization to capture skew.
- Encode the group with 4‑bit integers (the paper also explores 8‑bit) plus a shared scale factor for the whole group.
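A minimal per-group quantizer along these lines might look as follows. The mode-selection threshold (`0.1` times the standard deviation) is a stand-in heuristic of our own; the paper only says the choice is driven by local statistics.

```python
import numpy as np

def quantize_group(g: np.ndarray, bits: int = 4):
    """Quantize one group to `bits`-bit codes with a shared scale.

    Symmetric mode when the group is roughly zero-centered (illustrative
    threshold), asymmetric (with a zero offset) otherwise.
    Returns (codes, scale, zero_point, mode).
    """
    if abs(g.mean()) < 0.1 * (g.std() + 1e-8):
        scale = np.abs(g).max() / (2**(bits - 1) - 1) + 1e-12
        q = np.clip(np.round(g / scale), -2**(bits - 1), 2**(bits - 1) - 1)
        return q.astype(np.int8), scale, 0.0, "symmetric"
    lo, hi = g.min(), g.max()
    scale = (hi - lo) / (2**bits - 1) + 1e-12
    q = np.clip(np.round((g - lo) / scale), 0, 2**bits - 1)
    return q.astype(np.uint8), scale, float(lo), "asymmetric"

def dequantize_group(q: np.ndarray, scale: float, zero: float) -> np.ndarray:
    return q.astype(np.float32) * scale + zero
```

With 4-bit codes, the round-trip error per element is bounded by half a quantization step, which is why capturing skew with the asymmetric offset matters for lopsided groups.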
3. Dequantization Aligned with Attention
- During the attention step, the query vector multiplies the transposed key matrix. Because the grouping matches the inner dimension, the dequantization can be fused with the GEMV (vector‑matrix multiply) kernel.
- The shared scale factor is loaded once per compute unit and reused for all elements in the group, slashing memory bandwidth.
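The fusion can be sketched in plain Python (the real implementation is a CUDA kernel; array shapes and argument names here are our own). The key point is that the scale is read once per group and applied once per integer dot product, rather than per element.

```python
import numpy as np

def fused_dequant_gemv(query: np.ndarray, q_keys: np.ndarray,
                       scales: np.ndarray) -> np.ndarray:
    """Compute attention logits q @ K^T with dequantization fused in.

    q_keys: (seq_len, n_groups, group) integer codes, symmetric (zero offset)
    scales: (seq_len, n_groups) one fp32 scale per group, loaded once and
            reused for every element in the group
    """
    seq_len, n_groups, group = q_keys.shape
    q = query.reshape(n_groups, group)
    logits = np.zeros(seq_len, dtype=np.float32)
    for t in range(seq_len):
        for g in range(n_groups):
            s = scales[t, g]  # single scale read per group
            # integer-style dot product, scaled once at the end of the group
            logits[t] += s * np.dot(q[g], q_keys[t, g].astype(np.float32))
    return logits
```

Because the grouping matches the reduction axis of the GEMV, no intermediate dequantized key matrix is ever materialized, which is where the memory-traffic savings come from.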
4. Precision Windows & Normalization
- The most recent N tokens (e.g., last 32) and the “sink” tokens (those that receive a lot of attention) stay in higher precision (FP16) to avoid error accumulation.
- A per‑channel (per‑hidden‑unit) scaling of the key matrix is computed once during the initial prompt (prefill) and baked into the query vector, so the runtime does not need extra normalization passes.
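The query-folding trick relies on the identity (q·c)·(K/c)ᵀ = q·Kᵀ. A small sketch, with the per-channel statistic (max absolute value per hidden unit) chosen by us for illustration, since the paper does not pin it down in this summary:

```python
import numpy as np

def prefill_channel_scales(K: np.ndarray) -> np.ndarray:
    """One-time per-channel scale computed from the prefill keys.

    Illustrative choice of statistic: max |value| per hidden unit.
    """
    return np.abs(K).max(axis=0) + 1e-8

def normalize_keys(K: np.ndarray, c: np.ndarray) -> np.ndarray:
    return K / c  # keys are cached in flattened, normalized form

def fold_into_query(q: np.ndarray, c: np.ndarray) -> np.ndarray:
    return q * c  # absorbs the normalization; no extra pass over the cache

K = np.random.randn(32, 16).astype(np.float32)  # toy prefill keys
q = np.random.randn(16).astype(np.float32)      # toy query
c = prefill_channel_scales(K)
# the per-channel factors cancel exactly in the attention logits
assert np.allclose(normalize_keys(K, c) @ fold_into_query(q, c), K @ q, atol=1e-5)
```

Normalizing per channel flattens outlier channels before quantization, while folding the factors into the query keeps the decode path free of any extra normalization work.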
5. Implementation
- Integrated into a custom CUDA kernel that performs group‑wise dequantization + GEMV in a single pass.
- No additional hyper‑parameter tuning is required; the algorithm decides quantization mode automatically.
Results & Findings
| Model (LLaMA) | KV‑Cache Size Reduction | Decoding Latency vs. FP16 | Accuracy (GSM8K few‑shot) |
|---|---|---|---|
| 7B | ~4× (4‑bit) | ‑22 % vs. prior KV‑quantizer, ‑88 % vs. pure FP16 GEMV | ≈ 99 % of full‑precision score |
| 13B | ~4× (4‑bit) | Same trend, up to 22 % speedup | No statistically significant drop |
| 30B | ~4× (4‑bit) | Consistent latency gains | Slight (<0.2 %) degradation, still better than competing quantizers |
- Memory traffic dropped dramatically because each group shares a single scale factor, cutting the number of 32‑bit reads per token.
- Hybrid quantization prevented catastrophic outliers that would otherwise explode the attention scores.
- The high‑precision windows contributed the most to preserving accuracy, especially for long prompts (>1k tokens).
Practical Implications
- Deployments on commodity GPUs (e.g., RTX 3090, A100) can handle longer contexts without hitting VRAM limits, enabling richer conversational agents or document‑level summarization.
- Cost savings: Smaller KV caches mean fewer GPU instances are needed for the same throughput, directly reducing cloud‑compute bills.
- Framework integration: The approach is compatible with existing transformer libraries (e.g., Hugging Face Transformers, vLLM) because it only changes the cache storage format and the attention kernel. No model retraining or fine‑tuning is required.
- Edge‑AI scenarios: For on‑device inference where memory is scarce (e.g., Jetson, mobile GPUs), InnerQ’s 4‑bit cache can make LLM inference feasible where it previously wasn’t.
- Future hardware design: The inner‑dimension grouping aligns nicely with upcoming tensor‑core instructions that operate on small blocks, suggesting that hardware vendors could expose primitives that further accelerate this pattern.
Limitations & Future Work
- Fixed group size: The current implementation uses a static group size (e.g., 64). Adaptive grouping based on token‑specific statistics could yield even better compression.
- Precision trade‑off: While 4‑bit works well for the evaluated models, ultra‑large models (>70B) may need a hybrid of 4‑bit and 8‑bit groups to stay within acceptable accuracy margins.
- Hardware dependence: The biggest speedups are observed on NVIDIA GPUs with high memory bandwidth; performance on other accelerators (TPUs, AMD GPUs) remains to be quantified.
- Out‑of‑distribution prompts: The evaluation focuses on GSM8K and standard benchmarks; robustness to highly noisy or adversarial prompts is not fully explored.
- Future directions suggested by the authors include:
- Co‑designing the quantizer with upcoming GPU tensor‑core APIs.
- Extending the method to compress the value cache more aggressively.
- Integrating learned per‑group scaling factors to further reduce quantization error.
Authors
- Sayed Mohammadreza Tayaranian Hosseini
- Amir Ardakani
- Warren J. Gross
Paper Information
- arXiv ID: 2602.23200v1
- Categories: cs.LG, cs.CL
- Published: February 26, 2026