[Paper] InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
Source: arXiv - 2602.23200v1
Overview
Large language models (LLMs) keep a key‑value (KV) cache while generating text, and the cache size grows linearly with the generated sequence length. This quickly becomes the dominant memory consumer during inference, especially on GPUs where bandwidth is at a premium. InnerQ proposes a hardware‑aware, tuning‑free quantization method that compresses the KV cache, cuts memory traffic, and speeds up decoding—all without hurting the model’s answer quality.
Key Contributions
- Inner‑dimension group‑wise quantization – groups cache entries along the inner (hidden) dimension, aligning dequantization with the subsequent vector‑matrix multiply.
- Scale‑factor reuse across GPU compute units – reduces the number of memory reads needed for dequantization, yielding up to 22 % faster inference than prior KV‑cache quantizers.
- Hybrid quantization per group – automatically picks symmetric or asymmetric quantization based on local statistics, preserving numerical fidelity under aggressive compression.
- High‑precision windows – keeps the most recent tokens and “attention‑sink” tokens in higher precision to prevent outlier leakage.
- One‑time per‑channel key normalization – computed during the pre‑fill phase and folded into the query, eliminating extra runtime overhead.
- Empirical validation on LLaMA models – demonstrates near‑identical few‑shot GSM8K scores compared to full‑precision caches and outperforms existing KV‑cache quantization baselines.
Methodology
1. Cache Layout & Grouping
- The KV cache consists of two matrices: keys (K) and values (V).
- Instead of grouping rows (outer dimension) as earlier works did, InnerQ groups columns (the hidden‑size dimension). Each group contains a small block of contiguous hidden units (e.g., 64‑dim).
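The inner-dimension layout can be sketched in a few lines of NumPy. This is an illustrative reshape, not the paper's code; the cache sizes (`128` tokens, `256` hidden units) and the group size of 64 follow the example in the text.

```python
import numpy as np

GROUP = 64  # group size along the inner (hidden) dimension, per the example above

def group_inner_dim(cache: np.ndarray) -> np.ndarray:
    """Reshape a (seq_len, hidden) cache into (seq_len, n_groups, GROUP).

    Each 64-wide block of contiguous hidden units forms one quantization
    group and will share a single scale factor.
    """
    seq_len, hidden = cache.shape
    assert hidden % GROUP == 0, "hidden size must be divisible by the group size"
    return cache.reshape(seq_len, hidden // GROUP, GROUP)

K = np.random.randn(128, 256).astype(np.float32)  # toy key cache
groups = group_inner_dim(K)
print(groups.shape)  # one scale factor per (token, 64-wide block)
```

Grouping along columns rather than rows is what lets dequantization later line up with the query-times-key multiply, since that product also reduces over the hidden dimension.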
2. Quantization Scheme
- For each group, compute basic statistics (min, max, mean, variance).
- Choose symmetric quantization when the distribution is centered around zero; otherwise pick asymmetric quantization to capture skew.
- Encode the group with 4‑bit integers (the paper also explores 8‑bit) plus a shared scale factor for the whole group.
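A minimal per-group quantizer along these lines might look as follows. The mode-selection threshold (`0.1` times the standard deviation) is a stand-in heuristic of our own; the paper only says the choice is driven by local statistics.

```python
import numpy as np

def quantize_group(g: np.ndarray, bits: int = 4):
    """Quantize one group to `bits`-bit codes with a shared scale.

    Symmetric mode when the group is roughly zero-centered (illustrative
    threshold), asymmetric (with a zero offset) otherwise.
    Returns (codes, scale, zero_point, mode).
    """
    if abs(g.mean()) < 0.1 * (g.std() + 1e-8):
        scale = np.abs(g).max() / (2**(bits - 1) - 1) + 1e-12
        q = np.clip(np.round(g / scale), -2**(bits - 1), 2**(bits - 1) - 1)
        return q.astype(np.int8), scale, 0.0, "symmetric"
    lo, hi = g.min(), g.max()
    scale = (hi - lo) / (2**bits - 1) + 1e-12
    q = np.clip(np.round((g - lo) / scale), 0, 2**bits - 1)
    return q.astype(np.uint8), scale, float(lo), "asymmetric"

def dequantize_group(q: np.ndarray, scale: float, zero: float) -> np.ndarray:
    return q.astype(np.float32) * scale + zero
```

With 4-bit codes, the round-trip error per element is bounded by half a quantization step, which is why capturing skew with the asymmetric offset matters for lopsided groups.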
3. Dequantization Aligned with Attention
- During the attention step, the query vector multiplies the transposed key matrix. Because the grouping matches the inner dimension, the dequantization can be fused with the GEMV (vector‑matrix multiply) kernel.
- The shared scale factor is loaded once per compute unit and reused for all elements in the group, slashing memory bandwidth.
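The fusion can be sketched in plain Python (the real implementation is a CUDA kernel; array shapes and argument names here are our own). The key point is that the scale is read once per group and applied once per integer dot product, rather than per element.

```python
import numpy as np

def fused_dequant_gemv(query: np.ndarray, q_keys: np.ndarray,
                       scales: np.ndarray) -> np.ndarray:
    """Compute attention logits q @ K^T with dequantization fused in.

    q_keys: (seq_len, n_groups, group) integer codes, symmetric (zero offset)
    scales: (seq_len, n_groups) one fp32 scale per group, loaded once and
            reused for every element in the group
    """
    seq_len, n_groups, group = q_keys.shape
    q = query.reshape(n_groups, group)
    logits = np.zeros(seq_len, dtype=np.float32)
    for t in range(seq_len):
        for g in range(n_groups):
            s = scales[t, g]  # single scale read per group
            # integer-style dot product, scaled once at the end of the group
            logits[t] += s * np.dot(q[g], q_keys[t, g].astype(np.float32))
    return logits
```

Because the grouping matches the reduction axis of the GEMV, no intermediate dequantized key matrix is ever materialized, which is where the memory-traffic savings come from.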
4. Precision Windows & Normalization
- The most recent N tokens (e.g., last 32) and the “sink” tokens (those that receive a lot of attention) stay in higher precision (FP16) to avoid error accumulation.
- A per‑channel (per‑hidden‑unit) scaling of the key matrix is computed once during the initial prompt (prefill) and baked into the query vector, so the runtime does not need extra normalization passes.
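The query-folding trick relies on the identity (q·c)·(K/c)ᵀ = q·Kᵀ. A small sketch, with the per-channel statistic (max absolute value per hidden unit) chosen by us for illustration, since the paper does not pin it down in this summary:

```python
import numpy as np

def prefill_channel_scales(K: np.ndarray) -> np.ndarray:
    """One-time per-channel scale computed from the prefill keys.

    Illustrative choice of statistic: max |value| per hidden unit.
    """
    return np.abs(K).max(axis=0) + 1e-8

def normalize_keys(K: np.ndarray, c: np.ndarray) -> np.ndarray:
    return K / c  # keys are cached in flattened, normalized form

def fold_into_query(q: np.ndarray, c: np.ndarray) -> np.ndarray:
    return q * c  # absorbs the normalization; no extra pass over the cache

K = np.random.randn(32, 16).astype(np.float32)  # toy prefill keys
q = np.random.randn(16).astype(np.float32)      # toy query
c = prefill_channel_scales(K)
# the per-channel factors cancel exactly in the attention logits
assert np.allclose(normalize_keys(K, c) @ fold_into_query(q, c), K @ q, atol=1e-5)
```

Normalizing per channel flattens outlier channels before quantization, while folding the factors into the query keeps the decode path free of any extra normalization work.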
5. Implementation
- Integrated into a custom CUDA kernel that performs group‑wise dequantization + GEMV in a single pass.
- No additional hyper‑parameter tuning is required; the algorithm decides quantization mode automatically.
Results & Findings
| Model (LLaMA) | KV‑Cache Size Reduction | Decoding Latency vs. FP16 | Accuracy (GSM8K few‑shot) |
|---|---|---|---|
| 7B | ~4× (4‑bit) | ‑22 % vs. prior KV‑quantizer, ‑88 % vs. pure FP16 GEMV | ≈ 99 % of full‑precision score |
| 13B | ~4× (4‑bit) | Same trend, up to 22 % speedup | No statistically significant drop |
| 30B | ~4× (4‑bit) | Consistent latency gains | Slight (<0.2 %) degradation, still better than competing quantizers |
- Memory traffic dropped dramatically because each group shares a single scale factor, cutting the number of 32‑bit reads per token.
- Hybrid quantization prevented catastrophic outliers that would otherwise explode the attention scores.
- The high‑precision windows contributed the most to preserving accuracy, especially for long prompts (>1k tokens).
Practical Implications
- Deployments on commodity GPUs (e.g., RTX 3090, A100) can handle longer contexts without hitting VRAM limits, enabling richer conversational agents or document‑level summarization.
- Cost savings: Smaller KV caches mean fewer GPU instances are needed for the same throughput, directly reducing cloud‑compute bills.
- Framework integration: The approach is compatible with existing transformer libraries (e.g., Hugging Face Transformers, vLLM) because it only changes the cache storage format and the attention kernel. No model retraining or fine‑tuning is required.
- Edge‑AI scenarios: For on‑device inference where memory is scarce (e.g., Jetson, mobile GPUs), InnerQ’s 4‑bit cache can make LLM inference feasible where it previously wasn’t.
- Future hardware design: The inner‑dimension grouping aligns nicely with upcoming tensor‑core instructions that operate on small blocks, suggesting that hardware vendors could expose primitives that further accelerate this pattern.
Limitations & Future Work
- Fixed group size: The current implementation uses a static group size (e.g., 64). Adaptive grouping based on token‑specific statistics could yield even better compression.
- Precision trade‑off: While 4‑bit works well for the evaluated models, ultra‑large models (>70B) may need a hybrid of 4‑bit and 8‑bit groups to stay within acceptable accuracy margins.
- Hardware dependence: The biggest speedups are observed on NVIDIA GPUs with high memory bandwidth; performance on other accelerators (TPUs, AMD GPUs) remains to be quantified.
- Out‑of‑distribution prompts: The evaluation focuses on GSM8K and standard benchmarks; robustness to highly noisy or adversarial prompts is not fully explored.
- Future directions suggested by the authors include:
- Co‑designing the quantizer with upcoming GPU tensor‑core APIs.
- Extending the method to compress the value cache more aggressively.
- Integrating learned per‑group scaling factors to further reduce quantization error.
Authors
- Sayed Mohammadreza Tayaranian Hosseini
- Amir Ardakani
- Warren J. Gross
Paper Information
- arXiv ID: 2602.23200v1
- Categories: cs.LG, cs.CL
- Published: February 26, 2026