[Paper] Low-Rank Key Value Attention

Published: January 16, 2026 at 12:56 PM EST
4 min read

Source: arXiv - 2601.11471v1

Overview

Transformer models are hitting memory and compute walls, especially because the key‑value (KV) cache that powers attention grows linearly with sequence length. The paper Low‑Rank Key Value Attention introduces Low‑Rank KV Adaptation (LRKV), a drop‑in replacement for standard multi‑head attention that slashes KV memory by sharing most of the KV projection across heads while still allowing each head to retain its own expressive “residual” component. The result is a faster, cheaper pre‑training pipeline that still delivers higher quality models.

Key Contributions

  • LRKV Architecture: A single full‑rank KV projection shared across all heads plus low‑rank, head‑specific residual matrices, giving a smooth continuum from full sharing to completely independent heads.
  • Unified View of KV‑Sharing: Shows that existing tricks like Multi‑Query Attention (MQA) and Grouped‑Query Attention (GQA) are special cases of LRKV, while clearly separating LRKV from latent‑compression approaches such as Multi‑Latent Attention (MLA).
  • Empirical Wins at Scale: On 2.5 B‑parameter models, LRKV matches or exceeds standard attention quality while using ~50 % of the KV cache and cutting total FLOPs by 20‑25 %.
  • Faster Convergence: Across multiple large‑scale pre‑training runs, LRKV achieves lower training loss and validation perplexity in fewer steps.
  • Head‑Diversity Analysis: Demonstrates that LRKV preserves almost all functional diversity of attention heads, unlike aggressive KV‑sharing methods that force heads to compensate via query specialization.

Methodology

  1. Shared KV Projection
    • Each transformer layer computes a single key matrix K and value matrix V from the input tokens (the usual linear projections).
  2. Low‑Rank Residuals per Head
    • For each attention head h, small low‑rank residual matrices R^K_h and R^V_h (rank r ≪ d_model) are added to the shared K and V (see the sketch after this list):
      [ K_h = K_{\text{shared}} + R^{K}_h, \qquad V_h = V_{\text{shared}} + R^{V}_h ]
    • Because the residuals are low‑rank, they require far fewer parameters and, crucially, far less KV cache storage.
  3. Continuous Trade‑off
    • By adjusting the rank r (or scaling the residuals), practitioners can move from “full sharing” (r = 0, identical KV for all heads) to “full independence” (r = d_model, equivalent to standard multi‑head attention).
  4. Training & Integration
    • LRKV is implemented as a thin wrapper around existing attention modules, requiring only the extra residual matrices. No changes to the optimizer, loss, or data pipeline are needed.
  5. Baselines
    • The authors compare LRKV against vanilla multi‑head attention, MQA/GQA (which share KV across all heads or groups of heads), and MLA (latent compression of the KV cache) at identical model sizes and training budgets.
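
To make the construction concrete, below is a minimal PyTorch‑style sketch of the idea as described above. The module name (LowRankKVAttention), the parameter names, and the choice to factor each residual as a per‑head down‑projection followed by a rank‑r up‑projection are assumptions for illustration, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankKVAttention(nn.Module):
    """Illustrative sketch: one shared KV projection plus low-rank,
    head-specific residuals. Names, shapes, and the residual factorization
    are assumptions, not the paper's reference implementation."""

    def __init__(self, d_model: int, n_heads: int, rank: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.rank = rank

        # Per-head queries, exactly as in standard multi-head attention.
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        # Single shared KV projection: one d_head-sized K and V per token.
        self.k_shared = nn.Linear(d_model, self.d_head, bias=False)
        self.v_shared = nn.Linear(d_model, self.d_head, bias=False)
        # Low-rank, head-specific residuals: x -> rank-r coordinates per head,
        # then a learned (rank x d_head) up-projection per head.
        self.k_down = nn.Linear(d_model, n_heads * rank, bias=False)
        self.v_down = nn.Linear(d_model, n_heads * rank, bias=False)
        self.k_up = nn.Parameter(torch.randn(n_heads, rank, self.d_head) * 0.02)
        self.v_up = nn.Parameter(torch.randn(n_heads, rank, self.d_head) * 0.02)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        H, Dh, r = self.n_heads, self.d_head, self.rank

        q = self.q_proj(x).view(B, T, H, Dh).transpose(1, 2)       # (B, H, T, Dh)

        k_sh = self.k_shared(x).unsqueeze(1)                        # (B, 1, T, Dh)
        v_sh = self.v_shared(x).unsqueeze(1)
        # Head-specific low-rank residuals added on top of the shared KV.
        k_lo = self.k_down(x).view(B, T, H, r).transpose(1, 2)      # (B, H, T, r)
        v_lo = self.v_down(x).view(B, T, H, r).transpose(1, 2)
        k = k_sh + torch.einsum('bhtr,hrd->bhtd', k_lo, self.k_up)  # (B, H, T, Dh)
        v = v_sh + torch.einsum('bhtr,hrd->bhtd', v_lo, self.v_up)

        # In a decoding loop, only the shared slices and the rank-r coordinates
        # would need caching, instead of H * Dh values per token for K and V each.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, H * Dh)
        return self.out_proj(out)
```

The cache‑saving argument falls out of the shapes: per token and per layer, only the shared d_head‑sized K/V slices plus H·r low‑rank coordinates need to be stored, rather than H·d_head values each for K and V as in standard attention.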

Results & Findings

| Model Size | KV Cache Reduction | Validation Perplexity (Δ vs. baseline, lower = better) | FLOPs Savings vs. Standard | Downstream Task Gain (e.g., GLUE avg.) |
|---|---|---|---|---|
| 2.5 B | ~50 % | -0.8 | -22 % | +1.2 % accuracy |
| 1.3 B | ~45 % | -0.5 | -18 % | +0.8 % F1 |
| 350 M | ~40 % | -0.3 | -15 % | +0.5 % BLEU |
  • Faster loss reduction: LRKV reaches the same loss level ~15 % earlier in training steps.
  • Head diversity preserved: Cosine‑similarity analysis of head output vectors shows LRKV captures >95 % of the variance of full‑rank attention, while MQA/GQA drop to ~70 % (see the diagnostic sketch after this list).
  • No accuracy penalty: Even with half the KV memory, LRKV matches or exceeds the quality of the baseline across language modeling and several fine‑tuned downstream tasks.
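
The head‑diversity finding can be probed with a simple diagnostic of the kind described above. The helper below (head_diversity is a hypothetical name) compares per‑head outputs via pairwise cosine similarity; it is one plausible way to run such an analysis, not necessarily the paper's exact protocol.

```python
import torch
import torch.nn.functional as F

def head_diversity(head_outputs: torch.Tensor) -> torch.Tensor:
    """head_outputs: (n_heads, n_tokens, d_head) outputs of one attention layer.
    Returns the mean pairwise cosine similarity between heads (lower = more
    diverse). Illustrative diagnostic, not the paper's exact metric."""
    H = head_outputs.shape[0]
    flat = F.normalize(head_outputs.reshape(H, -1), dim=-1)  # one unit vector per head
    sim = flat @ flat.T                                      # (H, H) cosine matrix
    off_diag = sim[~torch.eye(H, dtype=torch.bool)]          # drop self-similarity
    return off_diag.mean()
```

Comparing this statistic for LRKV, MQA/GQA, and full multi‑head attention layers on the same inputs would surface the collapse in head diversity that the paper attributes to aggressive KV sharing.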

Practical Implications

  • Memory‑Constrained Training: Teams pre‑training large language models on GPUs/TPUs with limited VRAM can halve KV‑cache usage, enabling longer context windows or larger batch sizes without extra hardware (see the back‑of‑envelope sketch after this list).
  • Cost Savings: Reducing cumulative FLOPs by up to a quarter translates directly into lower cloud‑compute bills and faster time‑to‑research.
  • Simplified Deployment: Because LRKV is a drop‑in module, existing codebases (e.g., Hugging Face Transformers, DeepSpeed, FlashAttention) can adopt it with minimal refactoring.
  • Better Scaling Laws: Keeping KV memory low while preserving head diversity lets models scale to longer contexts (e.g., 8‑16 k tokens) without the KV cache becoming the dominant memory cost, opening doors for applications like long‑document summarization, code completion, and retrieval‑augmented generation.
  • Compatibility with Optimizations: LRKV works alongside other efficiency tricks—mixed‑precision, kernel fusion, and sparsity—so developers can stack benefits.
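
To see what a ~50 % KV‑cache reduction buys in practice, here is a back‑of‑envelope calculation. The model shape (32 layers, 32 heads, d_head = 80, fp16) and the residual rank are assumed for illustration; the numbers are not from the paper.

```python
def kv_cache_bytes(n_layers, n_heads, d_head, seq_len, batch,
                   bytes_per_elem=2, shared=False, rank=0):
    """Approximate KV-cache size. Standard attention caches K and V for every
    head; a shared-plus-low-rank scheme caches one shared d_head slice plus
    rank-r residual coordinates per head. All shapes are illustrative."""
    if shared:
        per_token = 2 * (d_head + n_heads * rank)  # shared K/V + low-rank residuals
    else:
        per_token = 2 * n_heads * d_head           # full per-head K and V
    return n_layers * batch * seq_len * per_token * bytes_per_elem

# Assumed 2.5B-scale config: 32 layers, 32 heads, d_head = 80, fp16, 8k context.
std = kv_cache_bytes(32, 32, 80, seq_len=8192, batch=8)
lrkv = kv_cache_bytes(32, 32, 80, seq_len=8192, batch=8, shared=True, rank=40)
print(f"standard: {std / 2**30:.1f} GiB, LRKV-style: {lrkv / 2**30:.1f} GiB")
```

Under these assumed shapes the standard cache comes to roughly 20 GiB at an 8 k context with batch size 8, versus about 10.6 GiB for the shared‑plus‑low‑rank layout, which is in the same ballpark as the ~50 % reduction reported in the paper.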

Limitations & Future Work

  • Residual Rank Tuning: Selecting the optimal low‑rank size still requires empirical search; the paper provides heuristics but no universal rule.
  • Hardware‑Specific Gains: The reported FLOP reductions assume a compute model where KV cache reads dominate; on architectures with different memory hierarchies (e.g., CPUs or specialized ASICs), the speed‑up may be smaller.
  • Scope of Evaluation: Experiments focus on language modeling; extending LRKV to vision transformers, multimodal models, or reinforcement‑learning agents remains an open question.
  • Theoretical Guarantees: While empirical head‑diversity is preserved, a formal analysis of when low‑rank residuals are sufficient for arbitrary attention patterns is not provided.

Future work could explore adaptive rank selection during training, integration with sparse‑attention patterns, and broader benchmarks across modalities.

Authors

  • James O’Neill
  • Robert Clancy
  • Mariia Matskevichus
  • Fergal Reid

Paper Information

  • arXiv ID: 2601.11471v1
  • Categories: cs.LG
  • Published: January 16, 2026