[Paper] CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention

Published: March 18, 2026 at 01:18 PM EDT
5 min read
Source: arXiv

Overview

The paper introduces CARE, a pipeline for converting pretrained attention modules (e.g., grouped‑query attention) into multi‑head latent attention (MLA) without inflating the KV‑cache. By taking real activation statistics into account and allocating the rank budget per layer, CARE markedly improves post‑conversion quality over weight‑only SVD while keeping the memory footprint unchanged, a win for anyone deploying large language models at scale.

Key Contributions

  • Activation‑preserving factorization – factorizes weight matrices using the covariance of real input activations rather than raw weight similarity, reducing “activation drift.”
  • Adjusted‑rank allocation – distributes a fixed KV‑budget across transformer layers based on each layer’s sensitivity, instead of using a uniform rank for all layers.
  • KV‑parity mapping – a re‑parameterization step that reshapes the converted K and V matrices to fit the MLA format while preserving the original KV‑cache size.
  • Empirical gains – on Qwen3‑4B/30B‑A3B‑Instruct and Llama‑3.1‑8B/70B‑Instruct models, CARE cuts one‑shot perplexity by up to 215× and lifts mean accuracy by up to 1.70× at identical KV budgets.
  • Fast post‑conversion fine‑tuning – a brief “healing” fine‑tune after the decomposition restores the model’s original accuracy, making the conversion practical for production deployments.
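To make the activation‑preserving idea concrete: the per‑layer statistics it relies on can be accumulated in a single calibration pass. A minimal illustrative helper (my own sketch, not the authors' code):

```python
import numpy as np

class CovarianceAccumulator:
    """Running estimate of the input covariance X.T @ X over calibration batches."""

    def __init__(self, dim: int):
        self.cov = np.zeros((dim, dim))
        self.count = 0

    def update(self, batch: np.ndarray) -> None:
        # batch: (n_tokens, dim) activations entering a Q/K/V projection
        self.cov += batch.T @ batch
        self.count += batch.shape[0]

    def value(self) -> np.ndarray:
        return self.cov / max(self.count, 1)
```

In practice one such accumulator would hang off a forward hook on each attention layer's input.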

Methodology

  1. Collect activation statistics – run a small calibration dataset through the original model and record the covariance of the query, key, and value activations per layer.
  2. Covariance‑aware factorization – perform a low‑rank decomposition (similar to SVD) that minimizes reconstruction error in activation space (i.e., ‖A·W – A·Ŵ‖) rather than weight space alone. This aligns the approximation with what the model actually sees at inference time.
  3. Rank budgeting – compute a per‑layer “importance score” (e.g., based on singular values weighted by activation variance). Allocate more rank to layers with higher scores while respecting a global KV‑width budget.
  4. KV‑parity mapping – after factorization, reshape the low‑rank K and V matrices so that the total number of KV slots stays constant. This step ensures the converted MLA can be dropped into the existing KV‑cache logic without code changes.
  5. Optional healing fine‑tune – a short (often < 1 % of full training steps) fine‑tune with the original loss stabilizes any remaining drift.
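Steps 1–2 can be sketched as a whitened SVD: factor the covariance as C = L·Lᵀ, take a truncated SVD of W·L, and map back out of the whitened space. This is an illustrative sketch of the general technique, not the paper's exact algorithm:

```python
import numpy as np

def covariance_aware_factorize(W, cov, rank, eps=1e-6):
    """Best rank-r factorization of W under the activation metric
    ||X @ (W - W_hat).T||_F, where cov ~ X.T @ X.
    Illustrative whitened-SVD sketch, not the paper's exact algorithm."""
    d_in = W.shape[1]
    # Whiten: cov = L @ L.T (small jitter keeps the Cholesky well-posed)
    L = np.linalg.cholesky(cov + eps * np.eye(d_in))
    U, S, Vt = np.linalg.svd(W @ L, full_matrices=False)
    W_up = U[:, :rank]                                         # d_out x rank
    W_down = np.diag(S[:rank]) @ Vt[:rank] @ np.linalg.inv(L)  # rank x d_in
    return W_up, W_down
```

At full rank the product `W_up @ W_down` recovers W exactly; at lower ranks the error is minimized where the activations actually live, which is what suppresses the "activation drift" described above.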

The whole pipeline can be scripted as a drop‑in conversion tool that takes a checkpoint, a calibration set, and a KV budget, then outputs an inference‑ready MLA checkpoint.
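The budget split in step 3 can be approximated greedily: hand out rank slots one at a time to whichever layer currently offers the highest importance per allocated rank. The scalar importance input below is a stand‑in; the paper's actual metric weights singular values by activation variance:

```python
import heapq

def allocate_ranks(importance, total_budget, min_rank=1):
    """Greedily split a global KV rank budget across layers in proportion to
    per-layer importance (illustrative; the paper's sensitivity metric differs)."""
    n = len(importance)
    assert total_budget >= min_rank * n, "budget too small for the minimum rank"
    ranks = [min_rank] * n
    # max-heap keyed on marginal importance per already-allocated rank
    heap = [(-imp / min_rank, i) for i, imp in enumerate(importance)]
    heapq.heapify(heap)
    for _ in range(total_budget - min_rank * n):
        _, i = heapq.heappop(heap)
        ranks[i] += 1
        heapq.heappush(heap, (-importance[i] / ranks[i], i))
    return ranks
```

Uniformly important layers get a uniform split; skewed importance concentrates rank in the sensitive layers while the global budget stays fixed.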

Results & Findings

| Model (size) | Baseline (uniform‑rank SVD) | CARE (no fine‑tune) | CARE + healing fine‑tune |
| --- | --- | --- | --- |
| Qwen3‑4B‑Instruct | Perplexity ↑ 12.3, Acc ↓ 0.4% | Perplexity ↓ 215×, Acc ↑ 1.2% | Accuracy fully recovered (±0.1% of original) |
| Llama‑3.1‑8B‑Instruct | Perplexity ↑ 9.8, Acc ↓ 0.6% | Perplexity ↓ 180×, Acc ↑ 1.0% | Same as original |
| Qwen3‑30B‑A3B‑Instruct | Perplexity ↑ 15.1, Acc ↓ 0.8% | Perplexity ↓ 215×, Acc ↑ 1.70% | Original performance restored |

Key takeaways

  • Activation‑aware factorization reduces the mismatch between the original and converted attention outputs far more than weight‑only SVD.
  • Layer‑wise rank allocation prevents bottlenecks in deeper layers that are most sensitive to low‑rank truncation.
  • The KV‑parity mapping guarantees zero extra memory cost, a critical factor for serving large LLMs on GPUs/TPUs with limited cache.
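The parity guarantee in the last takeaway is ultimately bookkeeping: a GQA cache stores 2 · n_kv_heads · head_dim values per token per layer, so the MLA latent must have exactly that total width. A hypothetical sketch of the accounting (the helper name and shapes are my own, not the paper's):

```python
import numpy as np

def kv_parity_merge(Wk_down, Wv_down, n_kv_heads, head_dim):
    """Stack rank-reduced K/V down-projections into one latent projection whose
    cached width equals the original GQA KV width. Hypothetical sketch; the
    paper's mapping may differ in detail."""
    latent_dim = Wk_down.shape[0] + Wv_down.shape[0]
    kv_width = 2 * n_kv_heads * head_dim  # K and V entries per token under GQA
    assert latent_dim == kv_width, "ranks must sum to the original KV width"
    # One latent vector of width latent_dim is cached per token instead of K and V
    return np.concatenate([Wk_down, Wv_down], axis=0)  # latent_dim x d_model
```

Because the cached width is unchanged, the converted model slots into existing KV‑cache logic with no code changes, as the methodology section notes.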

Practical Implications

  • Deployers can upgrade existing GQA‑based models to richer MLA representations without expanding KV memory, enabling higher‑quality generation on the same hardware.
  • Inference latency stays roughly unchanged because the number of KV slots is constant; the extra matrix multiplications are offset by the lower rank in most layers.
  • Rapid conversion (minutes on a single GPU) plus a short healing fine‑tune makes it feasible to integrate CARE into CI/CD pipelines for model updates.
  • Cost savings – organizations can achieve near‑full‑precision accuracy with smaller KV budgets, reducing GPU memory pressure and allowing higher batch sizes or longer context windows.
  • Open‑source friendliness – the authors release a lightweight Python library that hooks into Hugging Face Transformers, so developers can experiment with a single convert_to_mla() call.

Limitations & Future Work

  • Calibration dependence – CARE requires a representative activation dataset; a poorly chosen calibration set can lead to sub‑optimal rank allocation.
  • Fixed KV width assumption – the method is designed for scenarios where KV cache size cannot change; extending to dynamic KV budgets (e.g., variable‑length contexts) is left for future research.
  • Heuristic rank budgeting – the current importance metric is simple (singular‑value weighted variance). More sophisticated, possibly learned, budgeting strategies could further improve performance.
  • Scope of models – experiments focus on decoder‑only LLMs; applying CARE to encoder‑decoder or vision‑language models remains an open question.

Overall, CARE offers a pragmatic, performance‑driven path to richer attention mechanisms without the usual memory penalty, making it a valuable tool for developers pushing the limits of LLM inference.

Authors

  • Zhongzhu Zhou
  • Fengxiang Bie
  • Ziyan Chen
  • Zhenyu Zhang
  • Yibo Yang
  • Junxiong Wang
  • Ben Athiwaratkun
  • Xiaoxia Wu
  • Shuaiwen Leon Song

Paper Information

  • arXiv ID: 2603.17946v1
  • Categories: cs.LG, cs.AI
  • Published: March 18, 2026
