[Paper] Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers

Published: December 3, 2025 at 10:22 AM EST
4 min read

Source: arXiv - 2512.03870v1

Overview

Transformer decoders power today’s large language models, but their key‑value (KV) cache grows linearly with sequence length, quickly exhausting GPU memory on long inputs. The paper “Reconstructing KV Caches with Cross‑layer Fusion for Enhanced Transformers” proposes a clever way to cut the KV cache in half while actually improving perplexity. By learning how to fuse the most informative keys and values from lower layers, the authors create a new decoder architecture that is both memory‑efficient and high‑performing.

Key Contributions

  • Cross‑layer KV fusion (FusedKV): Introduces a learnable fusion module that combines bottom‑layer values with middle‑layer keys to form the top‑layer KV cache.
  • FusedKV‑Lite: A stripped‑down variant that directly re‑uses bottom‑layer values and middle‑layer keys, eliminating extra I/O and further reducing memory overhead.
  • Empirical insight: Shows that, in deep decoders, values mainly originate from the bottom layer while keys draw useful signals from both bottom and middle layers.
  • Memory reduction: Achieves ~50 % KV‑cache memory savings across models from 332 M to 4 B parameters.
  • Performance boost: Delivers lower validation perplexity than the vanilla Transformer decoder despite the reduced cache.

Methodology

  1. Diagnosing KV flow – The authors instrumented standard decoders to trace where each top‑layer key/value comes from. Heat‑maps revealed a clear split: values are heavily bottom‑layer‑biased, while keys blend signals from the bottom and middle layers.
  2. FusedKV design (a minimal code sketch follows this list)
    • Fusion module: A tiny linear layer (or MLP) learns weights to mix the bottom‑layer values (V_bottom) and middle‑layer keys (K_mid).
    • Post‑RoPE fusion: The mixing happens after the rotary positional embeddings (RoPE) have been applied, so the relative position information is already baked into the vectors and does not need to be recomputed.
  3. FusedKV‑Lite variant – Skips the learnable fusion and simply copies V_bottom and K_mid into the top‑layer cache. This removes the extra read/write step, trading a modest perplexity increase for even lower latency.
  4. Training & integration – The fusion parameters are trained end‑to‑end together with the language model on standard next‑token prediction. No changes are required to the attention computation itself; the decoder just reads a smaller, fused cache.
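
To make items 2 and 3 concrete, here is a minimal sketch of the fusion step. The exact inputs and architecture of the fusion module are not spelled out in this summary, so the tensor choices (keys drawn from the bottom and middle layers, values from the bottom layer), the single linear layer per stream, and all names are assumptions; treat it as an illustration rather than the authors' implementation.

```python
# Hedged sketch of cross-layer KV fusion; not the authors' code.
import torch
import torch.nn as nn


class KVFusion(nn.Module):
    """Forms the top-layer KV cache from lower-layer keys/values.

    All inputs are assumed to be post-RoPE tensors of shape
    (batch, num_heads, seq_len, head_dim), so relative-position
    information is already baked in and is not recomputed.
    """

    def __init__(self, head_dim: int):
        super().__init__()
        # Tiny learnable fusion: one linear layer per stream (the summary
        # also mentions an MLP variant).
        self.key_fuse = nn.Linear(2 * head_dim, head_dim, bias=False)
        self.value_fuse = nn.Linear(head_dim, head_dim, bias=False)

    def forward(self, k_bottom, k_mid, v_bottom):
        # Empirical observation from the paper: keys draw on both the
        # bottom and middle layers, values come mainly from the bottom.
        k_top = self.key_fuse(torch.cat([k_bottom, k_mid], dim=-1))
        v_top = self.value_fuse(v_bottom)
        return k_top, v_top


def fused_kv_lite(k_mid, v_bottom):
    """FusedKV-Lite: skip the learnable fusion and reuse lower-layer
    tensors directly as the top-layer cache."""
    return k_mid, v_bottom
```

Because only the cache construction changes, the attention computation itself stays untouched, which matches the training-and-integration point above.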

Results & Findings

Model size   Baseline KV memory   FusedKV memory   Perplexity (val)
332 M        100 %                ~50 %            ↓ 1.8 %
1.3 B        100 %                ~48 %            ↓ 2.3 %
4 B          100 %                ~51 %            ↓ 2.7 %
  • Memory: Across all scales, the KV cache is cut roughly in half, directly translating into the ability to double the context length or fit larger batches on the same hardware (a back‑of‑envelope sizing follows this list).
  • Quality: Validation perplexity consistently improves (lower is better) compared with the vanilla decoder, confirming that the fused lower‑layer information is richer than what a naïve cache‑sharing scheme would retain.
  • FusedKV‑Lite: Saves an extra ~5 % of I/O bandwidth; perplexity rises by only ~0.2 % relative to full FusedKV, still beating the baseline.
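
To put the ~50 % figure in context, here is a quick back‑of‑envelope sizing of a decoder KV cache. The layer, head, and dimension counts below are illustrative assumptions, not configurations reported in the paper.

```python
# Illustrative KV-cache sizing; model-shape numbers are assumptions.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Keys and values (factor of 2), stored per layer, per head, per position.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# A hypothetical ~1B-parameter decoder serving a 4,096-token context in fp16.
baseline = kv_cache_bytes(num_layers=24, num_kv_heads=16, head_dim=128, seq_len=4096)
fused = 0.5 * baseline  # roughly the savings reported across model scales

print(f"baseline KV cache: {baseline / 2**30:.2f} GiB")  # ~0.75 GiB
print(f"fused KV cache:    {fused / 2**30:.2f} GiB")     # ~0.38 GiB
# At a fixed memory budget, the fused cache fits roughly twice the context
# length (or twice the batch size).
```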

Practical Implications

  • Long‑context inference: Developers can now run 8K‑ or 16K‑token prompts on a single GPU that previously capped out at ~4K tokens, opening up use cases like document‑level summarization or code‑base analysis.
  • Cost reduction: Halving KV memory halves the VRAM requirement for a given context length, allowing cheaper GPU instances (e.g., A100‑40 GB instead of A100‑80 GB) to serve the same workload.
  • Deploy‑time simplicity: Because the fusion happens inside the model graph, no external cache‑management code is needed—just swap the decoder class. This makes integration straightforward for existing inference stacks (e.g., Hugging Face Transformers, vLLM).
  • Potential for fine‑tuning: The lightweight fusion parameters can be fine‑tuned on domain‑specific data, offering a cheap way to adapt large models without inflating memory (see the sketch after this list).
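
As a sketch of the fine‑tuning point above: because only the fusion modules carry new parameters, domain adaptation can freeze everything else. The assumption that fusion parameters carry "fuse" in their names is carried over from the earlier sketch; real module names will differ.

```python
# Hedged sketch: train only the fusion parameters and freeze the rest.
import torch


def fusion_parameters(model: torch.nn.Module):
    for name, param in model.named_parameters():
        # Assumption: fusion modules are named with a "fuse" substring.
        param.requires_grad = "fuse" in name
    return [p for p in model.parameters() if p.requires_grad]


# optimizer = torch.optim.AdamW(fusion_parameters(model), lr=1e-4)
```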

Limitations & Future Work

  • Architectural scope: The study focuses on decoder‑only Transformers; encoder‑decoder or pure encoder models may exhibit different KV dynamics.
  • Training overhead: The fusion module adds a small number of extra parameters and a small amount of extra computation per training step, which could modestly increase training time on very large models.
  • Generalization to extreme scales: Experiments stop at 4 B parameters; it remains to be seen how the approach scales to 30 B+ models where KV patterns might shift.
  • Future directions: The authors suggest exploring adaptive fusion (different weights per token) and extending the idea to multi‑query attention or sparsity‑based caches.

Bottom line: By observing that a top layer’s values come mostly from the bottom layer while its keys draw on both the bottom and middle layers, the authors devised a simple yet powerful cross‑layer fusion technique that cuts KV‑cache memory roughly in half while actually improving model quality. For anyone building or deploying LLM‑powered services that need long contexts, FusedKV (and its Lite variant) is a practical upgrade worth trying.

Authors

  • Hongzhan Lin
  • Zhiqi Bai
  • Xinmiao Zhang
  • Sen Yang
  • Xiang Li
  • Siran Yang
  • Yunlong Xu
  • Jiaheng Liu
  • Yongchi Zhao
  • Jiamang Wang
  • Yuchi Xu
  • Wenbo Su
  • Bo Zheng

Paper Information

  • arXiv ID: 2512.03870v1
  • Categories: cs.CL
  • Published: December 3, 2025