[Paper] Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers

Published: December 3, 2025 at 10:22 AM EST
4 min read

Source: arXiv - 2512.03870v1

Overview

Transformer decoders power today’s large language models, but their key‑value (KV) cache grows linearly with sequence length, quickly exhausting GPU memory on long inputs. The paper “Reconstructing KV Caches with Cross‑layer Fusion for Enhanced Transformers” proposes a clever way to cut the KV cache in half while actually improving perplexity. By learning how to fuse the most informative keys and values from lower layers, the authors create a new decoder architecture that is both memory‑efficient and high‑performing.

Key Contributions

  • Cross‑layer KV fusion (FusedKV): Introduces a learnable fusion module that combines bottom‑layer values with middle‑layer keys to form the top‑layer KV cache.
  • FusedKV‑Lite: A stripped‑down variant that directly re‑uses bottom‑layer values and middle‑layer keys, eliminating extra I/O and further reducing memory overhead.
  • Empirical insight: Shows that, in deep decoders, values mainly originate from the bottom layer while keys draw useful signals from both bottom and middle layers.
  • Memory reduction: Achieves ~50 % KV‑cache memory savings across models from 332 M to 4 B parameters.
  • Performance boost: Delivers lower validation perplexity than the vanilla Transformer decoder despite the reduced cache.

Methodology

  1. Diagnosing KV flow – The authors instrumented standard decoders to trace where each top‑layer key/value comes from. Heat‑maps revealed a clear split: values are heavily bottom‑layer‑biased, while keys blend signals from the bottom and middle layers.
  2. FusedKV design (a minimal code sketch follows this list)
    • Fusion module: A tiny linear layer (or MLP) learns weights to mix the bottom‑layer values (V_bottom) and middle‑layer keys (K_mid).
    • Post‑RoPE fusion: The mixing happens after the rotary positional embeddings (RoPE) have been applied, so the relative position information is already baked into the vectors and does not need to be recomputed.
  3. FusedKV‑Lite variant – Skips the learnable fusion and simply copies V_bottom and K_mid into the top‑layer cache. This removes the extra read/write step, trading a modest perplexity increase for even lower latency.
  4. Training & integration – The fusion parameters are trained end‑to‑end together with the language model on standard next‑token prediction. No changes are required to the attention computation itself; the decoder just reads a smaller, fused cache.
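
To make items 2 and 3 concrete, here is a minimal sketch of the fusion step. The exact inputs and architecture of the fusion module are not spelled out in this summary, so the tensor choices (keys drawn from the bottom and middle layers, values from the bottom layer), the single linear layer per stream, and all names are assumptions; treat it as an illustration rather than the authors' implementation.

```python
# Hedged sketch of cross-layer KV fusion; not the authors' code.
import torch
import torch.nn as nn


class KVFusion(nn.Module):
    """Forms the top-layer KV cache from lower-layer keys/values.

    All inputs are assumed to be post-RoPE tensors of shape
    (batch, num_heads, seq_len, head_dim), so relative-position
    information is already baked in and is not recomputed.
    """

    def __init__(self, head_dim: int):
        super().__init__()
        # Tiny learnable fusion: one linear layer per stream (the summary
        # also mentions an MLP variant).
        self.key_fuse = nn.Linear(2 * head_dim, head_dim, bias=False)
        self.value_fuse = nn.Linear(head_dim, head_dim, bias=False)

    def forward(self, k_bottom, k_mid, v_bottom):
        # Empirical observation from the paper: keys draw on both the
        # bottom and middle layers, values come mainly from the bottom.
        k_top = self.key_fuse(torch.cat([k_bottom, k_mid], dim=-1))
        v_top = self.value_fuse(v_bottom)
        return k_top, v_top


def fused_kv_lite(k_mid, v_bottom):
    """FusedKV-Lite: skip the learnable fusion and reuse lower-layer
    tensors directly as the top-layer cache."""
    return k_mid, v_bottom
```

Because only the cache construction changes, the attention computation itself stays untouched, which matches the training-and-integration point above.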

Results & Findings

Model size   Baseline KV memory   FusedKV memory   Perplexity (val)
332 M        100 %                ~50 %            ↓ 1.8 %
1.3 B        100 %                ~48 %            ↓ 2.3 %
4 B          100 %                ~51 %            ↓ 2.7 %
  • Memory: Across all scales, the KV cache is cut roughly in half, directly translating into the ability to double the context length or fit larger batches on the same hardware (a back‑of‑envelope sizing follows this list).
  • Quality: Validation perplexity consistently improves (lower is better) compared with the vanilla decoder, confirming that the fused lower‑layer information is richer than what a naïve cache‑sharing scheme would retain.
  • FusedKV‑Lite: Saves an extra ~5 % of I/O bandwidth; perplexity rises by only ~0.2 % relative to full FusedKV, still beating the baseline.
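
To put the ~50 % figure in context, here is a quick back‑of‑envelope sizing of a decoder KV cache. The layer, head, and dimension counts below are illustrative assumptions, not configurations reported in the paper.

```python
# Illustrative KV-cache sizing; model-shape numbers are assumptions.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Keys and values (factor of 2), stored per layer, per head, per position.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# A hypothetical ~1B-parameter decoder serving a 4,096-token context in fp16.
baseline = kv_cache_bytes(num_layers=24, num_kv_heads=16, head_dim=128, seq_len=4096)
fused = 0.5 * baseline  # roughly the savings reported across model scales

print(f"baseline KV cache: {baseline / 2**30:.2f} GiB")  # ~0.75 GiB
print(f"fused KV cache:    {fused / 2**30:.2f} GiB")     # ~0.38 GiB
# At a fixed memory budget, the fused cache fits roughly twice the context
# length (or twice the batch size).
```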

Practical Implications

  • Long‑context inference: Developers can now run 8K‑ or 16K‑token prompts on a single GPU that previously capped out at ~4K tokens, opening up use cases like document‑level summarization or code‑base analysis.
  • Cost reduction: Halving KV memory halves the VRAM requirement for a given context length, allowing cheaper GPU instances (e.g., A100‑40 GB instead of A100‑80 GB) to serve the same workload.
  • Deploy‑time simplicity: Because the fusion happens inside the model graph, no external cache‑management code is needed—just swap the decoder class. This makes integration straightforward for existing inference stacks (e.g., Hugging Face Transformers, vLLM).
  • Potential for fine‑tuning: The lightweight fusion parameters can be fine‑tuned on domain‑specific data, offering a cheap way to adapt large models without inflating memory (see the sketch after this list).
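
As a sketch of the fine‑tuning point above: because only the fusion modules carry new parameters, domain adaptation can freeze everything else. The assumption that fusion parameters carry "fuse" in their names is carried over from the earlier sketch; real module names will differ.

```python
# Hedged sketch: train only the fusion parameters and freeze the rest.
import torch


def fusion_parameters(model: torch.nn.Module):
    for name, param in model.named_parameters():
        # Assumption: fusion modules are named with a "fuse" substring.
        param.requires_grad = "fuse" in name
    return [p for p in model.parameters() if p.requires_grad]


# optimizer = torch.optim.AdamW(fusion_parameters(model), lr=1e-4)
```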

Limitations & Future Work

  • Architectural scope: The study focuses on decoder‑only Transformers; encoder‑decoder or pure encoder models may exhibit different KV dynamics.
  • Training overhead: The fusion module adds a small number of extra parameters and a small amount of extra computation per training step, which could modestly increase training time on very large models.
  • Generalization to extreme scales: Experiments stop at 4 B parameters; it remains to be seen how the approach scales to 30 B+ models where KV patterns might shift.
  • Future directions: The authors suggest exploring adaptive fusion (different weights per token) and extending the idea to multi‑query attention or sparsity‑based caches.

Bottom line: By observing that a top layer’s values come mostly from the bottom layer while its keys draw on both the bottom and middle layers, the authors devised a simple yet powerful cross‑layer fusion technique that cuts KV‑cache memory roughly in half while actually improving model quality. For anyone building or deploying LLM‑powered services that need long contexts, FusedKV (and its Lite variant) is a practical upgrade worth trying.

Authors

  • Hongzhan Lin
  • Zhiqi Bai
  • Xinmiao Zhang
  • Sen Yang
  • Xiang Li
  • Siran Yang
  • Yunlong Xu
  • Jiaheng Liu
  • Yongchi Zhao
  • Jiamang Wang
  • Yuchi Xu
  • Wenbo Su
  • Bo Zheng

Paper Information

  • arXiv ID: 2512.03870v1
  • Categories: cs.CL
  • Published: December 3, 2025