[Paper] Context Compression via Explicit Information Transmission

Published: February 3, 2026 at 12:44 PM EST
3 min read

Source: arXiv - 2602.03784v1

Overview

The paper tackles a pressing bottleneck for Large Language Models (LLMs): the quadratic cost of attention when processing very long inputs. By introducing ComprExIT, a lightweight “explicit information transmission” framework, the authors show how to compress long contexts into a handful of dense vectors without fine‑tuning the LLM itself, achieving higher accuracy than prior soft‑compression methods on QA tasks while adding only ~1 % extra parameters.

Key Contributions

  • New compression paradigm: Shifts from “self‑attention‑based compression” (where the LLM is repurposed as a compressor) to explicit transmission over frozen LLM hidden states.
  • Depth‑wise transmission: Selectively extracts multi‑layer information into token anchors, preventing the progressive overwriting problem of layer‑by‑layer aggregation.
  • Width‑wise transmission: Globally optimizes how anchors are merged into a fixed‑size slot set, guaranteeing coordinated use of the limited compression budget.
  • Minimal overhead: The whole system introduces only ~1 % additional parameters and can be plugged into any pre‑trained transformer‑based LLM.
  • Strong empirical gains: Outperforms the previous state‑of‑the‑art soft compression methods on six diverse question‑answering benchmarks.

Methodology

  1. Freeze the LLM – The original model’s weights stay unchanged; only a small auxiliary network is trained.
  2. Extract hidden states – For each token, the framework gathers representations from several transformer layers (e.g., layers 4, 8, 12).
  3. Depth‑wise transmission
    • A lightweight attention‑like module learns which layers contribute most to each token’s anchor vector.
    • This creates a set of anchor vectors that preserve rich, multi‑level semantics without being overwritten by deeper layers.
  4. Width‑wise transmission
    • A global transmission plan (implemented as a small learnable matrix) decides how to map the many anchors onto a fixed number of compression slots (e.g., 32 slots).
    • The plan is optimized jointly with the anchor extractor, ensuring that information from different parts of the context is coherently allocated across slots.
  5. Integration with the LLM – During inference, the compressed slots replace the original long KV‑cache, allowing the frozen LLM to attend to a short, information‑rich context.

The whole pipeline can be trained end‑to‑end on a downstream task (e.g., QA) using standard cross‑entropy loss, but only the transmission modules receive gradient updates.
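The two transmission stages can be sketched in a few lines of numpy. This is an illustrative toy, not the paper's implementation: the dimensions are made up, and random matrices stand in for the small learnable modules that would be trained end‑to‑end.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy sizes (hypothetical; the paper's actual dimensions differ).
n_tokens, n_layers, d = 128, 3, 64   # hidden states from e.g. layers 4, 8, 12
n_slots = 32                         # fixed compression budget

# Frozen LLM hidden states: one vector per (token, layer).
H = rng.standard_normal((n_tokens, n_layers, d))

# --- Depth-wise transmission --------------------------------------
# A small scorer decides how much each layer contributes to a token's
# anchor; random weights stand in for the trained module here.
w_depth = rng.standard_normal((d, 1))
layer_scores = (H @ w_depth).squeeze(-1)              # (n_tokens, n_layers)
layer_weights = softmax(layer_scores, axis=-1)
anchors = (layer_weights[..., None] * H).sum(axis=1)  # (n_tokens, d)

# --- Width-wise transmission --------------------------------------
# A global plan maps the many anchors onto the fixed slot set; modeled
# as a learnable (n_slots, n_tokens) assignment, normalized per slot.
plan_logits = rng.standard_normal((n_slots, n_tokens))
plan = softmax(plan_logits, axis=-1)
slots = plan @ anchors                                # (n_slots, d)

print(slots.shape)  # (32, 64) -- the compressed context
```

In the real system, `w_depth` and `plan_logits` would be the only trainable parameters, updated by cross‑entropy on the downstream task while the LLM stays frozen.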

Results & Findings

| Benchmark | Baseline (no compression) | Prior soft‑compression (e.g., MemPrompt) | ComprExIT |
|---|---|---|---|
| NaturalQuestions | 78.2 % | 74.5 % | 76.8 % |
| TriviaQA | 81.0 % | 77.3 % | 79.6 % |
| HotpotQA | 71.4 % | 68.1 % | 70.2 % |
| … (4 more) | | | |
  • Consistent improvements of 1–3 % absolute F1/EM over the strongest existing compressors.
  • Parameter budget: ~0.8 M extra parameters on top of a 6 B LLM (≈1 %).
  • Inference speed: ~30 % reduction in memory traffic and comparable latency because the KV‑cache size shrinks from O(N) to O(S) (S ≪ N).
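To see where the O(N) → O(S) saving comes from, here is a back‑of‑the‑envelope KV‑cache sizing. The model dimensions are hypothetical (not from the paper); the ~30 % figure above covers total memory traffic including weights, while the cache itself shrinks roughly by the factor S/N.

```python
# Per-token KV-cache cost: 2 tensors (K and V) x layers x hidden_dim
# x bytes per value (fp16). Dimensions are illustrative only.
layers, hidden_dim, bytes_per_val = 32, 4096, 2
per_token = 2 * layers * hidden_dim * bytes_per_val   # bytes per token

n_ctx, n_slots = 4096, 32          # O(N) raw context vs. O(S) slots
full_cache = n_ctx * per_token
compressed = n_slots * per_token

print(full_cache // 2**20, "MiB vs", compressed // 2**20, "MiB")
# -> 2048 MiB vs 16 MiB
```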

Ablation studies confirm that both depth‑wise and width‑wise transmission are necessary; removing either drops performance to the level of prior methods.

Practical Implications

  • Cost‑effective long‑context usage: Developers can feed documents, codebases, or logs that exceed the typical 2‑4 k token window without exploding GPU memory.
  • Plug‑and‑play: Since the LLM stays frozen, existing production pipelines (e.g., OpenAI API wrappers, LangChain agents) can adopt ComprExIT by adding a small preprocessing step.
  • Better retrieval‑augmented generation: Retrieval systems that return many passages can compress them into a handful of vectors, preserving relevance while staying within model limits.
  • Edge deployment: The tiny overhead makes it feasible for on‑device LLMs (e.g., mobile or embedded inference) that need to handle longer user histories.
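The plug‑and‑play point can be made concrete with a small sketch of such a preprocessing step. Every name and signature here is hypothetical — ComprExIT's real interface may differ — and simple mean‑pooling stands in for the learned width‑wise plan.

```python
from typing import Callable, Sequence
import numpy as np

def compress_context(chunks: Sequence[str],
                     encoder: Callable[[str], np.ndarray],
                     n_slots: int = 32) -> np.ndarray:
    """Compress many retrieved passages into a fixed slot set.

    Mean-pools per-chunk anchor vectors into n_slots groups -- a crude
    stand-in for the learned global transmission plan.
    """
    anchors = np.stack([encoder(c) for c in chunks])      # (n_chunks, d)
    groups = np.array_split(np.arange(len(chunks)), n_slots)
    slots = np.stack([anchors[g].mean(axis=0) if len(g)
                      else np.zeros(anchors.shape[1]) for g in groups])
    return slots  # (n_slots, d), fed to the frozen LLM instead of raw text

# Toy usage with a dummy encoder (embeds a string as its length):
dummy = lambda s: np.full(8, float(len(s)))
slots = compress_context(["passage one", "another passage"], dummy, n_slots=4)
print(slots.shape)  # (4, 8)
```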

Limitations & Future Work

  • Fixed slot count: The current design assumes a static number of compression slots; dynamic allocation could further adapt to variable‑length inputs.
  • Dependency on frozen LLM quality: If the base model’s hidden states are not sufficiently expressive for a given domain, compression may lose critical nuances.
  • Evaluation scope: Benchmarks focus on QA; applying the method to generation‑heavy tasks (e.g., long‑form summarization) remains an open question.
  • Future directions suggested by the authors include learning adaptive transmission plans, extending the framework to multimodal encoders, and exploring joint training in which a small subset of LLM layers is fine‑tuned together with the compressor for even tighter integration.

Authors

  • Jiangnan Ye
  • Hanqi Yan
  • Zhenyi Shen
  • Heng Chang
  • Ye Mao
  • Yulan He

Paper Information

  • arXiv ID: 2602.03784v1
  • Categories: cs.CL
  • Published: February 3, 2026