[Paper] Context Compression via Explicit Information Transmission
Source: arXiv - 2602.03784v1
Overview
The paper tackles a pressing bottleneck for Large Language Models (LLMs): the quadratic cost of attention when processing very long inputs. By introducing ComprExIT, a lightweight “explicit information transmission” framework, the authors show how to compress long contexts into a handful of dense vectors without fine‑tuning the LLM itself, achieving higher accuracy than prior soft‑compression methods on QA tasks while adding only ~1 % extra parameters.
Key Contributions
- New compression paradigm: Shifts from “self‑attention‑based compression” (where the LLM is repurposed as a compressor) to explicit transmission over frozen LLM hidden states.
- Depth‑wise transmission: Selectively extracts multi‑layer information into token anchors, preventing the progressive overwriting problem of layer‑by‑layer aggregation.
- Width‑wise transmission: Globally optimizes how anchors are merged into a fixed‑size slot set, guaranteeing coordinated use of the limited compression budget.
- Minimal overhead: The whole system introduces only ~1 % additional parameters and can be plugged into any pre‑trained transformer‑based LLM.
- Strong empirical gains: Outperforms the previous state‑of‑the‑art soft compression methods on six diverse question‑answering benchmarks.
Methodology
- Freeze the LLM – The original model’s weights stay unchanged; only a small auxiliary network is trained.
- Extract hidden states – For each token, the framework gathers representations from several transformer layers (e.g., layers 4, 8, 12).
- Depth‑wise transmission
  - A lightweight attention‑like module learns which layers contribute most to each token’s anchor vector.
  - This creates a set of anchor vectors that preserve rich, multi‑level semantics without being overwritten by deeper layers.
- Width‑wise transmission
  - A global transmission plan (implemented as a small learnable matrix) decides how to map the many anchors onto a fixed number of compression slots (e.g., 32 slots).
  - The plan is optimized jointly with the anchor extractor, ensuring that information from different parts of the context is coherently allocated across slots.
- Integration with the LLM – During inference, the compressed slots replace the original long KV‑cache, allowing the frozen LLM to attend to a short, information‑rich context.
The whole pipeline can be trained end‑to‑end on a downstream task (e.g., QA) using standard cross‑entropy loss, but only the transmission modules receive gradient updates.
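The two transmission steps above can be sketched numerically. The dimensions, the softmax‑weighted layer mixing, and the `plan` matrix below are illustrative assumptions, not the paper’s exact parameterization; in a real implementation the logits would be trainable parameters (or outputs of small networks) updated by the cross‑entropy loss while the LLM stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not from the paper):
n_tokens, n_layers, d = 6, 3, 4   # context tokens, tapped LLM layers, hidden size
n_slots = 2                        # fixed compression budget

# Frozen hidden states gathered from several LLM layers: (n_tokens, n_layers, d)
hidden = rng.standard_normal((n_tokens, n_layers, d))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# --- Depth-wise transmission (sketch) ---
# Per-token scores over layers decide how much each layer contributes
# to that token's anchor vector, instead of only keeping the top layer.
depth_logits = rng.standard_normal((n_tokens, n_layers))   # trainable in practice
depth_weights = softmax(depth_logits, axis=1)              # (n_tokens, n_layers)
anchors = np.einsum("tl,tld->td", depth_weights, hidden)   # (n_tokens, d)

# --- Width-wise transmission (sketch) ---
# A small global plan maps the n_tokens anchors onto n_slots slots,
# so the limited compression budget is allocated jointly, not greedily.
plan_logits = rng.standard_normal((n_slots, n_tokens))     # trainable in practice
plan = softmax(plan_logits, axis=1)                        # each slot's row sums to 1
slots = plan @ anchors                                     # (n_slots, d)

print(anchors.shape, slots.shape)  # → (6, 4) (2, 4)
```

At inference time, the `slots` matrix would stand in for the long context’s KV entries, so the frozen LLM attends over `n_slots` vectors instead of `n_tokens`.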
Results & Findings
| Benchmark | Baseline (no compression) | Prior soft‑compression (e.g., MemPrompt) | ComprExIT |
|---|---|---|---|
| NaturalQuestions | 78.2 % | 74.5 % | 76.8 % |
| TriviaQA | 81.0 % | 77.3 % | 79.6 % |
| HotpotQA | 71.4 % | 68.1 % | 70.2 % |
| … (4 more) | — | — | — |
- Consistent improvements of 1–3 % absolute F1/EM over the strongest existing compressors.
- Parameter budget: ~0.8 M extra parameters on top of a 6 B LLM (≈1 %).
- Inference speed: ~30 % reduction in memory traffic and comparable latency because the KV‑cache size shrinks from O(N) to O(S) (S ≪ N).
Ablation studies confirm that both depth‑wise and width‑wise transmission are necessary; removing either drops performance to the level of prior methods.
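The KV‑cache reduction reported above can be illustrated with a rough back‑of‑envelope calculation. The model dimensions here (a Llama‑style 6 B‑class model with 32 layers and hidden size 4096, stored in fp16) are assumptions for illustration, not figures from the paper:

```python
# Assumed model shape: 32 layers, hidden size 4096, 2 bytes per value (fp16).
layers, hidden, bytes_per = 32, 4096, 2

# Each token stores one key and one value vector per layer.
kv_per_token = 2 * layers * hidden * bytes_per  # bytes per cached token

n_ctx, n_slots = 8192, 32  # long context vs. fixed slot budget
full_cache_mb = n_ctx * kv_per_token / 2**20
slot_cache_mb = n_slots * kv_per_token / 2**20

print(f"full: {full_cache_mb:.0f} MiB, compressed: {slot_cache_mb:.2f} MiB")
# → full: 4096 MiB, compressed: 16.00 MiB
```

Because the cache cost is linear in the number of cached positions, shrinking it from O(N) to O(S) with S ≪ N cuts memory traffic by the same factor, which is consistent with the reported latency and memory gains.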
Practical Implications
- Cost‑effective long‑context usage: Developers can feed documents, codebases, or logs that exceed the typical 2‑4 k token window without exploding GPU memory.
- Plug‑and‑play: Since the LLM stays frozen, existing production pipelines (e.g., OpenAI API wrappers, LangChain agents) can adopt ComprExIT by adding a small preprocessing step.
- Better retrieval‑augmented generation: Retrieval systems that return many passages can compress them into a handful of vectors, preserving relevance while staying within model limits.
- Edge deployment: The tiny overhead makes it feasible for on‑device LLMs (e.g., mobile or embedded inference) that need to handle longer user histories.
Limitations & Future Work
- Fixed slot count: The current design assumes a static number of compression slots; dynamic allocation could further adapt to variable‑length inputs.
- Dependency on frozen LLM quality: If the base model’s hidden states are not sufficiently expressive for a given domain, compression may lose critical nuances.
- Evaluation scope: Benchmarks focus on QA; applying the method to generation‑heavy tasks (e.g., long‑form summarization) remains an open question.
- Future directions suggested by the authors include learning adaptive transmission plans, extending the framework to multimodal encoders, and exploring joint training in which a small subset of LLM layers is fine‑tuned together with the compressor for even tighter integration.
Authors
- Jiangnan Ye
- Hanqi Yan
- Zhenyi Shen
- Heng Chang
- Ye Mao
- Yulan He
Paper Information
- arXiv ID: 2602.03784v1
- Categories: cs.CL
- Published: February 3, 2026