[Paper] Context Compression via Explicit Information Transmission
Source: arXiv - 2602.03784v1
Overview
The paper tackles a pressing bottleneck for Large Language Models (LLMs): the quadratic cost of attention when processing very long inputs. By introducing ComprExIT, a lightweight “explicit information transmission” framework, the authors show how to compress long contexts into a handful of dense vectors without fine‑tuning the LLM itself, achieving higher accuracy than prior soft‑compression methods on QA tasks while adding only ~1 % extra parameters.
Key Contributions
- New compression paradigm: Shifts from “self‑attention‑based compression” (where the LLM is repurposed as a compressor) to explicit transmission over frozen LLM hidden states.
- Depth‑wise transmission: Selectively extracts multi‑layer information into token anchors, preventing the progressive overwriting problem of layer‑by‑layer aggregation.
- Width‑wise transmission: Globally optimizes how anchors are merged into a fixed‑size slot set, guaranteeing coordinated use of the limited compression budget.
- Minimal overhead: The whole system introduces only ~1 % additional parameters and can be plugged into any pre‑trained transformer‑based LLM.
- Strong empirical gains: Outperforms the previous state‑of‑the‑art soft compression methods on six diverse question‑answering benchmarks.
Methodology
- Freeze the LLM – The original model’s weights stay unchanged; only a small auxiliary network is trained.
- Extract hidden states – For each token, the framework gathers representations from several transformer layers (e.g., layers 4, 8, 12).
- Depth‑wise transmission
  - A lightweight attention‑like module learns which layers contribute most to each token’s anchor vector.
  - This creates a set of anchor vectors that preserve rich, multi‑level semantics without being overwritten by deeper layers.
- Width‑wise transmission
  - A global transmission plan (implemented as a small learnable matrix) decides how to map the many anchors onto a fixed number of compression slots (e.g., 32 slots).
  - The plan is optimized jointly with the anchor extractor, ensuring that information from different parts of the context is coherently allocated across slots.
- Integration with the LLM – During inference, the compressed slots replace the original long KV‑cache, allowing the frozen LLM to attend to a short, information‑rich context.
The whole pipeline can be trained end‑to‑end on a downstream task (e.g., QA) using standard cross‑entropy loss, but only the transmission modules receive gradient updates.
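The two transmission steps above can be sketched numerically. The dimensions, the softmax‑weighted layer mixing, and the `plan` matrix below are illustrative assumptions, not the paper’s exact parameterization; in a real implementation the logits would be trainable parameters (or outputs of small networks) updated by the cross‑entropy loss while the LLM stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not from the paper):
n_tokens, n_layers, d = 6, 3, 4   # context tokens, tapped LLM layers, hidden size
n_slots = 2                        # fixed compression budget

# Frozen hidden states gathered from several LLM layers: (n_tokens, n_layers, d)
hidden = rng.standard_normal((n_tokens, n_layers, d))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# --- Depth-wise transmission (sketch) ---
# Per-token scores over layers decide how much each layer contributes
# to that token's anchor vector, instead of only keeping the top layer.
depth_logits = rng.standard_normal((n_tokens, n_layers))   # trainable in practice
depth_weights = softmax(depth_logits, axis=1)              # (n_tokens, n_layers)
anchors = np.einsum("tl,tld->td", depth_weights, hidden)   # (n_tokens, d)

# --- Width-wise transmission (sketch) ---
# A small global plan maps the n_tokens anchors onto n_slots slots,
# so the limited compression budget is allocated jointly, not greedily.
plan_logits = rng.standard_normal((n_slots, n_tokens))     # trainable in practice
plan = softmax(plan_logits, axis=1)                        # each slot's row sums to 1
slots = plan @ anchors                                     # (n_slots, d)

print(anchors.shape, slots.shape)  # → (6, 4) (2, 4)
```

At inference time, the `slots` matrix would stand in for the long context’s KV entries, so the frozen LLM attends over `n_slots` vectors instead of `n_tokens`.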
Results & Findings
| Benchmark | Baseline (no compression) | Prior soft‑compression (e.g., MemPrompt) | ComprExIT |
|---|---|---|---|
| NaturalQuestions | 78.2 % | 74.5 % | 76.8 % |
| TriviaQA | 81.0 % | 77.3 % | 79.6 % |
| HotpotQA | 71.4 % | 68.1 % | 70.2 % |
| … (4 more) | — | — | — |
- Consistent improvements of 1–3 % absolute F1/EM over the strongest existing compressors.
- Parameter budget: ~0.8 M extra parameters on top of a 6 B LLM (≈1 %).
- Inference speed: ~30 % reduction in memory traffic and comparable latency because the KV‑cache size shrinks from O(N) to O(S) (S ≪ N).
Ablation studies confirm that both depth‑wise and width‑wise transmission are necessary; removing either drops performance to the level of prior methods.
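The KV‑cache reduction reported above can be illustrated with a rough back‑of‑envelope calculation. The model dimensions here (a Llama‑style 6 B‑class model with 32 layers and hidden size 4096, stored in fp16) are assumptions for illustration, not figures from the paper:

```python
# Assumed model shape: 32 layers, hidden size 4096, 2 bytes per value (fp16).
layers, hidden, bytes_per = 32, 4096, 2

# Each token stores one key and one value vector per layer.
kv_per_token = 2 * layers * hidden * bytes_per  # bytes per cached token

n_ctx, n_slots = 8192, 32  # long context vs. fixed slot budget
full_cache_mb = n_ctx * kv_per_token / 2**20
slot_cache_mb = n_slots * kv_per_token / 2**20

print(f"full: {full_cache_mb:.0f} MiB, compressed: {slot_cache_mb:.2f} MiB")
# → full: 4096 MiB, compressed: 16.00 MiB
```

Because the cache cost is linear in the number of cached positions, shrinking it from O(N) to O(S) with S ≪ N cuts memory traffic by the same factor, which is consistent with the reported latency and memory gains.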
Practical Implications
- Cost‑effective long‑context usage: Developers can feed documents, codebases, or logs that exceed the typical 2‑4 k token window without exploding GPU memory.
- Plug‑and‑play: Since the LLM stays frozen, existing production pipelines (e.g., OpenAI API wrappers, LangChain agents) can adopt ComprExIT by adding a small preprocessing step.
- Better retrieval‑augmented generation: Retrieval systems that return many passages can compress them into a handful of vectors, preserving relevance while staying within model limits.
- Edge deployment: The tiny overhead makes it feasible for on‑device LLMs (e.g., mobile or embedded inference) that need to handle longer user histories.
Limitations & Future Work
- Fixed slot count: The current design assumes a static number of compression slots; dynamic allocation could further adapt to variable‑length inputs.
- Dependency on frozen LLM quality: If the base model’s hidden states are not sufficiently expressive for a given domain, compression may lose critical nuances.
- Evaluation scope: Benchmarks focus on QA; applying the method to generation‑heavy tasks (e.g., long‑form summarization) remains an open question.
- Future directions suggested by the authors include learning adaptive transmission plans, extending the framework to multimodal encoders, and exploring joint training in which a small subset of LLM layers is fine‑tuned together with the compressor for even tighter integration.
Authors
- Jiangnan Ye
- Hanqi Yan
- Zhenyi Shen
- Heng Chang
- Ye Mao
- Yulan He
Paper Information
- arXiv ID: 2602.03784v1
- Categories: cs.CL
- Published: February 3, 2026