[Paper] End-to-End Context Compression at Scale

Published: June 8, 2026 at 11:43 AM EDT

Source: arXiv - 2606.09659v1

Overview

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model’s context window, and are generally incompatible with modern production inference engines.

Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier.

In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
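
To make the memory argument concrete, here is a minimal back-of-the-envelope sketch of how a KV cache scales with context length, and how a shorter latent sequence would shrink it. Everything in it is an assumption for illustration: the layer count, KV-head count, head dimension, and fp16 storage are hypothetical values, not the paper's configuration. It also assumes that a ratio of 1:N means roughly N input tokens per latent embedding and that the decoder caches one key/value entry per latent embedding.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Standard KV cache size: one key and one value vector per token, per layer, per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def compressed_length(num_tokens, ratio):
    """Number of latent embeddings at a 1:ratio compression (rounded up).

    Assumption: 1:ratio means `ratio` input tokens map to one latent embedding.
    """
    return -(-num_tokens // ratio)  # ceiling division


# Hypothetical decoder shape, chosen only for illustration.
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 36, 8, 128
CONTEXT_TOKENS = 128_000

full = kv_cache_bytes(CONTEXT_TOKENS, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
print(f"uncompressed: {full / 2**30:.2f} GiB")

for ratio in (4, 8, 16):
    latents = compressed_length(CONTEXT_TOKENS, ratio)
    size = kv_cache_bytes(latents, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
    print(f"1:{ratio:<2} -> {latents} latents, {size / 2**30:.2f} GiB")
```

Because the cache size is linear in sequence length, the decoder-side saving in this simplified model equals the compression ratio. The sketch ignores the encoder's own memory and compute, which the paper's peak-memory and compression-speed measurements would account for.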

Key Contributions

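As stated in the abstract, the paper:

Runs an architecture search, pre-training many encoder-decoder compressor variants from scratch to determine how best to design and train them.
Continually pre-trains a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16.
Introduces Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage.
Demonstrates that LCLMs serve as efficient backbones for long-horizon agents.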

Methodology

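The abstract describes the approach only at a high level: an encoder maps a long token sequence to a shorter sequence of latent embeddings, which a decoder then consumes. Please refer to the full paper for detailed methodology.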

Practical Implications

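According to the abstract, LCLMs target the memory bottleneck of long-context inference, where the KV cache grows with context length. They are also presented as backbones for long-horizon agents that skim a compressed long context and expand relevant segments on demand.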
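
The abstract mentions agents that skim a compressed long context and adaptively expand relevant segments on demand. The toy sketch below illustrates only that control flow and is not the paper's method. The `Segment` class, the `compress` stand-in (which keeps every fourth word in place of real latent embeddings), and the keyword-overlap `score` are all hypothetical helpers invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    index: int
    full_text: str
    compressed: str  # stand-in for the latent embeddings a real compressor would emit


def compress(text, ratio=4):
    """Toy stand-in for a learned compressor: keep every `ratio`-th word."""
    return " ".join(text.split()[::ratio])


def split_segments(document, words_per_segment=8, ratio=4):
    """Cut the document into fixed-size word windows and compress each one."""
    words = document.split()
    segments = []
    for start in range(0, len(words), words_per_segment):
        full = " ".join(words[start:start + words_per_segment])
        segments.append(Segment(len(segments), full, compress(full, ratio)))
    return segments


def score(query, compressed):
    """Hypothetical relevance signal: number of words shared with the query."""
    return len(set(query.lower().split()) & set(compressed.lower().split()))


def build_context(query, segments, expand_top_k=1):
    """Skim every segment in compressed form, then expand only the top-k matches."""
    ranked = sorted(segments, key=lambda s: score(query, s.compressed), reverse=True)
    expanded = {s.index for s in ranked[:expand_top_k]}
    return [s.full_text if s.index in expanded else s.compressed for s in segments]


document = (
    "alpha beta gamma delta epsilon zeta eta theta "
    "cache memory grows with context length here now"
)
for piece in build_context("cache context", split_segments(document)):
    print(piece)
```

In a real system the relevance decision would presumably come from the model operating on latent embeddings rather than from word overlap. The point of the sketch is only that most segments stay in their short form while a few are restored to full length.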

Authors

Ang Li
Sean McLeish
Haozhe Chen
Nimit Kalra
Zaiqian Chen
Artem Gazizov
Venkata Anoop Suhas Kumar Morisetty
Bhavya Kailkhura
Harshitha Menon
Zhuang Liu
Brian R. Bartoldson
Tom Goldstein
Sanae Lotfi
Micah Goldblum
Pavel Izmailov

Paper Information

arXiv ID: 2606.09659v1
Categories: cs.CL, cs.AI, cs.LG
Published: June 8, 2026

