[Paper] End-to-End Context Compression at Scale

Published: June 8, 2026 at 11:43 AM EDT
Source: arXiv

Overview

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model’s context window, and are generally incompatible with modern production inference engines.

Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap.

We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
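To make the memory argument concrete, here is a minimal back-of-the-envelope sketch in Python. It uses the standard KV cache size formula for a transformer decoder and applies the three compression ratios named above. The decoder shape in CONFIG is hypothetical and is not taken from the paper, and the sketch assumes the decoder's cache scales with the number of latent embeddings it consumes. The paper's own memory accounting (for example, the encoder's cost) may differ.

```python
import math


def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def latent_length(num_tokens, ratio):
    """Number of latent embeddings for a 1:ratio compressor, rounded up."""
    return math.ceil(num_tokens / ratio)


# Hypothetical decoder shape, chosen for illustration only (not from the paper).
CONFIG = dict(num_layers=36, num_kv_heads=8, head_dim=128)


def report(num_tokens, ratios=(4, 8, 16)):
    """Return (ratio, sequence length, cache bytes) rows; ratio 1 is uncompressed."""
    rows = [(1, num_tokens, kv_cache_bytes(num_tokens, **CONFIG))]
    for ratio in ratios:
        length = latent_length(num_tokens, ratio)
        rows.append((ratio, length, kv_cache_bytes(length, **CONFIG)))
    return rows


if __name__ == "__main__":
    for ratio, length, size in report(131072):
        print(f"1:{ratio:<2} length={length:>7} kv_cache={size / 2**30:.3f} GiB")
```

Under these assumed dimensions and 16-bit values, a 131,072-token prompt needs 18 GiB of cache uncompressed, and a 1:4 ratio brings the decoder-side figure down to 4.5 GiB. The numbers only illustrate how the cache scales with sequence length; they say nothing about the accuracy side of the trade-off.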

Key Contributions

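The abstract describes the following contributions:

  • An architecture search that pre-trains many encoder-decoder compressor variants from scratch to determine how best to design and train them.
  • A family of 0.6B-encoder, 4B-decoder models, each continually pre-trained on over 350B tokens, at compression ratios of 1:4, 1:8, and 1:16.
  • Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage.
  • A demonstration that LCLMs serve as efficient backbones for long-horizon agents, which skim a compressed long context and expand relevant segments on demand.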

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

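The abstract reports that LCLMs improve the trade-off among general-task performance, compression speed, and peak memory usage, which matters wherever long-context inference is memory-bound. It also reports that they can back long-horizon agents that skim a compressed context and expand relevant segments on demand.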

Authors

  • Ang Li
  • Sean McLeish
  • Haozhe Chen
  • Nimit Kalra
  • Zaiqian Chen
  • Artem Gazizov
  • Venkata Anoop Suhas Kumar Morisetty
  • Bhavya Kailkhura
  • Harshitha Menon
  • Zhuang Liu
  • Brian R. Bartoldson
  • Tom Goldstein
  • Sanae Lotfi
  • Micah Goldblum
  • Pavel Izmailov

Paper Information

  • arXiv ID: 2606.09659v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: June 8, 2026