[Paper] Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation
Source: arXiv - 2602.12235v1
Overview
Large language models (LLMs) still struggle with very long inputs, especially when compute or memory is limited. A promising workaround is soft‑compression: replace a long token stream with a compact set of learned “compressed tokens” that the model can still reason over. This paper investigates the point at which such compression starts to lose the information needed to answer a query—a regime the authors call token overflow—and proposes practical ways to detect it before the LLM is invoked.
Key Contributions
- Formal definition of “token overflow” – the condition where compressed representations no longer retain enough task‑relevant content.
- Two detection strategies:
  - Query‑agnostic saturation statistics that flag likely over‑compressed representations using only the compression model's internal signals, without seeing the query.
  - Query‑aware probing classifiers that combine the compressed context with the actual query to predict overflow, achieving ~0.72 AUC‑ROC across three QA benchmarks.
- Empirical evaluation on Retrieval‑Augmented Generation (xRAG) pipelines using HotpotQA, SQuADv2, and TriviaQA, showing where overflow occurs and how detection accuracy varies with compression ratio.
- A low‑cost pre‑LLM gating mechanism that can decide whether to pass the original (uncompressed) context to the LLM, reducing the risk of hallucinations caused by over‑compression.
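The query‑agnostic strategy can be illustrated with a small numpy sketch. The statistic names, the softmax‑entropy proxy, and the threshold value below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def saturation_stats(hidden_states: np.ndarray) -> dict:
    """Compute simple saturation statistics over a compression encoder's
    outputs. hidden_states: (num_compressed_tokens, hidden_dim)."""
    # Softmax over the hidden dimension gives a pseudo-distribution per token;
    # its entropy serves as one proxy for how "full" each compressed slot is.
    shifted = hidden_states - hidden_states.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    norms = np.linalg.norm(hidden_states, axis=-1)  # hidden-state magnitudes
    return {"mean_entropy": float(entropy.mean()),
            "mean_norm": float(norms.mean())}

def flag_saturated(hidden_states: np.ndarray,
                   entropy_threshold: float = 3.5) -> bool:
    # A single threshold on mean entropy; 3.5 is an arbitrary placeholder
    # that would be tuned on held-out data in practice.
    return saturation_stats(hidden_states)["mean_entropy"] > entropy_threshold
```

Because no query is involved, this check can run once per compressed passage and be cached alongside the representation.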
Methodology
- Soft‑compression backbone (xRAG) – The authors use an existing retrieval‑augmented generation architecture that first retrieves relevant passages, then compresses each passage into a fixed‑size token set via a learned encoder.
- Defining overflow – For a given query, they compare answer quality when the LLM sees the compressed representation versus the original passage. If the compressed version fails to produce a correct answer where the original succeeds, the instance is labeled as overflow.
- Query‑agnostic detector – They extract “saturation” metrics from the compression encoder (e.g., average token entropy, norm of hidden states) that indicate how much the encoder is “filled up.” A simple threshold separates compressed from uncompressed tokens.
- Query‑aware detector – A lightweight binary classifier (logistic regression / shallow MLP) is trained on concatenated vectors of the query embedding and the compressed context embedding. The classifier learns to predict overflow from these joint features.
- Evaluation protocol – The detectors are tested on three standard QA datasets. Performance is measured with AUC‑ROC, and the impact of detection on downstream answer accuracy is reported.
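The query‑aware probe described above amounts to a binary classifier over joint query/context features. A minimal numpy sketch, assuming logistic regression trained by gradient descent (the paper also mentions a shallow MLP variant):

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def train_overflow_probe(X: np.ndarray, y: np.ndarray,
                         lr: float = 0.5, epochs: int = 1000):
    """Logistic regression via gradient descent.
    X: (n, d) concatenated [query_emb; compressed_context_emb] features.
    y: (n,) labels, 1 = overflow, 0 = answerable from compressed context."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y)) / len(y)  # gradient of log-loss w.r.t. w
        b -= lr * (p - y).mean()
    return w, b

def predict_overflow(w: np.ndarray, b: float,
                     query_emb: np.ndarray, context_emb: np.ndarray,
                     threshold: float = 0.5) -> bool:
    # Joint features: query and compressed-context embeddings, concatenated.
    x = np.concatenate([query_emb, context_emb])
    return bool(sigmoid(x @ w + b) >= threshold)
```

The probe is cheap at inference time (one dot product), which is what makes pre‑LLM gating economical.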
Results & Findings
| Detector | Avg. AUC‑ROC | Comments |
|---|---|---|
| Saturation‑only (query‑agnostic) | ~0.60 | Good at spotting compressed tokens but struggles to differentiate overflow from benign compression. |
| Query‑aware probing classifier | ~0.72 | Consistently outperforms the baseline across HotpotQA, SQuADv2, and TriviaQA. Incorporating the query boosts detection because overflow is inherently query‑dependent. |
| End‑to‑end impact | n/a | When overflow is detected and the original context is used instead of the compressed one, answer accuracy improves by 5–9% on the evaluated benchmarks. |
The study also shows that overflow tends to appear when the compression ratio exceeds ~4–5× (e.g., compressing a 1,000‑token passage down to roughly 200 tokens) and when the retrieved passages contain multiple distinct facts.
Practical Implications
- Pre‑LLM gating: Developers can insert a cheap overflow detector before calling an expensive LLM. If overflow is predicted, the system can fall back to the full context or a higher‑fidelity compression, preventing costly hallucinations.
- Dynamic compression budgets: Instead of a fixed compression size, services can adapt the budget per query based on the detector’s confidence, saving memory while preserving answer quality.
- Monitoring & debugging: Saturation statistics provide a quick health check for compression pipelines, useful for observability dashboards in production retrieval‑augmented systems.
- Edge deployment: On devices with strict memory limits (e.g., mobile assistants), the query‑aware detector adds only a few hundred bytes of model parameters, enabling smarter trade‑offs between latency and correctness.
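The gating pattern from the first bullet fits in a few lines. Here `detect_overflow` and `llm_answer` are hypothetical placeholders for whatever detector and LLM client a deployment uses, not an API from the paper:

```python
from typing import Callable

def gated_answer(query: str,
                 full_context: str,
                 compressed_context: str,
                 detect_overflow: Callable[[str, str], bool],
                 llm_answer: Callable[[str, str], str]) -> str:
    """Answer with the compressed context unless overflow is predicted."""
    if detect_overflow(query, compressed_context):
        # Overflow predicted: pay the token cost of the original passage
        # rather than risk an answer hallucinated from a lossy representation.
        return llm_answer(query, full_context)
    return llm_answer(query, compressed_context)
```

The same skeleton extends to a tiered fallback (compressed, then a lower compression ratio, then the raw passage) driven by the detector's confidence.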
Limitations & Future Work
- Dataset scope: The experiments focus on extractive QA; it remains unclear how overflow behaves for generative tasks like summarization or code generation.
- Detector simplicity: The probing classifier is shallow; more expressive models (e.g., tiny transformers) might push detection AUC higher but at the cost of added latency.
- Compression method dependence: Results are tied to the xRAG encoder; other compression schemes (e.g., quantization, LoRA‑based adapters) may exhibit different overflow patterns.
- Future directions suggested by the authors include: extending overflow detection to multi‑turn dialogues, exploring reinforcement‑learning‑based compression policies that learn to avoid overflow, and integrating the detector into end‑to‑end training loops for jointly optimized compression‑generation pipelines.
Authors
- Julia Belikova
- Danila Rozhevskii
- Dennis Svirin
- Konstantin Polev
- Alexander Panchenko
Paper Information
- arXiv ID: 2602.12235v1
- Categories: cs.CL
- Published: February 12, 2026