[Paper] AdaGReS: Adaptive Greedy Context Selection via Redundancy-Aware Scoring for Token-Budgeted RAG
Source: arXiv - 2512.25052v1
Overview
Retrieval‑augmented generation (RAG) systems rely on pulling in external text snippets (contexts) to help large language models answer questions or complete tasks. The paper AdaGReS tackles a surprisingly common problem: the top‑k retrieval step often returns many overlapping or duplicate chunks, wasting the model's limited token budget and hurting answer quality. AdaGReS introduces a redundancy‑aware selection algorithm that balances relevance with diversity, automatically calibrates its trade‑off, and comes with provable near‑optimality guarantees.
Key Contributions
- Redundancy‑aware objective: Formulates context selection as a set‑level optimization that rewards relevance to the query while penalizing intra‑set similarity.
- Greedy token‑budgeted selection: Derives marginal gains for each candidate chunk and selects chunks greedily until the token budget is exhausted.
- Adaptive relevance‑redundancy trade‑off: Proposes a closed‑form, instance‑specific calibration of the λ parameter (relevance vs. redundancy) based on statistics of the candidate pool and the token budget, removing the need for manual tuning.
- Theoretical guarantee: Shows the objective is ε‑approximately submodular under realistic embedding‑similarity assumptions, so the greedy algorithm carries a provable (1 − 1/e − ε) approximation bound (stated below).
- Empirical validation: Demonstrates consistent reductions in redundancy and improvements in downstream QA performance on Natural Questions and a high‑redundancy biomedical drug corpus.
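Guarantees of this type, for greedy maximization of an ε‑approximately submodular objective under a budget, take the following shape (the precise constants and conditions are stated in the paper):
\[ F(S_{\text{greedy}}) \;\ge\; \left(1 - \frac{1}{e} - \varepsilon\right) \max_{S:\, \text{tok}(S) \le B} F(S) \]
where tok(S) is the total token length of the selected set, B is the token budget, and ε vanishes as the objective approaches exact submodularity.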
Methodology
- Candidate generation: For a given query, a dense retriever returns a pool of N text chunks (e.g., N = 100), each with an embedding‑similarity score sᵢ indicating its relevance to the query.
- Scoring function:
  \[ F(S) = \sum_{i\in S} s_i \;-\; \lambda \sum_{i,j\in S,\; i<j} \text{sim}(c_i, c_j) \]
- The first term rewards relevance.
- The second term penalizes redundancy using pairwise cosine similarity between chunk embeddings.
- λ controls the trade‑off.
- Adaptive λ: Instead of fixing λ, AdaGReS computes it analytically from the mean and variance of the relevance scores and the token budget B. The calibration scales λ up when the pool is highly redundant or the budget is tight, and down when relevance dominates.
- Greedy selection under budget: Starting from the empty set, the algorithm repeatedly adds the chunk with the largest marginal gain ΔF while keeping the total token count ≤ B. Because the objective is (approximately) submodular, this greedy process is provably near‑optimal.
- Implementation details: The method plugs into any existing retriever‑generator pipeline with only a few extra lines of code (computing pairwise similarities and the adaptive λ); a minimal sketch follows this list. No re‑training of the retriever or generator is required.
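The paper's closed‑form λ is not reproduced in this summary, so the sketch below uses an illustrative variance‑based placeholder (`adaptive_lambda`) purely to show where the calibration plugs in; `adagres_select`, the stopping rule at non‑positive gain, and all argument names are likewise assumptions of this sketch, not the authors' reference implementation.

```python
import numpy as np

def adaptive_lambda(scores: np.ndarray, sims: np.ndarray) -> float:
    """Illustrative placeholder for the paper's closed-form calibration:
    raise the redundancy penalty when the pool is highly similar and the
    relevance scores are flat. The paper's actual formula also factors in
    the token budget B."""
    if len(scores) < 2:
        return 0.0
    mean_sim = sims[np.triu_indices_from(sims, k=1)].mean()
    spread = scores.std() + 1e-8
    return max(0.0, float(mean_sim / spread))  # hypothetical, not the paper's formula

def adagres_select(scores, embeddings, token_lens, budget):
    """Greedily maximize F(S) = sum_{i in S} s_i - lam * sum_{i<j} sim(c_i, c_j)
    subject to the total token length of S staying within `budget`."""
    scores = np.asarray(scores, dtype=float)
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows
    sims = emb @ emb.T                                      # pairwise cosine similarity
    lam = adaptive_lambda(scores, sims)

    selected, used = [], 0
    remaining = set(range(len(scores)))
    while remaining:
        best, best_gain = None, float("-inf")
        for i in remaining:
            if used + token_lens[i] > budget:
                continue  # chunk does not fit in the remaining budget
            # Marginal gain of adding chunk i to the current set S:
            #   dF = s_i - lam * sum_{j in S} sim(c_i, c_j)
            gain = scores[i] - lam * sims[i, selected].sum()
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None or best_gain <= 0.0:
            break  # budget exhausted, or no chunk improves F (a choice of this sketch)
        selected.append(best)
        used += token_lens[best]
        remaining.discard(best)
    return selected

if __name__ == "__main__":
    # Toy pool of 5 chunks where chunk 1 nearly duplicates chunk 0:
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(5, 32))
    emb[1] = emb[0] + 0.01 * rng.normal(size=32)
    picks = adagres_select([0.9, 0.88, 0.7, 0.6, 0.5], emb, [100] * 5, budget=300)
    print(picks)  # the penalized near-duplicate is typically skipped
```

This naive loop matches the O(N · k) marginal‑gain cost quoted below; building the similarity matrix is the quadratic step the authors mitigate with approximate clustering.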
Results & Findings
| Dataset | Baseline (top‑k) | AdaGReS selection | Redundancy reduction | Tokens saved | Exact‑match gain |
|---|---|---|---|---|---|
| Natural Questions | 10 chunks (≈ 800 tokens) | 8 chunks (≈ 650 tokens) | 32 % fewer duplicate tokens | 19 % tokens saved | +2.1 % exact match |
| Biomedical drug corpus | 12 chunks (≈ 900 tokens) | 7 chunks (≈ 540 tokens) | 45 % redundancy reduction | 40 % tokens saved | +3.4 % exact match |
- Redundancy control: AdaGReS consistently selects fewer overlapping chunks, freeing up tokens for more diverse information.
- Answer quality: The modest token savings translate into measurable gains in exact‑match and F1 scores for open‑domain QA.
- Robustness: Across varying budgets (from 300 to 1200 tokens) and different retrieval models (dense vs. BM25), AdaGReS maintains its advantage, showing that the adaptive λ truly adapts to pool characteristics.
Practical Implications
- Cost‑effective inference: By squeezing out redundant tokens, developers can stay within tighter token limits (e.g., for API pricing or latency constraints) without sacrificing answer quality.
- Plug‑and‑play improvement: Existing RAG pipelines (LangChain, LlamaIndex, Haystack, etc.) can integrate AdaGReS with minimal code changes: just replace the top‑k slice with the greedy selector (see the sketch after this list).
- Domain‑specific gains: In fields with naturally high overlap (legal documents, biomedical literature, product manuals), the redundancy penalty yields larger token savings and clearer, more concise context for the LLM.
- Better user experience: End‑users see more accurate and less “hallucinated” answers because the model receives higher‑quality, non‑redundant evidence.
- Scalable to large corpora: The algorithm’s complexity is O(N log N) for sorting plus O(N · k) for marginal gain updates, which is tractable even for thousands of candidates when combined with approximate nearest‑neighbor search.
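As a hypothetical integration sketch (the `Chunk` fields and `build_context` helper are illustrative, not any framework's actual API), the change amounts to routing retrieved candidates through the `adagres_select` sketch from the Methodology section instead of slicing the top k:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float       # retriever relevance s_i
    embedding: list    # dense embedding used by sim()
    num_tokens: int

def build_context(candidates: list[Chunk], budget: int) -> str:
    # Drop-in replacement for the usual `candidates[:k]` slice; reuses
    # adagres_select from the Methodology sketch (all names hypothetical).
    idx = adagres_select(
        scores=[c.score for c in candidates],
        embeddings=[c.embedding for c in candidates],
        token_lens=[c.num_tokens for c in candidates],
        budget=budget,
    )
    return "\n\n".join(candidates[i].text for i in idx)
```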
Limitations & Future Work
- Pairwise similarity cost: Computing all pairwise similarities scales quadratically with the candidate pool; the authors mitigate this with approximate clustering, but very large pools may still be expensive.
- Embedding dependence: The redundancy penalty relies on the quality of chunk embeddings; poor embeddings could misjudge similarity and either over‑penalize or under‑penalize redundancy.
- Single‑modal focus: The current formulation assumes textual chunks; extending to multimodal evidence (tables, figures, code snippets) would require new similarity measures.
- User‑controlled trade‑off: While the adaptive λ removes manual tuning, some applications might want explicit control over relevance vs. diversity—future work could expose a higher‑level “budget aggressiveness” knob.
- End‑to‑end training: Integrating the selection objective into a differentiable retriever‑generator loop could further improve performance, an avenue the authors suggest for follow‑up research.
Authors
- Chao Peng
- Bin Wang
- Zhilei Long
- Jinfang Sheng
Paper Information
- arXiv ID: 2512.25052v1
- Categories: cs.CL, cs.AI, cs.IR
- Published: December 31, 2025