[Paper] AdaGReS: Adaptive Greedy Context Selection via Redundancy-Aware Scoring for Token-Budgeted RAG

Published: December 31, 2025 at 01:48 PM EST
4 min read
Source: arXiv - 2512.25052v1

Overview

Retrieval‑augmented generation (RAG) systems rely on pulling in external text snippets (contexts) to help large language models answer questions or complete tasks. The paper AdaGReS tackles a surprisingly common problem: the top‑k retrieval step often returns many overlapping or duplicate chunks, wasting the model's limited token budget and hurting answer quality. AdaGReS introduces a redundancy‑aware selection algorithm that balances relevance with diversity, calibrates that trade‑off automatically, and comes with provable near‑optimality guarantees.

Key Contributions

  • Redundancy‑aware objective: Formulates context selection as a set‑level optimization that rewards relevance to the query while penalizing intra‑set similarity.
  • Greedy token‑budgeted selection: Derives marginal gains for each candidate chunk and selects chunks greedily until the token budget is exhausted.
  • Adaptive relevance‑redundancy trade‑off: Proposes a closed‑form, instance‑specific calibration of the λ parameter (relevance vs. redundancy) based on statistics of the candidate pool and the token budget, removing the need for manual tuning.
  • Theoretical guarantee: Shows the objective is ε‑approximately submodular under realistic embedding‑similarity assumptions, giving the greedy algorithm a provable (1 − 1/e − ε) approximation bound (rendered after this list).
  • Empirical validation: Demonstrates consistent reductions in redundancy and improvements in downstream QA performance on Natural Questions and a high‑redundancy biomedical drug corpus.
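
Written out, the guarantee in the theoretical bullet takes the usual form below, where S_greedy is the greedy solution, B is the token budget, and the maximum ranges over all candidate sets that fit the budget. This rendering is inferred from the bullet's statement rather than copied from the paper:

\[ F(S_{\text{greedy}}) \;\ge\; \Bigl(1 - \frac{1}{e} - \varepsilon\Bigr) \max_{S:\, \text{tokens}(S) \le B} F(S) \]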

Methodology

  1. Candidate generation: For a given query, a dense retriever returns a pool of N text chunks (e.g., 100). Each chunk has an embedding similarity score sᵢ indicating relevance.

  2. Scoring function:

    \[ F(S) = \sum_{i\in S} s_i \;-\; \lambda \sum_{i,j\in S,\; i<j} \text{sim}(c_i,c_j) \]

    • The first term rewards relevance.
    • The second term penalizes redundancy using pairwise cosine similarity between chunk embeddings.
    • λ controls the trade‑off.
  3. Adaptive λ: Instead of fixing λ, AdaGReS computes it analytically from the mean/variance of relevance scores and the token budget B. The formula scales λ up when the pool is highly redundant or the budget is tight, and down when relevance dominates.

  4. Greedy selection under budget: Starting with an empty set, the algorithm repeatedly adds the chunk that yields the largest marginal gain ΔF while keeping the total token count ≤ B. Because the objective is (approximately) submodular, this greedy process is near‑optimal.

  5. Implementation details: The method plugs into any existing retriever‑generator pipeline with only a few extra lines of code (computing pairwise similarities and the adaptive λ); a minimal sketch follows below. No re‑training of the retriever or generator is required.
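
Here is a minimal Python sketch of steps 2–4. The paper's exact closed form for λ is not reproduced in this summary, so `adaptive_lambda` is an illustrative stand‑in driven by the statistics the paper names (pool redundancy, relevance scores, and the token budget); `adagres_select` and every other name here are hypothetical, not the authors' code.

```python
import numpy as np

def adaptive_lambda(scores, sims, token_costs, budget):
    """Illustrative calibration of lambda (NOT the paper's closed form):
    it grows when the pool is redundant or the budget is tight, and
    shrinks when relevance dominates, mirroring the behavior in step 3."""
    mean_sim = max(sims[np.triu_indices_from(sims, k=1)].mean(), 0.0)
    budget_pressure = token_costs.mean() * len(scores) / max(budget, 1.0)
    return mean_sim * budget_pressure / (scores.mean() + 1e-8)

def adagres_select(scores, sims, token_costs, budget, lam=None):
    """Greedy token-budgeted maximization of
    F(S) = sum_{i in S} s_i - lam * sum_{i<j in S} sim(c_i, c_j)."""
    n = len(scores)
    if lam is None:
        lam = adaptive_lambda(scores, sims, token_costs, budget)
    selected, used = [], 0.0
    penalty = np.zeros(n)  # penalty[i] = lam * sum of sim(i, j) over selected j
    remaining = set(range(n))
    while remaining:
        best, best_gain = None, -np.inf
        for i in remaining:  # O(N) scan per pick -> O(N * k) overall
            if used + token_costs[i] > budget:
                continue  # chunk would overflow the token budget
            gain = scores[i] - penalty[i]  # marginal gain of adding chunk i
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no remaining chunk fits: budget exhausted
            break
        selected.append(best)
        used += token_costs[best]
        remaining.discard(best)
        penalty += lam * sims[:, best]  # O(N) penalty update per pick
    return selected

# Demo on a synthetic, deliberately redundant pool (5 tight clusters).
rng = np.random.default_rng(0)
centers = rng.normal(size=(5, 64))
emb = centers[rng.integers(0, 5, size=100)] + 0.1 * rng.normal(size=(100, 64))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
sims = emb @ emb.T                        # pairwise cosine similarity
scores = rng.uniform(0.2, 1.0, size=100)  # stand-in retriever scores
costs = rng.uniform(50, 120, size=100)    # stand-in token lengths per chunk
print(adagres_select(scores, sims, costs, budget=650))
```

Note the two O(N) inner loops per pick; together with an initial sort or top‑N retrieval, they match the O(N log N) + O(N · k) cost quoted under Practical Implications.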

Results & Findings

| Dataset | Baseline (top‑k) | AdaGReS | Redundancy ↓ | Token budget usage ↑ | End‑to‑end EM ↑ |
| --- | --- | --- | --- | --- | --- |
| Natural Questions | 10 chunks (≈ 800 tokens) | 8 chunks (≈ 650 tokens) | 32 % fewer duplicate tokens | 19 % tokens saved | +2.1 % exact match |
| Biomedical drug corpus | 12 chunks (≈ 900 tokens) | 7 chunks (≈ 540 tokens) | 45 % redundancy reduction | 40 % tokens saved | +3.4 % exact match |
  • Redundancy control: AdaGReS consistently selects fewer overlapping chunks, freeing up tokens for more diverse information.
  • Answer quality: The modest token savings translate into measurable gains in exact‑match and F1 scores for open‑domain QA.
  • Robustness: Across varying budgets (from 300 to 1200 tokens) and different retrieval models (dense vs. BM25), AdaGReS maintains its advantage, showing that the adaptive λ truly adapts to pool characteristics.

Practical Implications

  • Cost‑effective inference: By squeezing out redundant tokens, developers can stay within tighter token limits (e.g., for API pricing or latency constraints) without sacrificing answer quality.
  • Plug‑and‑play improvement: Existing RAG pipelines (LangChain, LlamaIndex, Haystack, etc.) can integrate AdaGReS with minimal code changes: just replace the top‑k slice with the greedy selector (see the sketch after this list).
  • Domain‑specific gains: In fields with naturally high overlap (legal documents, biomedical literature, product manuals), the redundancy penalty yields larger token savings and clearer, more concise context for the LLM.
  • Better user experience: End‑users see more accurate and less “hallucinated” answers because the model receives higher‑quality, non‑redundant evidence.
  • Scalable to large corpora: The algorithm’s complexity is O(N log N) for sorting plus O(N · k) for marginal gain updates, which is tractable even for thousands of candidates when combined with approximate nearest‑neighbor search.
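
As a rough illustration of the plug‑and‑play claim, the snippet below (reusing `adagres_select` and NumPy from the earlier sketch) shows where the selector would replace a plain top‑k slice in a generic retrieve‑then‑generate loop. The `retriever` and `generator` interfaces are hypothetical placeholders, not the API of any specific framework.

```python
def answer(query, retriever, generator, budget=650):
    """Generic RAG loop; `retriever`/`generator` are hypothetical interfaces."""
    chunks = retriever.search(query, k=100)            # candidate pool
    emb = np.stack([c.embedding for c in chunks])
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T                                 # pairwise cosine similarity
    scores = np.array([c.score for c in chunks])
    costs = np.array([c.num_tokens for c in chunks], dtype=float)
    # Before: context_chunks = chunks[:10]             # plain top-k slice
    picked = adagres_select(scores, sims, costs, budget)
    context = "\n\n".join(chunks[i].text for i in picked)
    return generator.generate(query=query, context=context)
```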

Limitations & Future Work

  • Pairwise similarity cost: Computing all pairwise similarities scales quadratically with the candidate pool; the authors mitigate this with approximate clustering, but very large pools may still be expensive.
  • Embedding dependence: The redundancy penalty relies on the quality of chunk embeddings; poor embeddings could misjudge similarity and either over‑penalize or under‑penalize redundancy.
  • Single‑modal focus: The current formulation assumes textual chunks; extending to multimodal evidence (tables, figures, code snippets) would require new similarity measures.
  • User‑controlled trade‑off: While the adaptive λ removes manual tuning, some applications might want explicit control over relevance vs. diversity—future work could expose a higher‑level “budget aggressiveness” knob.
  • End‑to‑end training: Integrating the selection objective into a differentiable retriever‑generator loop could further improve performance, an avenue the authors suggest for follow‑up research.

Authors

  • Chao Peng
  • Bin Wang
  • Zhilei Long
  • Jinfang Sheng

Paper Information

  • arXiv ID: 2512.25052v1
  • Categories: cs.CL, cs.AI, cs.IR
  • Published: December 31, 2025