[Paper] To Retrieve or To Think? An Agentic Approach for Context Evolution
Source: arXiv - 2601.08747v1
Overview
The paper proposes Agentic Context Evolution (ACE), a new framework that lets a language model decide when to fetch external information and when to keep reasoning with what it already knows. By mimicking human metacognition, ACE cuts down on unnecessary retrieval calls, reduces token usage, and boosts accuracy on multi‑hop question‑answering tasks.
Key Contributions
- Agentic decision‑making: Introduces a central orchestrator that selects between a retriever agent and a reasoner agent via majority voting, rather than retrieving at every generation step.
- Dynamic context evolution: Keeps the prompt context compact by only adding new evidence when the orchestrator deems it beneficial.
- Efficiency gains: Demonstrates up to ~30 % fewer retrieved tokens while improving answer accuracy on benchmark datasets.
- Broad applicability: Shows the approach works across several multi‑hop QA datasets (e.g., HotpotQA, ComplexWebQuestions) without task‑specific tuning.
- Open‑source implementation: Provides code and model checkpoints for reproducibility and easy integration into existing pipelines.
Methodology
Three‑agent architecture
- Orchestrator: A lightweight classifier (often a small LLM) that evaluates the current context and decides the next action.
- Retriever agent: Calls an external knowledge base (e.g., dense passage retrieval) to pull new passages when needed.
- Reasoner agent: Performs chain‑of‑thought style reasoning on the existing context to refine or generate the answer.
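The paper summary does not spell out a code-level structure for these roles; as a rough illustration, they might be factored into small Python interfaces like the sketch below (all class and method names are assumptions, not the released implementation).

```python
# Minimal sketch of the three roles; class and method names are illustrative
# and not taken from the paper's released implementation.
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass
class QAContext:
    question: str
    passages: List[str] = field(default_factory=list)  # evidence gathered so far
    draft_answer: str = ""                              # reasoner's working answer


class RetrieverAgent(Protocol):
    def retrieve(self, query: str, k: int = 5) -> List[str]:
        """Return top-k passages from an external knowledge base (e.g., DPR)."""
        ...


class ReasonerAgent(Protocol):
    def reason(self, ctx: QAContext) -> str:
        """Run chain-of-thought over the current context and return an updated draft answer."""
        ...


class Orchestrator(Protocol):
    def decide(self, ctx: QAContext) -> str:
        """Inspect the context and return either 'retrieve' or 'reason'."""
        ...
```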
Majority‑voting decision loop
- At each step, the orchestrator collects multiple “opinions” (e.g., from different prompt templates) and takes a majority vote to choose retrieve or reason.
- This mimics a metacognitive check: “Do I have enough evidence, or should I look up more?”
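As a rough illustration of how such a vote could be implemented, the snippet below queries a decision model once per prompt template and tallies the replies; `ask_llm` and the template wording are hypothetical stand-ins for the paper's actual prompts.

```python
# Sketch of the majority-vote decision; `ask_llm` and the template wording are
# hypothetical stand-ins for the paper's orchestrator prompts.
from collections import Counter
from typing import Callable, List

DECISION_TEMPLATES = [
    "Question: {q}\nEvidence so far:\n{ev}\nIs more evidence needed? Answer RETRIEVE or REASON.",
    "Given the question '{q}' and the passages below, can it be answered as-is?\n{ev}\nReply RETRIEVE if more lookup is needed, otherwise REASON.",
    "You have {n} passages for: {q}\nDecide the next action: RETRIEVE or REASON.",
]


def majority_vote(question: str, passages: List[str], ask_llm: Callable[[str], str]) -> str:
    """Query the decision LLM once per template and return the majority choice."""
    votes = []
    for tpl in DECISION_TEMPLATES:
        prompt = tpl.format(q=question, ev="\n".join(passages), n=len(passages))
        reply = ask_llm(prompt).strip().upper()
        votes.append("retrieve" if "RETRIEVE" in reply else "reason")
    # An odd number of templates guarantees a strict majority between the two options.
    winner, _ = Counter(votes).most_common(1)[0]
    return winner
```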
Context evolution
- When retrieve is chosen, the new passages are appended, and the orchestrator re‑evaluates.
- When reason is chosen, the reasoner updates the internal answer draft without expanding the token window.
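Continuing the earlier sketch, a single context-evolution step might look like the following; this is an illustrative reconstruction, not the authors' code.

```python
# One context-evolution step, reusing the QAContext and agent interfaces from
# the architecture sketch above; all names remain illustrative.
def evolve_step(ctx: QAContext,
                decision: str,
                retriever: RetrieverAgent,
                reasoner: ReasonerAgent) -> QAContext:
    """Apply one orchestrator decision: grow the evidence or refine the draft answer."""
    if decision == "retrieve":
        # New passages are appended only when the orchestrator judged them useful,
        # which keeps the prompt context compact the rest of the time.
        ctx.passages.extend(retriever.retrieve(ctx.question))
    else:
        # Reasoning refines the draft answer without expanding the token window.
        ctx.draft_answer = reasoner.reason(ctx)
    return ctx
```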
Training & fine‑tuning
- The orchestrator is fine‑tuned on a small labeled set indicating when retrieval helped versus when it was superfluous.
- The retriever and reasoner use off‑the‑shelf pretrained models (e.g., DPR for retrieval, a GPT‑3.5‑style model for reasoning).
The whole loop runs until a stopping criterion (confidence threshold or max steps) is met.
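Putting the pieces together, the overall loop could be sketched as below, with `is_confident` standing in for whatever confidence check the paper uses as its stopping criterion.

```python
# End-to-end sketch of the decide/act loop; `is_confident` is a hypothetical
# stand-in for the paper's confidence-threshold stopping criterion.
from typing import Callable


def answer_question(question: str,
                    orchestrator: Orchestrator,
                    retriever: RetrieverAgent,
                    reasoner: ReasonerAgent,
                    is_confident: Callable[[QAContext], bool],
                    max_steps: int = 8) -> str:
    ctx = QAContext(question=question)
    for _ in range(max_steps):
        decision = orchestrator.decide(ctx)          # retrieve or reason?
        ctx = evolve_step(ctx, decision, retriever, reasoner)
        if ctx.draft_answer and is_confident(ctx):   # stop early once confident
            break
    return ctx.draft_answer
```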
Results & Findings
| Dataset | Baseline (retrieval every step) | ACE (ours) | Token reduction |
|---|---|---|---|
| HotpotQA (full) | 78.4 % EM | 84.1 % EM | ~28 % |
| ComplexWebQuestions | 62.7 % EM | 68.3 % EM | ~31 % |
| TriviaQA (multi‑hop) | 71.5 % EM | 76.9 % EM | ~26 % |
- Accuracy boost: ACE consistently outperforms strong retrieval‑augmented baselines by 4–6 exact‑match points.
- Token efficiency: Because retrieval is invoked only when needed, the total number of tokens processed per question drops by roughly a quarter, translating to lower inference latency and cost.
- Ablation: Removing the majority‑voting orchestrator (i.e., choosing retrieve or reason at random) collapses performance to baseline levels, confirming the importance of strategic decision making.
Practical Implications
- Cost‑effective LLM services: Cloud providers can embed ACE to cut API token bills for knowledge‑intensive applications (e.g., enterprise Q&A, support bots).
- Faster response times: Fewer retrieval calls mean lower latency, which is critical for real‑time assistants.
- Cleaner prompts: By keeping context succinct, developers avoid hitting model context‑length limits, enabling the use of larger LLMs for downstream reasoning.
- Modular integration: ACE’s three‑agent design can be dropped into existing retrieval‑augmented pipelines with minimal code changes—swap the “always‑retrieve” loop for the orchestrator decision step (see the sketch after this list).
- Better user experience: Reduced hallucinations caused by irrelevant retrieved passages, leading to more trustworthy answers in high‑stakes domains (legal, medical, finance).
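To make the “minimal code changes” point concrete, here is a hypothetical before/after of the integration step, reusing the `evolve_step` sketch from the Methodology section; function names are illustrative.

```python
# Hypothetical before/after of the integration point: the loop body is unchanged
# except that retrieval is gated behind the orchestrator's decision.

# Before: a conventional RAG loop body that retrieves on every step.
def rag_step(ctx, retriever, reasoner):
    ctx.passages.extend(retriever.retrieve(ctx.question))
    ctx.draft_answer = reasoner.reason(ctx)
    return ctx

# After: the same step with the "always retrieve" call replaced by the
# orchestrator's retrieve-or-reason decision (see evolve_step above).
def ace_step(ctx, orchestrator, retriever, reasoner):
    return evolve_step(ctx, orchestrator.decide(ctx), retriever, reasoner)
```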
Limitations & Future Work
- Orchestrator reliance on labeled signals: The decision model needs a modest amount of task‑specific supervision; fully unsupervised metacognition remains an open challenge.
- Scalability of voting: Majority voting adds a small overhead; future work could explore more lightweight confidence‑based heuristics.
- Domain adaptation: Experiments focus on open‑domain QA; applying ACE to highly specialized corpora (e.g., scientific literature) may require custom retrievers.
- Explainability: While the orchestrator’s choice is transparent in principle, interpreting why it opted to retrieve versus reason still needs richer introspection tools.
Overall, ACE opens a promising path toward smarter, more economical LLM‑driven reasoning systems that know when to “look up” and when to “think”.
Authors
- Rubing Chen
- Jian Wang
- Wenjie Li
- Xiao‑Yong Wei
- Qing Li
Paper Information
- arXiv ID: 2601.08747v1
- Categories: cs.CL, cs.AI
- Published: January 13, 2026