[Paper] An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration
Source: arXiv - 2605.03989v1
Overview
The paper introduces Experience‑RAG Skill, a plug‑in module that sits between a language‑model‑based agent and a pool of retrievers. Instead of hard‑coding a single retrieval pipeline, the skill dynamically picks the most suitable retrieval strategy for the current task (e.g., factoid QA, multi‑hop reasoning, scientific verification) by consulting an “experience memory.” The authors show that this lightweight orchestration layer can boost retrieval quality across heterogeneous benchmarks while keeping the system modular and reusable.
Key Contributions
- Agent‑oriented retrieval orchestration: A pluggable “skill” that can be attached to any LLM‑agent without redesigning the whole pipeline.
- Experience‑driven strategy selection: Uses a lightweight memory of past interactions to infer which retrieval configuration (single‑pass, multi‑hop, dense vs. sparse, etc.) best fits the current query.
- Unified evidence format: Returns structured evidence (documents, passages, scores) that downstream agents can consume directly, simplifying downstream prompting.
- Strong empirical performance: Achieves an overall nDCG@10 of 0.8924 across three diverse BEIR tasks (NQ, HotpotQA, SciFact), surpassing fixed‑retriever baselines and matching adaptive routing methods.
- Modular design: The skill is decoupled from both the agent logic and the retriever implementations, enabling plug‑and‑play reuse across projects.
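To make the "unified evidence format" concrete, here is a minimal sketch of what a structured evidence object could look like. The class and field names (`EvidenceItem`, `EvidenceBundle`, `doc_id`, `strategy`) are illustrative assumptions, not the schema used in the paper; the summary only states that the evidence carries documents or passages together with scores.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class EvidenceItem:
    """One retrieved passage (hypothetical schema, not from the paper)."""
    doc_id: str
    text: str
    score: float  # relevance score reported by the retriever


@dataclass
class EvidenceBundle:
    """Everything the skill hands back to the agent for one query."""
    query: str
    strategy: str  # name of the retrieval strategy that produced the items
    items: List[EvidenceItem] = field(default_factory=list)

    def top_k(self, k: int) -> List[EvidenceItem]:
        # Highest-scoring items first, truncated to k entries.
        return sorted(self.items, key=lambda e: e.score, reverse=True)[:k]

    def to_json(self) -> str:
        # asdict() recurses into the nested EvidenceItem dataclasses.
        return json.dumps(asdict(self))


bundle = EvidenceBundle(
    query="Who proposed BM25?",
    strategy="bm25",
    items=[
        EvidenceItem("d1", "An unrelated passage.", 0.2),
        EvidenceItem("d2", "A passage about probabilistic ranking.", 0.9),
    ],
)
best = bundle.top_k(1)[0]
```

Because the bundle serializes to plain JSON, a downstream agent could paste `bundle.to_json()` into a prompt or parse it without knowing which retriever produced it.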
Methodology
- Scene Analysis: When the agent receives a user request, the skill first extracts high‑level cues (task type, query length, presence of citations, etc.) from the prompt and any available metadata.
- Experience Memory Lookup: A compact experience store (implemented as a key‑value cache or a small embedding index) holds records of past queries, the strategy used for each, and the resulting performance. The skill retrieves the most similar past cases using cosine similarity on a lightweight query embedding.
- Strategy Recommendation: Based on the nearest experiences, the skill selects a retrieval configuration from a predefined pool (e.g., BM25, dense retriever, multi‑hop chain, hybrid fusion).
- Orchestration & Execution: The chosen retriever(s) run on the document corpus, and the skill aggregates the results into a structured evidence object (IDs, texts, relevance scores).
- Feedback Loop: After the agent finishes its generation, the final answer quality (e.g., correctness, citation match) is fed back to update the experience memory, allowing the system to improve over time.
The whole pipeline is lightweight: the skill adds only a few milliseconds of latency and requires no retraining of the underlying LLM or retrievers.
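The lookup, recommendation, and feedback steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `ExperienceMemory` class, the similarity-times-reward voting rule, the default strategy, and the toy two-dimensional embeddings are all hypothetical. Only the use of cosine similarity over past cases and the feedback update come from the description above.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Experience:
    embedding: List[float]  # lightweight query embedding
    strategy: str           # strategy that was used for this query
    reward: float           # observed quality, e.g. answer correctness in [0, 1]


class ExperienceMemory:
    def __init__(self, default_strategy: str = "bm25") -> None:
        self.records: List[Experience] = []
        self.default_strategy = default_strategy  # assumed cold-start fallback

    def recommend(self, query_emb: Sequence[float], k: int = 3) -> str:
        """Pick a strategy by voting over the k most similar past cases."""
        if not self.records:
            return self.default_strategy
        nearest = sorted(
            ((cosine(query_emb, r.embedding), r) for r in self.records),
            key=lambda pair: pair[0],
            reverse=True,
        )[:k]
        # Assumed voting rule: each neighbor votes with similarity * reward.
        votes = defaultdict(float)
        for sim, rec in nearest:
            votes[rec.strategy] += sim * rec.reward
        return max(votes, key=votes.get)

    def update(self, query_emb: Sequence[float], strategy: str, reward: float) -> None:
        """Feedback loop: store the outcome of the strategy that was used."""
        self.records.append(Experience(list(query_emb), strategy, reward))


memory = ExperienceMemory()
cold_start = memory.recommend([1.0, 0.0])
memory.update([1.0, 0.0], "dense", 0.9)
memory.update([0.0, 1.0], "multi_hop", 0.8)
factoid_like = memory.recommend([0.9, 0.1], k=1)
multi_hop_like = memory.recommend([0.1, 0.9], k=1)
```

A linear scan is enough for a small store; a larger memory would likely need an approximate nearest-neighbor index, which is outside the scope of this sketch.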
Results & Findings
| Benchmark (nDCG@10) | Fixed Retriever (baseline) | Adaptive‑RAG (routing) | Experience‑RAG Skill |
|---|---|---|---|
| BEIR / NQ | 0.84 | 0.88 | 0.89 |
| BEIR / HotpotQA | 0.78 | 0.86 | 0.88 |
| BEIR / SciFact | 0.81 | 0.87 | 0.89 |
| Overall | 0.81 | 0.87 | 0.8924 |
- The skill outperforms the fixed‑retriever baseline on all three benchmarks, which is consistent with the claim that no single pipeline works best for all tasks.
- Performance is on par with, and slightly above, the heavier adaptive routing baseline, while the skill is simpler to integrate.
- Ablation studies show that the experience memory contributes the largest gain; removing it drops nDCG@10 by ~0.04 on average.
- The structured evidence format reduces prompt length for the downstream LLM, leading to modest (≈5%) reductions in generation latency.
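For readers unfamiliar with the metric reported above, the following sketch computes nDCG@k for a single query. It assumes the linear-gain form of DCG (relevance divided by a log2 rank discount); the summary does not say which variant or evaluation toolkit the authors used, so treat this as an illustration of the metric rather than their evaluation code.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (linear gain)."""
    # Position 0 gets discount log2(2) = 1, position 1 gets log2(3), and so on.
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_rels: Sequence[float], judged_rels: Sequence[float], k: int = 10) -> float:
    """nDCG@k for one query.

    ranked_rels: relevance labels of the returned documents, in ranked order.
    judged_rels: relevance labels of all judged documents for the query,
                 used to build the ideal ranking.
    """
    ideal = sorted(judged_rels, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_rels, k) / idcg if idcg > 0 else 0.0


perfect = ndcg_at_k([1, 0, 0], [1])       # relevant document ranked first
second_place = ndcg_at_k([0, 1, 0], [1])  # relevant document ranked second
```

A benchmark-level score is typically the mean of this per-query value over all test queries.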
Practical Implications
- Plug‑and‑play for product teams: Developers can drop the Experience‑RAG Skill into existing LLM‑agent services (chatbots, code assistants, research assistants) without rewriting the retrieval layer.
- Cost efficiency: By selecting the cheapest effective retriever for easy queries and only invoking expensive multi‑hop pipelines when needed, overall API usage and compute bills can be reduced.
- Better user experience: Users receive more accurate citations and evidence, especially in domains like scientific QA or legal assistance where multi‑hop reasoning is essential.
- Rapid prototyping: Teams can experiment with new retriever types (e.g., cross‑encoder re‑rankers) by simply adding them to the pool; the skill will learn when to use them.
- Continuous improvement: The feedback loop enables on‑the‑fly learning from production traffic, making the system adapt to evolving query distributions without full model retraining.
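The cost-efficiency point can be illustrated with a small selection rule: among strategies whose expected quality clears a threshold, pick the cheapest, and fall back to the highest-quality strategy when none qualifies. The function name, the threshold, and the quality and cost numbers below are hypothetical; the paper summary does not specify how cost enters the decision.

```python
from typing import Dict


def cheapest_effective(
    expected_quality: Dict[str, float],
    cost: Dict[str, float],
    min_quality: float = 0.8,
) -> str:
    """Return the cheapest strategy whose expected quality is acceptable."""
    acceptable = [s for s, q in expected_quality.items() if q >= min_quality]
    if not acceptable:
        # Nothing clears the bar: fall back to the best available quality.
        return max(expected_quality, key=expected_quality.get)
    return min(acceptable, key=lambda s: cost[s])


# Illustrative relative costs per query (made-up units).
cost = {"bm25": 1.0, "dense": 5.0, "multi_hop": 20.0}

easy_query = cheapest_effective(
    {"bm25": 0.85, "dense": 0.88, "multi_hop": 0.90}, cost
)
hard_query = cheapest_effective(
    {"bm25": 0.50, "dense": 0.70, "multi_hop": 0.90}, cost
)
```

In a full system the expected-quality estimates would come from the experience memory rather than being hard-coded.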
Limitations & Future Work
- Experience memory size: The current implementation stores a fixed‑size cache; scaling to millions of interactions may require more sophisticated summarization or hierarchical indexing.
- Task cue extraction: The scene analysis relies on heuristic features; ambiguous queries can lead to sub‑optimal strategy choices.
- Retriever pool dependency: The skill cannot invent new retrieval methods; its performance is bounded by the quality of the underlying retrievers.
- Evaluation scope: Benchmarks focus on English QA; extending to multilingual or multimodal retrieval (images, code) remains an open question.
Future research directions include learning a meta‑policy with reinforcement learning, integrating richer context (user history, domain constraints), and exploring decentralized experience memories for privacy‑preserving deployments.
Authors
- Dutao Zhang
- Tian Liao
Paper Information
- arXiv ID: 2605.03989v1
- Categories: cs.AI
- Published: May 5, 2026