[Paper] MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations
Source: arXiv - 2602.23184v1
Overview
The paper introduces MTRAG‑UN, a new benchmark designed to stress‑test multi‑turn Retrieval‑Augmented Generation (RAG) systems, which pair large language models (LLMs) with external knowledge sources. By assembling 666 tasks (over 2,800 dialogue turns) across six real‑world domains, the authors expose four "UN‑" failure modes that current RAG pipelines still struggle with: UNanswerable, UNderspecified, and NONstandalone queries, plus UNclear responses.
Key Contributions
- Comprehensive benchmark: 666 multi‑turn conversational tasks (≈2.8 k turns) covering six diverse domains (e.g., finance, healthcare, tech support).
- Explicit “UN‑” taxonomy: Formal definition and annotation of four open challenges—UNanswerable, UNderspecified, NONstandalone, and UNclear—that go beyond classic retrieval or generation errors.
- Curated corpora: For each domain, the authors provide the underlying document collections that RAG systems should retrieve from, enabling reproducible end‑to‑end experiments.
- Baseline evaluation: Systematic testing of several state‑of‑the‑art retrieval methods (e.g., BM25, dense retrievers) and generation models (e.g., GPT‑3.5, LLaMA‑2) on the benchmark, revealing consistent performance gaps.
- Open‑source release: Full dataset, evaluation scripts, and baseline checkpoints are publicly available on GitHub, encouraging community contributions.
Methodology
- Task design: Conversational scenarios were crafted by domain experts, then split into turns where the user asks a question and the system must retrieve relevant passages and generate an answer.
- UN‑labeling: Each turn was manually annotated with one (or more) of the four "UN‑" categories (the first three characterize the user query; UNclear characterizes the response):
  - UNanswerable – no supporting evidence exists in the provided corpus.
  - UNderspecified – the question lacks enough detail for a precise answer.
  - NONstandalone – the query depends on prior context that is missing or ambiguous.
  - UNclear – the system's generated response is vague, contradictory, or otherwise unintelligible.
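The taxonomy above can be sketched as a small data model. This is an illustrative sketch only; the `UNLabel` and `Turn` names are hypothetical and not part of the released dataset schema:

```python
from dataclasses import dataclass, field
from enum import Enum

class UNLabel(Enum):
    """The four 'UN-' failure categories defined by the benchmark."""
    UNANSWERABLE = "unanswerable"      # no supporting evidence in the corpus
    UNDERSPECIFIED = "underspecified"  # question lacks detail for a precise answer
    NONSTANDALONE = "nonstandalone"    # query depends on missing/ambiguous prior context
    UNCLEAR = "unclear"                # generated response is vague or contradictory

@dataclass
class Turn:
    """One turn in a multi-turn conversation, with its annotations."""
    question: str
    labels: set[UNLabel] = field(default_factory=set)

    @property
    def is_clean(self) -> bool:
        # A "clean" turn carries none of the four UN- annotations.
        return not self.labels

turn = Turn("What was the revenue that year?", {UNLabel.NONSTANDALONE})
print(turn.is_clean)  # False: the turn depends on unresolved prior context
```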
- Retrieval‑generation pipeline: Baseline experiments follow the typical RAG flow: (a) retrieve top‑k passages using either sparse (BM25) or dense (e.g., DPR) methods, (b) feed the retrieved text plus the dialogue history into a generative LLM, (c) post‑process the output.
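The three-stage flow (a)–(c) can be sketched end to end. The toy corpus, the overlap-based `retrieve`, and the stub `generate` below are hypothetical stand-ins for a real sparse/dense retriever (BM25, DPR) and a generative LLM:

```python
# Minimal sketch of the baseline RAG flow; every name here is illustrative.
CORPUS = [
    "ACME reported revenue of $10M in fiscal 2023.",
    "The support portal resets passwords via email link.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    """(a) Score passages by token overlap with the query; return top-k."""
    q = set(query.lower().split())
    ranked = sorted(CORPUS, key=lambda p: len(q & set(p.lower().split())), reverse=True)
    return ranked[:k]

def generate(history: list[str], passages: list[str]) -> str:
    """(b) Stub for the generative LLM; a real model would also condition on `history`."""
    return f"Based on the retrieved context: {' '.join(passages)}"

def rag_turn(history: list[str], question: str) -> str:
    passages = retrieve(question)
    answer = generate(history + [question], passages)
    return answer.strip()  # (c) trivial post-processing

print(rag_turn([], "What was ACME revenue in 2023?"))
```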
- Evaluation metrics: Standard QA metrics (Exact Match, F1) are combined with custom “UN‑score” measures that penalize failures specific to each category, providing a more nuanced view of system robustness.
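The standard QA metrics can be sketched as follows; this is a minimal SQuAD-style implementation of Exact Match and token-level F1, and it does not reproduce the paper's custom "UN-score" measures:

```python
import re
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1(pred: str, gold: str) -> float:
    """Token-level F1 between prediction and gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(round(f1("tower in Paris", "eiffel tower"), 2))   # 0.4
```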
Results & Findings
- Overall drop in QA scores when UN‑type turns are present: Exact Match fell from ~45% on "clean" turns to ~22% on UNanswerable ones.
- Retrieval bottleneck: Dense retrievers performed slightly better on UNderspecified queries (by retrieving broader context) but still missed many relevant documents, indicating that retrieval alone cannot resolve underspecification.
- Generation weakness: Even when the correct passage was retrieved, LLMs often produced UNclear responses—e.g., hedging language (“I’m not sure”) or hallucinated details.
- Cross‑domain consistency: The difficulty patterns held across all six domains, suggesting that the UN‑issues are fundamental to multi‑turn RAG rather than domain‑specific quirks.
Practical Implications
- Product developers building chat‑based assistants (customer support bots, internal knowledge bases, etc.) should anticipate and explicitly handle UN‑type queries—e.g., by detecting when a question is unanswerable and gracefully deferring to a human.
- Prompt engineering: Adding clarification prompts (“Could you specify the time period?”) can mitigate UNderspecified and NONstandalone failures, improving user experience without retraining the model.
- Retrieval layer upgrades: Investing in hybrid retrieval (combining sparse and dense methods) and relevance feedback loops can reduce UNanswerable cases by expanding the searchable corpus dynamically.
- Evaluation pipelines: Incorporating the MTRAG‑UN benchmark (or its scoring scripts) into CI/CD for conversational AI ensures that new model releases are vetted against these realistic failure modes before deployment.
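As one concrete instance of the hybrid-retrieval upgrade mentioned above, reciprocal rank fusion (RRF) merges a sparse and a dense ranking without having to calibrate their score scales. The document IDs and the constant `k = 60` below are illustrative:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["doc_a", "doc_b", "doc_c"]  # e.g. a BM25 ranking
dense = ["doc_c", "doc_a", "doc_d"]   # e.g. a dense-retriever ranking
print(reciprocal_rank_fusion([sparse, dense]))  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents ranked highly by both retrievers (here `doc_a` and `doc_c`) float to the top, while documents seen by only one list are kept rather than discarded.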
Limitations & Future Work
- Scale of domains: While six domains provide breadth, they still omit highly regulated sectors (e.g., legal, aviation) where UN‑type challenges may be more severe.
- Human annotation cost: The UN‑labeling process required expert annotators; scaling to larger corpora may need semi‑automated labeling or active‑learning approaches.
- Model diversity: Experiments focused on a handful of open‑source and commercial LLMs; future work could explore newer instruction‑tuned models or multimodal retrievers.
- Dynamic knowledge: The benchmark uses static corpora; extending it to streaming or time‑sensitive data (news feeds, logs) would test RAG systems’ ability to handle evolving information.
The MTRAG‑UN benchmark opens a concrete pathway for the community to diagnose and close the gap between impressive LLM capabilities and the messy reality of multi‑turn, knowledge‑driven conversations.
Authors
- Sara Rosenthal
- Yannis Katsis
- Vraj Shah
- Lihong He
- Lucian Popa
- Marina Danilevsky
Paper Information
- arXiv ID: 2602.23184v1
- Categories: cs.CL
- Published: February 26, 2026