[Paper] ClarifyMT-Bench: Benchmarking and Improving Multi-Turn Clarification for Conversational Large Language Models

Published: December 24, 2025, 06:39 AM EST
4 min read

Source: arXiv - 2512.21120v1

Overview

The paper introduces ClarifyMT‑Bench, a new benchmark that evaluates how well conversational large language models (LLMs) handle multi‑turn clarification when users give incomplete or ambiguous inputs. Combining a systematic ambiguity taxonomy with realistic user personas, the authors reveal a pervasive “under‑clarification” bias in current models and propose ClarifyAgent, a modular agent designed to make LLMs ask the right follow‑up questions before answering.

Key Contributions

  • A five‑dimensional ambiguity taxonomy (semantic, contextual, intent, knowledge, and procedural) that captures the main ways user utterances can be unclear.
  • Six simulated user personas (e.g., impatient, cooperative, evasive) that generate diverse dialogue flows, enabling stress‑testing of LLM behavior.
  • ClarifyMT‑Bench dataset: 6,120 multi‑turn dialogues created via a hybrid LLM‑human pipeline, each annotated with the underlying ambiguity source and the optimal clarification strategy (an illustrative record schema is sketched after this list).
  • Comprehensive evaluation of ten popular LLMs (including GPT‑4, Claude, and Llama 2), uncovering a consistent tendency to answer too early and a performance drop as conversation depth grows.
  • ClarifyAgent: an agentic framework that decomposes clarification into four stages—perception, forecasting, tracking, and planning—yielding substantial gains across all ambiguity dimensions.
  • Open‑source release of the benchmark, evaluation scripts, and the ClarifyAgent codebase to foster reproducibility and further research.
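
To make the taxonomy and the dataset annotations concrete, below is a minimal Python sketch of what a single benchmark record might look like. The class and field names (AmbiguityType, DialogueRecord, clarification_turn, and so on) are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class AmbiguityType(str, Enum):
    """The five ambiguity dimensions described in the paper."""
    SEMANTIC = "semantic"
    CONTEXTUAL = "contextual"
    INTENT = "intent"
    KNOWLEDGE = "knowledge"
    PROCEDURAL = "procedural"


@dataclass
class Turn:
    role: str   # "user" or "assistant"
    text: str


@dataclass
class DialogueRecord:
    """One annotated ClarifyMT-Bench dialogue (hypothetical field names)."""
    dialogue_id: str
    persona: str                              # e.g. "impatient", "cooperative", "evasive"
    turns: list[Turn] = field(default_factory=list)
    ambiguity_types: list[AmbiguityType] = field(default_factory=list)
    clarification_turn: int = -1              # turn index where clarification should occur
    reference_question: str = ""              # gold clarifying question
    reference_answer: str = ""                # gold answer once the ambiguity is resolved
```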

Methodology

  1. Ambiguity Taxonomy Design – The authors surveyed prior work and real‑world chat logs to define five orthogonal ambiguity axes.
  2. Persona‑Driven Dialogue Generation – Six user personas were scripted with distinct interaction styles. An LLM (GPT‑4) generated initial user turns, which were then refined by human annotators to ensure realism.
  3. Hybrid LLM‑Human Pipeline – Human reviewers validated the LLM‑produced clarifying questions and answers, guaranteeing that each dialogue contains a clear “optimal” clarification point.
  4. Benchmark Construction – Each dialogue is labeled with: (a) the ambiguity type(s), (b) the turn at which clarification should occur, and (c) a reference clarification question/answer pair.
  5. Evaluation Protocol – Models are prompted to continue each conversation; metrics include Clarification Accuracy (did the model ask the right question?), Premature Answer Rate (how often the model answered before resolving the ambiguity), and Dialogue Success (final answer correctness). A minimal sketch of how these metrics could be computed follows this list.
  6. ClarifyAgent Architecture – The agent first perceives the user utterance (detects ambiguity), forecasts possible user intents, tracks the dialogue state across turns, and finally plans an optimal clarification move (question or answer). Each module is implemented as a lightweight fine‑tuned transformer that can be swapped into existing LLM pipelines.
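
Reusing the hypothetical DialogueRecord fields from the earlier sketch, here is a minimal example of how the three metrics in step 5 could be computed. The Prediction format and the use of turn matching as a proxy for “asked the right question” are simplifying assumptions, not the paper’s evaluation harness; a real harness would likely also judge the clarifying question against the reference question directly.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Model behaviour on one dialogue (hypothetical format)."""
    asked_clarification: bool   # did the model ask any follow-up question?
    clarified_at_turn: int      # turn index of its first clarifying question (-1 if none)
    final_answer_correct: bool  # judged against the reference answer


def evaluate(records, predictions):
    """Compute Premature Answer Rate, Clarification Accuracy, and Dialogue Success."""
    n = len(records)
    # Premature answer: clarification was required but the model never asked.
    premature = sum(
        1 for r, p in zip(records, predictions)
        if r.clarification_turn >= 0 and not p.asked_clarification
    )
    # Clarification accuracy (proxy): the model asked at the annotated turn.
    accurate = sum(
        1 for r, p in zip(records, predictions)
        if p.asked_clarification and p.clarified_at_turn == r.clarification_turn
    )
    success = sum(1 for p in predictions if p.final_answer_correct)
    return {
        "premature_answer_rate": premature / n,
        "clarification_accuracy": accurate / n,
        "dialogue_success": success / n,
    }
```

Calling evaluate(benchmark_records, model_predictions) returns the three rates that correspond to the columns in the results table below.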

Results & Findings

| Model | Premature Answer Rate | Clarification Accuracy | Dialogue Success |
| --- | --- | --- | --- |
| GPT‑4 (baseline) | 38% | 54% | 61% |
| Claude‑2 | 42% | 49% | 58% |
| Llama 2‑13B | 61% | 31% | 44% |
| ClarifyAgent + GPT‑4 | 12% | 84% | 89% |

  • Under‑clarification bias: All ten models answered too early in >30% of cases, with the bias worsening as the conversation deepened beyond three turns.
  • Ambiguity sensitivity: Semantic and intent ambiguities caused the highest premature‑answer rates, while procedural ambiguities were easier for models to detect.
  • ClarifyAgent impact: By explicitly separating perception and planning, the agent reduced premature answers by up to 75% and lifted overall success close to human‑level performance on the benchmark.

Practical Implications

  • Better Customer‑Support Bots – Deploying ClarifyAgent‑style pipelines can prevent bots from guessing user intent, reducing mis‑routing and costly escalations.
  • Developer Tooling – IDE assistants (e.g., code generation chatbots) can use the taxonomy to flag ambiguous prompts (“What do you mean by ‘optimize this function’?”) before emitting potentially harmful code.
  • Product Design – The persona framework helps product teams simulate edge‑case user behaviors (impatient or evasive users) during QA, leading to more robust conversational UX.
  • Compliance & Safety – Early clarification reduces the risk of LLMs providing incorrect or unsafe answers in regulated domains (finance, healthcare).
  • Plug‑and‑Play Integration – ClarifyAgent’s modular design means existing LLM services can be wrapped with a lightweight clarification layer without retraining the base model.
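
As a rough illustration of that plug‑and‑play idea, the sketch below wraps an arbitrary text‑in/text‑out LLM callable with a four‑stage clarification layer (perceive, forecast, track, plan) in the spirit of ClarifyAgent. The class name, method names, and stage prompts are assumptions for illustration; the actual modules are lightweight fine‑tuned transformers, as described in the methodology.

```python
from typing import Callable


class ClarificationLayer:
    """Illustrative clarification wrapper around an existing LLM client."""

    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm              # any text-in/text-out LLM callable
        self.state: list[str] = []  # tracked dialogue history (stage 3: tracking)

    def perceive(self, user_msg: str) -> bool:
        """Stage 1: detect whether the new utterance is ambiguous."""
        verdict = self.llm(
            "Is the following request ambiguous? Answer yes or no.\n" + user_msg
        )
        return verdict.strip().lower().startswith("yes")

    def forecast(self, user_msg: str) -> str:
        """Stage 2: enumerate plausible user intents."""
        return self.llm("List the most likely intents behind this request:\n" + user_msg)

    def plan(self, user_msg: str, intents: str) -> str:
        """Stage 4: choose one clarifying question that best separates the intents."""
        return self.llm(
            f"User said: {user_msg}\nPossible intents:\n{intents}\n"
            "Ask one short clarifying question."
        )

    def respond(self, user_msg: str) -> str:
        # Stage 3 (tracking): keep the running dialogue state across turns.
        self.state.append(f"user: {user_msg}")
        if self.perceive(user_msg):
            question = self.plan(user_msg, self.forecast(user_msg))
            self.state.append(f"assistant: {question}")
            return question
        answer = self.llm("\n".join(self.state) + "\nassistant:")
        self.state.append(f"assistant: {answer}")
        return answer
```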

Limitations & Future Work

  • Synthetic Users – Although human‑validated, the dialogues still rely on simulated personas; real‑world user studies are needed to confirm external validity.
  • Scalability of Modules – The four‑stage agent adds inference latency; future work should explore joint‑training or distillation to keep response times low.
  • Ambiguity Coverage – The five‑dimensional taxonomy may miss domain‑specific ambiguities (e.g., legal jargon); extending the taxonomy with community contributions is an open avenue.
  • Cross‑Language Evaluation – Current benchmark is English‑only; adapting ClarifyMT‑Bench to multilingual settings will be crucial for global deployments.

ClarifyMT‑Bench offers a concrete, reproducible yardstick for the next generation of conversational AI—one that knows when to ask before it answers. Developers eager to build more reliable chat assistants now have both a diagnostic tool and a promising remedy in ClarifyAgent.

Authors

  • Sichun Luo
  • Yi Huang
  • Mukai Li
  • Shichang Meng
  • Fengyuan Liu
  • Zefa Hu
  • Junlan Feng
  • Qi Liu

Paper Information

  • arXiv ID: 2512.21120v1
  • Categories: cs.CL, cs.IR
  • Published: December 24, 2025