[Paper] ClarifyMT-Bench: Benchmarking and Improving Multi-Turn Clarification for Conversational Large Language Models

Published: December 24, 2025, 06:39 AM EST
4 min read

Source: arXiv - 2512.21120v1

Overview

The paper introduces ClarifyMT‑Bench, a new benchmark that evaluates how well conversational large language models (LLMs) handle multi‑turn clarification when users give incomplete or ambiguous inputs. Combining a systematic ambiguity taxonomy with realistic user personas, the authors reveal a pervasive “under‑clarification” bias in current models and propose ClarifyAgent, a modular agent designed to make LLMs ask the right follow‑up questions before answering.

Key Contributions

  • A five‑dimensional ambiguity taxonomy (semantic, contextual, intent, knowledge, and procedural) that captures the main ways user utterances can be unclear.
  • Six simulated user personas (e.g., impatient, cooperative, evasive) that generate diverse dialogue flows, enabling stress‑testing of LLM behavior.
  • ClarifyMT‑Bench dataset: 6,120 multi‑turn dialogues created via a hybrid LLM‑human pipeline, each annotated with the underlying ambiguity source and the optimal clarification strategy (an illustrative record schema is sketched after this list).
  • Comprehensive evaluation of ten popular LLMs (including GPT‑4, Claude, and Llama 2), uncovering a consistent tendency to answer too early and a performance drop as conversation depth grows.
  • ClarifyAgent: an agentic framework that decomposes clarification into four stages—perception, forecasting, tracking, and planning—yielding substantial gains across all ambiguity dimensions.
  • Open‑source release of the benchmark, evaluation scripts, and the ClarifyAgent codebase to foster reproducibility and further research.
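
To make the taxonomy and the dataset annotations concrete, below is a minimal Python sketch of what a single benchmark record might look like. The class and field names (AmbiguityType, DialogueRecord, clarification_turn, and so on) are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class AmbiguityType(str, Enum):
    """The five ambiguity dimensions described in the paper."""
    SEMANTIC = "semantic"
    CONTEXTUAL = "contextual"
    INTENT = "intent"
    KNOWLEDGE = "knowledge"
    PROCEDURAL = "procedural"


@dataclass
class Turn:
    role: str   # "user" or "assistant"
    text: str


@dataclass
class DialogueRecord:
    """One annotated ClarifyMT-Bench dialogue (hypothetical field names)."""
    dialogue_id: str
    persona: str                              # e.g. "impatient", "cooperative", "evasive"
    turns: list[Turn] = field(default_factory=list)
    ambiguity_types: list[AmbiguityType] = field(default_factory=list)
    clarification_turn: int = -1              # turn index where clarification should occur
    reference_question: str = ""              # gold clarifying question
    reference_answer: str = ""                # gold answer once the ambiguity is resolved
```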

Methodology

  1. Ambiguity Taxonomy Design – The authors surveyed prior work and real‑world chat logs to define five orthogonal ambiguity axes.
  2. Persona‑Driven Dialogue Generation – Six user personas were scripted with distinct interaction styles. An LLM (GPT‑4) generated initial user turns, which were then refined by human annotators to ensure realism.
  3. Hybrid LLM‑Human Pipeline – Human reviewers validated the LLM‑produced clarifying questions and answers, guaranteeing that each dialogue contains a clear “optimal” clarification point.
  4. Benchmark Construction – Each dialogue is labeled with: (a) the ambiguity type(s), (b) the turn at which clarification should occur, and (c) a reference clarification question/answer pair.
  5. Evaluation Protocol – Models are prompted to continue each conversation; metrics include Clarification Accuracy (did the model ask the right question?), Premature Answer Rate (how often the model answered before resolving the ambiguity), and Dialogue Success (final answer correctness). A minimal sketch of how these metrics could be computed follows this list.
  6. ClarifyAgent Architecture – The agent first perceives the user utterance (detects ambiguity), forecasts possible user intents, tracks the dialogue state across turns, and finally plans an optimal clarification move (question or answer). Each module is implemented as a lightweight fine‑tuned transformer that can be swapped into existing LLM pipelines.
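
Reusing the hypothetical DialogueRecord fields from the earlier sketch, here is a minimal example of how the three metrics in step 5 could be computed. The Prediction format and the use of turn matching as a proxy for “asked the right question” are simplifying assumptions, not the paper’s evaluation harness; a real harness would likely also judge the clarifying question against the reference question directly.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Model behaviour on one dialogue (hypothetical format)."""
    asked_clarification: bool   # did the model ask any follow-up question?
    clarified_at_turn: int      # turn index of its first clarifying question (-1 if none)
    final_answer_correct: bool  # judged against the reference answer


def evaluate(records, predictions):
    """Compute Premature Answer Rate, Clarification Accuracy, and Dialogue Success."""
    n = len(records)
    # Premature answer: clarification was required but the model never asked.
    premature = sum(
        1 for r, p in zip(records, predictions)
        if r.clarification_turn >= 0 and not p.asked_clarification
    )
    # Clarification accuracy (proxy): the model asked at the annotated turn.
    accurate = sum(
        1 for r, p in zip(records, predictions)
        if p.asked_clarification and p.clarified_at_turn == r.clarification_turn
    )
    success = sum(1 for p in predictions if p.final_answer_correct)
    return {
        "premature_answer_rate": premature / n,
        "clarification_accuracy": accurate / n,
        "dialogue_success": success / n,
    }
```

Calling evaluate(benchmark_records, model_predictions) returns the three rates that correspond to the columns in the results table below.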

Results & Findings

| Model | Premature Answer Rate | Clarification Accuracy | Dialogue Success |
| --- | --- | --- | --- |
| GPT‑4 (baseline) | 38% | 54% | 61% |
| Claude‑2 | 42% | 49% | 58% |
| Llama 2‑13B | 61% | 31% | 44% |
| ClarifyAgent + GPT‑4 | 12% | 84% | 89% |

  • Under‑clarification bias: All ten models answered too early in >30% of cases, with the bias worsening as the conversation deepened beyond three turns.
  • Ambiguity sensitivity: Semantic and intent ambiguities caused the highest premature‑answer rates, while procedural ambiguities were easier for models to detect.
  • ClarifyAgent impact: By explicitly separating perception and planning, the agent reduced premature answers by up to 75% and lifted overall success close to human‑level performance on the benchmark.

Practical Implications

  • Better Customer‑Support Bots – Deploying ClarifyAgent‑style pipelines can prevent bots from guessing user intent, reducing mis‑routing and costly escalations.
  • Developer Tooling – IDE assistants (e.g., code generation chatbots) can use the taxonomy to flag ambiguous prompts (“What do you mean by ‘optimize this function’?”) before emitting potentially harmful code.
  • Product Design – The persona framework helps product teams simulate edge‑case user behaviors (impatient or evasive users) during QA, leading to more robust conversational UX.
  • Compliance & Safety – Early clarification reduces the risk of LLMs providing incorrect or unsafe answers in regulated domains (finance, healthcare).
  • Plug‑and‑Play Integration – ClarifyAgent’s modular design means existing LLM services can be wrapped with a lightweight clarification layer without retraining the base model.
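
As a rough illustration of that plug‑and‑play idea, the sketch below wraps an arbitrary text‑in/text‑out LLM callable with a four‑stage clarification layer (perceive, forecast, track, plan) in the spirit of ClarifyAgent. The class name, method names, and stage prompts are assumptions for illustration; the actual modules are lightweight fine‑tuned transformers, as described in the methodology.

```python
from typing import Callable


class ClarificationLayer:
    """Illustrative clarification wrapper around an existing LLM client."""

    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm              # any text-in/text-out LLM callable
        self.state: list[str] = []  # tracked dialogue history (stage 3: tracking)

    def perceive(self, user_msg: str) -> bool:
        """Stage 1: detect whether the new utterance is ambiguous."""
        verdict = self.llm(
            "Is the following request ambiguous? Answer yes or no.\n" + user_msg
        )
        return verdict.strip().lower().startswith("yes")

    def forecast(self, user_msg: str) -> str:
        """Stage 2: enumerate plausible user intents."""
        return self.llm("List the most likely intents behind this request:\n" + user_msg)

    def plan(self, user_msg: str, intents: str) -> str:
        """Stage 4: choose one clarifying question that best separates the intents."""
        return self.llm(
            f"User said: {user_msg}\nPossible intents:\n{intents}\n"
            "Ask one short clarifying question."
        )

    def respond(self, user_msg: str) -> str:
        # Stage 3 (tracking): keep the running dialogue state across turns.
        self.state.append(f"user: {user_msg}")
        if self.perceive(user_msg):
            question = self.plan(user_msg, self.forecast(user_msg))
            self.state.append(f"assistant: {question}")
            return question
        answer = self.llm("\n".join(self.state) + "\nassistant:")
        self.state.append(f"assistant: {answer}")
        return answer
```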

Limitations & Future Work

  • Synthetic Users – Although human‑validated, the dialogues still rely on simulated personas; real‑world user studies are needed to confirm external validity.
  • Scalability of Modules – The four‑stage agent adds inference latency; future work should explore joint‑training or distillation to keep response times low.
  • Ambiguity Coverage – The five‑dimensional taxonomy may miss domain‑specific ambiguities (e.g., legal jargon); extending the taxonomy with community contributions is an open avenue.
  • Cross‑Language Evaluation – Current benchmark is English‑only; adapting ClarifyMT‑Bench to multilingual settings will be crucial for global deployments.

ClarifyMT‑Bench offers a concrete, reproducible yardstick for the next generation of conversational AI—one that knows when to ask before it answers. Developers eager to build more reliable chat assistants now have both a diagnostic tool and a promising remedy in ClarifyAgent.

Authors

  • Sichun Luo
  • Yi Huang
  • Mukai Li
  • Shichang Meng
  • Fengyuan Liu
  • Zefa Hu
  • Junlan Feng
  • Qi Liu

Paper Information

  • arXiv ID: 2512.21120v1
  • Categories: cs.CL, cs.IR
  • Published: December 24, 2025