[Paper] Learning Steerable Clarification Policies with Collaborative Self-play
Source: arXiv - 2512.04068v1
Overview
The paper tackles a core problem for AI assistants: deciding when to answer, when to list multiple possibilities, and when to ask a clarifying question for ambiguous user inputs. By framing this decision‑making as a steerable policy that can be tuned with simple cost parameters (e.g., “how expensive is it to ask a follow‑up?”), the authors show how an assistant can adapt its behavior to different devices, user preferences, or interaction modalities.
Key Contributions
- Steerable clarification policy – Introduces a model that takes explicit numerical costs for each possible action (guess, enumerate, ask) and learns to trade off accuracy against those costs (a toy illustration follows this list).
- Collaborative self‑play framework – Uses two agents (a simulated user and a simulated assistant) that converse with each other, generating rich training data without human annotation.
- Reinforced Self‑Training (ReST) – Adapts a training loop that combines reinforcement learning (to maximize cost‑penalized accuracy) with self‑training (to bootstrap from the model's own high‑reward trajectories) to the self‑play clarification setting.
- Generalization to unseen cost settings – Demonstrates that the learned policy can adapt to cost values never seen during training, enabling on‑the‑fly steering.
- Empirical validation – Shows measurable gains in reward and downstream accuracy compared to static baselines across several benchmark datasets.
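To make the cost‑conditioned trade‑off concrete, here is a toy Python sketch. The three actions and the idea of explicit numeric costs come from the paper summary; the CostVector class, the pick_action rule, and every number below are illustrative assumptions, not the authors' learned policy (which is a language model rather than an explicit utility calculation).

```python
# Toy illustration (not the paper's learned policy) of trading off expected
# accuracy against explicit per-action costs. The three actions and the idea
# of numeric costs come from the summary; the numbers are made up.
from dataclasses import dataclass


@dataclass
class CostVector:
    guess: float
    enumerate_: float   # trailing underscore avoids shadowing the builtin
    clarify: float


def pick_action(expected_accuracy: dict, costs: CostVector) -> str:
    """Choose the action whose expected accuracy minus cost is highest."""
    utilities = {
        "guess": expected_accuracy["guess"] - costs.guess,
        "enumerate": expected_accuracy["enumerate"] - costs.enumerate_,
        "clarify": expected_accuracy["clarify"] - costs.clarify,
    }
    return max(utilities, key=utilities.get)


estimates = {"guess": 0.55, "enumerate": 0.70, "clarify": 0.90}

# High clarification cost (e.g., a voice-only device): the policy avoids asking.
print(pick_action(estimates, CostVector(guess=0.0, enumerate_=0.1, clarify=0.5)))   # -> enumerate

# Cheap clarification (e.g., desktop chat): asking becomes worthwhile.
print(pick_action(estimates, CostVector(guess=0.0, enumerate_=0.1, clarify=0.05)))  # -> clarify
```

The point of the toy is only that the same accuracy estimates lead to different actions once the cost vector changes, which is the behavior the learned policy is trained to exhibit.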
Methodology
- Two‑agent self‑play
  - User agent: Generates an ambiguous query and a hidden “true intent”.
  - Assistant agent: Receives the query plus a vector of costs (e.g., cost_guess, cost_enumerate, cost_clarify) and decides which action to take at each turn.
- Action space – The assistant can:
  - Guess the intent and answer directly.
  - Enumerate a set of plausible intents and answer each.
  - Ask a clarification question (costly, but it may improve downstream accuracy).
- Reward signal – After the conversation ends, the assistant receives reward = accuracy − Σ(action costs). This encourages the model to be accurate while keeping interaction overhead low.
- Reinforced Self‑Training (ReST)
  - Reinforcement step: Policy‑gradient updates maximize the expected reward on self‑generated dialogues.
  - Self‑training step: The assistant’s own high‑reward trajectories are used as pseudo‑labels to further fine‑tune the underlying language model, stabilizing training.
- Steering mechanism – By feeding different cost vectors at inference time, developers can “steer” the assistant to be more conservative (ask more clarifications) or more aggressive (guess more often) without retraining. A sketch of the reward and the training loop follows this list.
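The following is a minimal sketch, under stated assumptions, of the reward signal and the self‑training half of a ReST‑style round. Only the reward formula (accuracy minus summed action costs) and the idea of sampling cost vectors during self‑play come from the summary; generate_dialogue, finetune_on, and all hyperparameters are hypothetical placeholders, and the policy‑gradient step is omitted.

```python
# Hedged sketch of the cost-penalized reward and a ReST-style round.
# `generate_dialogue` and `finetune_on` are hypothetical stand-ins; only the
# reward formula (accuracy minus summed action costs) comes from the summary.
import random
from typing import Callable, Dict, List, Tuple

ACTIONS = ("guess", "enumerate", "clarify")


def episode_reward(accuracy: float, actions_taken: List[str], costs: Dict[str, float]) -> float:
    """reward = accuracy - sum of the costs of the actions taken in the dialogue."""
    return accuracy - sum(costs[a] for a in actions_taken)


def rest_style_round(
    policy,                                   # current assistant model (opaque here)
    generate_dialogue: Callable[..., Tuple[List[str], float]],  # self-play rollout -> (actions, accuracy)
    finetune_on: Callable,                    # supervised fine-tuning on selected trajectories
    cost_range=(0.0, 0.5),
    num_dialogues: int = 1000,
    keep_fraction: float = 0.2,
):
    """One grow-then-improve round: sample cost vectors, roll out self-play
    dialogues, keep the highest-reward trajectories, and fine-tune on them."""
    trajectories = []
    for _ in range(num_dialogues):
        # Sample a fresh cost vector so the policy sees many trade-off settings.
        costs = {a: random.uniform(*cost_range) for a in ACTIONS}
        actions_taken, accuracy = generate_dialogue(policy, costs)
        reward = episode_reward(accuracy, actions_taken, costs)
        trajectories.append((reward, costs, actions_taken))

    # Self-training step: treat the top-reward rollouts as pseudo-labels.
    trajectories.sort(key=lambda t: t[0], reverse=True)
    selected = trajectories[: int(keep_fraction * num_dialogues)]
    return finetune_on(policy, selected)
```

Because the assistant conditions on the sampled cost vector rather than on a single training‑time constant, the same mechanism supports steering: supplying a different vector at inference time shifts behavior without retraining.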
Results & Findings
| Metric | Static Baseline | ReST‑trained Steerable Policy |
|---|---|---|
| Cost‑penalized accuracy (reward) | 0.62 | 0.71 (+14.5%) |
| Pure accuracy (ignoring cost) | 0.78 | 0.81 (+3.8%) |
| Average number of clarification turns | 0.0 (always guess) | 0.4 (adjustable) |
| Generalization to unseen cost vectors | 0.55 | 0.68 |
- The model reliably shifts its behavior when the cost of clarification is increased (fewer questions) or decreased (more questions).
- Even when the policy is given cost values outside the training distribution, its performance degrades gracefully, confirming the policy’s robustness.
- Human‑in‑the‑loop evaluations (small user study) reported higher satisfaction for the steerable assistant because it respected device constraints (e.g., fewer clarifications on voice‑only devices).
Practical Implications
- Device‑aware assistants – Deploy the same model on a smartwatch (high clarification cost) and a desktop (low cost) simply by swapping the cost vector at runtime (a hedged sketch follows this list).
- User‑personalized interaction – Let users set a “clarity preference” slider; the backend translates it into cost parameters, instantly adapting the assistant’s behavior.
- Cost‑sensitive enterprise bots – In high‑throughput support settings, minimizing back‑and‑forth saves time; the policy can be tuned to prioritize speed over exhaustive clarification.
- Rapid prototyping – Developers can experiment with different trade‑offs without retraining, accelerating A/B testing of conversational strategies.
- Reduced annotation burden – Since the training data is generated via self‑play, teams can bootstrap clarification policies for new domains (e.g., medical triage, code assistance) without costly human labeling.
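To show how “swap the cost vector at runtime” might look in a deployment, here is a small hedged sketch. The device profiles, the 0‑to‑1 slider scale, and the mapping function are invented for illustration; only the general idea of translating device constraints and a “clarity preference” setting into cost parameters comes from the summary.

```python
# Hypothetical runtime steering: translate a device profile and a user-facing
# "clarity preference" slider (0 = never ask, 1 = ask freely) into the cost
# vector the assistant conditions on. All numbers here are illustrative.
DEVICE_BASE_COSTS = {
    "smartwatch": {"guess": 0.0, "enumerate": 0.3, "clarify": 0.6},  # terse UI: asking is expensive
    "voice":      {"guess": 0.0, "enumerate": 0.4, "clarify": 0.5},  # long lists are painful to hear
    "desktop":    {"guess": 0.0, "enumerate": 0.1, "clarify": 0.2},  # cheap to read and reply
}


def costs_for(device: str, clarity_preference: float) -> dict:
    """Scale the device's clarification cost down as the user's preference
    for being asked goes up; no retraining is involved."""
    base = dict(DEVICE_BASE_COSTS[device])
    base["clarify"] *= 1.0 - 0.8 * clarity_preference
    return base


# Same model, different behavior, chosen at request time:
print(costs_for("smartwatch", clarity_preference=0.2))  # high clarify cost -> guess more
print(costs_for("desktop", clarity_preference=0.9))     # low clarify cost -> ask more
```

The mapping itself is a product decision; the paper's contribution is that the downstream policy responds predictably to whatever cost vector the mapping produces.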
Limitations & Future Work
- Simulation fidelity – The user agent is a scripted simulator; real‑world user behavior (hesitation, partial answers) may differ, potentially limiting transferability.
- Scalability of cost dimensions – The current formulation assumes a small, fixed set of actions; extending to richer action spaces (e.g., multi‑modal clarifications) may require more sophisticated cost modeling.
- Reward design – The linear penalty on action cost is simplistic; future work could explore more nuanced utility functions that capture user satisfaction or latency.
- Evaluation breadth – Experiments focus on benchmark QA datasets; applying the approach to open‑domain dialogue or multi‑turn task completion remains an open avenue.
Overall, the paper presents a compelling recipe for building flexible, cost‑aware clarification strategies that can be tuned on the fly—a capability that many production AI assistants are eager to adopt.
Authors
- Jonathan Berant
- Maximillian Chen
- Adam Fisch
- Reza Aghajani
- Fantine Huot
- Mirella Lapata
- Jacob Eisenstein
Paper Information
- arXiv ID: 2512.04068v1
- Categories: cs.LG
- Published: December 3, 2025