[Paper] Learning Steerable Clarification Policies with Collaborative Self-play
Source: arXiv - 2512.04068v1
Overview
The paper tackles a core problem for AI assistants: deciding when to answer, when to list multiple possibilities, and when to ask a clarifying question for ambiguous user inputs. By framing this decision‑making as a steerable policy that can be tuned with simple cost parameters (e.g., “how expensive is it to ask a follow‑up?”), the authors show how an assistant can adapt its behavior to different devices, user preferences, or interaction modalities.
Key Contributions
- Steerable clarification policy – Introduces a model that takes explicit numerical costs for each possible action (guess, enumerate, ask) and learns to trade off accuracy against those costs (a toy illustration follows this list).
- Collaborative self‑play framework – Uses two agents (a simulated user and a simulated assistant) that converse with each other, generating rich training data without human annotation.
- Reinforced Self‑Training (ReST) – Adapts a training loop that combines reinforcement learning (to maximize cost‑penalized accuracy) with self‑training (to bootstrap from the model's own high‑reward trajectories) to the self‑play clarification setting.
- Generalization to unseen cost settings – Demonstrates that the learned policy can adapt to cost values never seen during training, enabling on‑the‑fly steering.
- Empirical validation – Shows measurable gains in reward and downstream accuracy compared to static baselines across several benchmark datasets.
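To make the cost‑conditioned trade‑off concrete, here is a toy Python sketch. The three actions and the idea of explicit numeric costs come from the paper summary; the CostVector class, the pick_action rule, and every number below are illustrative assumptions, not the authors' learned policy (which is a language model rather than an explicit utility calculation).

```python
# Toy illustration (not the paper's learned policy) of trading off expected
# accuracy against explicit per-action costs. The three actions and the idea
# of numeric costs come from the summary; the numbers are made up.
from dataclasses import dataclass


@dataclass
class CostVector:
    guess: float
    enumerate_: float   # trailing underscore avoids shadowing the builtin
    clarify: float


def pick_action(expected_accuracy: dict, costs: CostVector) -> str:
    """Choose the action whose expected accuracy minus cost is highest."""
    utilities = {
        "guess": expected_accuracy["guess"] - costs.guess,
        "enumerate": expected_accuracy["enumerate"] - costs.enumerate_,
        "clarify": expected_accuracy["clarify"] - costs.clarify,
    }
    return max(utilities, key=utilities.get)


estimates = {"guess": 0.55, "enumerate": 0.70, "clarify": 0.90}

# High clarification cost (e.g., a voice-only device): the policy avoids asking.
print(pick_action(estimates, CostVector(guess=0.0, enumerate_=0.1, clarify=0.5)))   # -> enumerate

# Cheap clarification (e.g., desktop chat): asking becomes worthwhile.
print(pick_action(estimates, CostVector(guess=0.0, enumerate_=0.1, clarify=0.05)))  # -> clarify
```

The point of the toy is only that the same accuracy estimates lead to different actions once the cost vector changes, which is the behavior the learned policy is trained to exhibit.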
Methodology
- Two‑agent self‑play
  - User agent: Generates an ambiguous query and a hidden “true intent”.
  - Assistant agent: Receives the query plus a vector of costs (e.g., cost_guess, cost_enumerate, cost_clarify) and decides which action to take at each turn.
- Action space – The assistant can:
  - Guess the intent and answer directly.
  - Enumerate a set of plausible intents and answer each.
  - Ask a clarification question (costly, but it may improve downstream accuracy).
- Reward signal – After the conversation ends, the assistant receives reward = accuracy − Σ(action costs). This encourages the model to be accurate while keeping interaction overhead low.
- Reinforced Self‑Training (ReST)
  - Reinforcement step: Policy‑gradient updates maximize the expected reward on self‑generated dialogues.
  - Self‑training step: The assistant’s own high‑reward trajectories are used as pseudo‑labels to further fine‑tune the underlying language model, stabilizing training.
- Steering mechanism – By feeding different cost vectors at inference time, developers can “steer” the assistant to be more conservative (ask more clarifications) or more aggressive (guess more often) without retraining. A sketch of the reward and the training loop follows this list.
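The following is a minimal sketch, under stated assumptions, of the reward signal and the self‑training half of a ReST‑style round. Only the reward formula (accuracy minus summed action costs) and the idea of sampling cost vectors during self‑play come from the summary; generate_dialogue, finetune_on, and all hyperparameters are hypothetical placeholders, and the policy‑gradient step is omitted.

```python
# Hedged sketch of the cost-penalized reward and a ReST-style round.
# `generate_dialogue` and `finetune_on` are hypothetical stand-ins; only the
# reward formula (accuracy minus summed action costs) comes from the summary.
import random
from typing import Callable, Dict, List, Tuple

ACTIONS = ("guess", "enumerate", "clarify")


def episode_reward(accuracy: float, actions_taken: List[str], costs: Dict[str, float]) -> float:
    """reward = accuracy - sum of the costs of the actions taken in the dialogue."""
    return accuracy - sum(costs[a] for a in actions_taken)


def rest_style_round(
    policy,                                   # current assistant model (opaque here)
    generate_dialogue: Callable[..., Tuple[List[str], float]],  # self-play rollout -> (actions, accuracy)
    finetune_on: Callable,                    # supervised fine-tuning on selected trajectories
    cost_range=(0.0, 0.5),
    num_dialogues: int = 1000,
    keep_fraction: float = 0.2,
):
    """One grow-then-improve round: sample cost vectors, roll out self-play
    dialogues, keep the highest-reward trajectories, and fine-tune on them."""
    trajectories = []
    for _ in range(num_dialogues):
        # Sample a fresh cost vector so the policy sees many trade-off settings.
        costs = {a: random.uniform(*cost_range) for a in ACTIONS}
        actions_taken, accuracy = generate_dialogue(policy, costs)
        reward = episode_reward(accuracy, actions_taken, costs)
        trajectories.append((reward, costs, actions_taken))

    # Self-training step: treat the top-reward rollouts as pseudo-labels.
    trajectories.sort(key=lambda t: t[0], reverse=True)
    selected = trajectories[: int(keep_fraction * num_dialogues)]
    return finetune_on(policy, selected)
```

Because the assistant conditions on the sampled cost vector rather than on a single training‑time constant, the same mechanism supports steering: supplying a different vector at inference time shifts behavior without retraining.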
Results & Findings
| Metric | Static Baseline | ReST‑trained Steerable Policy |
|---|---|---|
| Cost‑penalized accuracy (reward) | 0.62 | 0.71 (+14.5%) |
| Pure accuracy (ignoring cost) | 0.78 | 0.81 (+3.8%) |
| Average number of clarification turns | 0.0 (always guess) | 0.4 (adjustable) |
| Generalization to unseen cost vectors | 0.55 | 0.68 |
- The model reliably shifts its behavior when the cost of clarification is increased (fewer questions) or decreased (more questions).
- Even when the policy is given cost values outside the training distribution, its performance degrades gracefully, confirming the policy’s robustness.
- Human‑in‑the‑loop evaluations (small user study) reported higher satisfaction for the steerable assistant because it respected device constraints (e.g., fewer clarifications on voice‑only devices).
Practical Implications
- Device‑aware assistants – Deploy the same model on a smartwatch (high clarification cost) and a desktop (low cost) simply by swapping the cost vector at runtime (a hedged sketch follows this list).
- User‑personalized interaction – Let users set a “clarity preference” slider; the backend translates it into cost parameters, instantly adapting the assistant’s behavior.
- Cost‑sensitive enterprise bots – In high‑throughput support settings, minimizing back‑and‑forth saves time; the policy can be tuned to prioritize speed over exhaustive clarification.
- Rapid prototyping – Developers can experiment with different trade‑offs without retraining, accelerating A/B testing of conversational strategies.
- Reduced annotation burden – Since the training data is generated via self‑play, teams can bootstrap clarification policies for new domains (e.g., medical triage, code assistance) without costly human labeling.
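To show how “swap the cost vector at runtime” might look in a deployment, here is a small hedged sketch. The device profiles, the 0‑to‑1 slider scale, and the mapping function are invented for illustration; only the general idea of translating device constraints and a “clarity preference” setting into cost parameters comes from the summary.

```python
# Hypothetical runtime steering: translate a device profile and a user-facing
# "clarity preference" slider (0 = never ask, 1 = ask freely) into the cost
# vector the assistant conditions on. All numbers here are illustrative.
DEVICE_BASE_COSTS = {
    "smartwatch": {"guess": 0.0, "enumerate": 0.3, "clarify": 0.6},  # terse UI: asking is expensive
    "voice":      {"guess": 0.0, "enumerate": 0.4, "clarify": 0.5},  # long lists are painful to hear
    "desktop":    {"guess": 0.0, "enumerate": 0.1, "clarify": 0.2},  # cheap to read and reply
}


def costs_for(device: str, clarity_preference: float) -> dict:
    """Scale the device's clarification cost down as the user's preference
    for being asked goes up; no retraining is involved."""
    base = dict(DEVICE_BASE_COSTS[device])
    base["clarify"] *= 1.0 - 0.8 * clarity_preference
    return base


# Same model, different behavior, chosen at request time:
print(costs_for("smartwatch", clarity_preference=0.2))  # high clarify cost -> guess more
print(costs_for("desktop", clarity_preference=0.9))     # low clarify cost -> ask more
```

The mapping itself is a product decision; the paper's contribution is that the downstream policy responds predictably to whatever cost vector the mapping produces.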
Limitations & Future Work
- Simulation fidelity – The user agent is a scripted simulator; real‑world user behavior (hesitation, partial answers) may differ, potentially limiting transferability.
- Scalability of cost dimensions – The current formulation assumes a small, fixed set of actions; extending to richer action spaces (e.g., multi‑modal clarifications) may require more sophisticated cost modeling.
- Reward design – The linear penalty on action cost is simplistic; future work could explore more nuanced utility functions that capture user satisfaction or latency.
- Evaluation breadth – Experiments focus on benchmark QA datasets; applying the approach to open‑domain dialogue or multi‑turn task completion remains an open avenue.
Overall, the paper presents a compelling recipe for building flexible, cost‑aware clarification strategies that can be tuned on the fly—a capability that many production AI assistants are eager to adopt.
Authors
- Jonathan Berant
- Maximillian Chen
- Adam Fisch
- Reza Aghajani
- Fantine Huot
- Mirella Lapata
- Jacob Eisenstein
Paper Information
- arXiv ID: 2512.04068v1
- Categories: cs.LG
- Published: December 3, 2025