[Paper] Learning Steerable Clarification Policies with Collaborative Self-play

Published: December 3, 2025 at 01:49 PM EST
4 min read
Source: arXiv - 2512.04068v1

Overview

The paper tackles a core problem for AI assistants: deciding when to answer, when to list multiple possibilities, and when to ask a clarifying question for ambiguous user inputs. By framing this decision‑making as a steerable policy that can be tuned with simple cost parameters (e.g., “how expensive is it to ask a follow‑up?”), the authors show how an assistant can adapt its behavior to different devices, user preferences, or interaction modalities.

Key Contributions

  • Steerable clarification policy – Introduces a model that takes explicit numerical costs for each possible action (guess, enumerate, ask) and learns to trade off accuracy against those costs (a minimal interface sketch follows this list).
  • Collaborative self‑play framework – Uses two agents (a simulated user and a simulated assistant) that converse with each other, generating rich training data without human annotation.
  • Reinforced Self‑Training (ReST) – Trains the policy with a loop that combines reinforcement learning (to maximize cost‑penalized accuracy) with self‑training (bootstrapping from the model’s own high‑reward outputs).
  • Generalization to unseen cost settings – Demonstrates that the learned policy can adapt to cost values never seen during training, enabling on‑the‑fly steering.
  • Empirical validation – Shows measurable gains in reward and downstream accuracy compared to static baselines across several benchmark datasets.
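
To make “steerable” concrete, here is a minimal sketch of a cost‑conditioned action choice. The action names and the idea of explicit per‑action costs come from the paper; the expected‑accuracy estimates, the argmax rule, and all numbers are illustrative assumptions, not the authors’ implementation.

```python
# Hypothetical cost-conditioned action choice: the action names and the idea of
# explicit per-action costs follow the paper; the accuracy estimates and the
# argmax rule are illustrative assumptions.

ACTIONS = ["guess", "enumerate", "clarify"]

def choose_action(expected_accuracy: dict[str, float],
                  costs: dict[str, float]) -> str:
    """Return the action with the highest cost-penalized expected accuracy."""
    return max(ACTIONS, key=lambda a: expected_accuracy[a] - costs[a])

# The same (assumed) accuracy estimates, steered by two different cost vectors.
est = {"guess": 0.60, "enumerate": 0.70, "clarify": 0.90}

cheap_clarify = {"guess": 0.00, "enumerate": 0.10, "clarify": 0.05}
print(choose_action(est, cheap_clarify))   # clarify

pricey_clarify = {"guess": 0.00, "enumerate": 0.15, "clarify": 0.50}
print(choose_action(est, pricey_clarify))  # guess (e.g., a voice-only device)
```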

Methodology

  1. Two‑agent self‑play

    • User agent: Generates an ambiguous query and a hidden “true intent”.
    • Assistant agent: Receives the query plus a vector of costs (e.g., cost_guess, cost_enumerate, cost_clarify) and decides which action to take at each turn.
  2. Action space – The assistant can:

    • Guess the intent and answer directly.
    • Enumerate a set of plausible intents and answer each.
    • Ask a clarification question (costly but may improve downstream accuracy).
  3. Reward signal – After the conversation ends, the assistant receives a reward equal to the answer’s accuracy minus the sum of the incurred action costs (reward = accuracy − Σ action costs). This encourages the model to be accurate while keeping interaction overhead low (a toy version of the full loop is sketched after this list).

  4. Reinforced Self‑Training (ReST)

    • Reinforcement step: Policy‑gradient updates maximize the expected reward on self‑generated dialogues.
    • Self‑training step: The assistant’s own high‑reward trajectories are used as pseudo‑labels to further fine‑tune the underlying language model, stabilizing training.
  5. Steering mechanism – By feeding different cost vectors at inference time, developers can “steer” the assistant to be more conservative (ask more clarifications) or more aggressive (guess more often) without retraining.
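
The following toy sketch ties these pieces together: one round of collaborative self‑play, the cost‑penalized reward, and a ReST‑style update that keeps high‑reward dialogues as pseudo‑labels. The simulator, the single‑action “policy”, and the fine‑tuning rule are hypothetical stand‑ins; only the reward shape and the filtering step follow the description above.

```python
import random

# Toy sketch of one collaborative self-play round followed by a ReST-style
# update. The simulator, the one-action "policy", and the fine-tuning rule are
# illustrative assumptions; only the reward shape (accuracy minus summed
# action costs) and the "keep high-reward dialogues" step follow the paper.

INTENTS = ["weather_today", "weather_week"]

class ToyUserSimulator:
    def sample_task(self):
        # Hidden true intent plus an ambiguous surface query.
        return random.choice(INTENTS), "What's the weather?"

class ToyAssistant:
    def __init__(self):
        self.clarify_rate = 0.5  # crude stand-in for learned policy parameters

    def act(self, query, costs):
        # One action per episode; the real policy is multi-turn and learned.
        return "clarify" if random.random() < self.clarify_rate else "guess"

def run_episode(assistant, user, costs):
    intent, query = user.sample_task()
    action = assistant.act(query, costs)
    if action == "clarify":
        predicted = intent                   # the user's reply resolves the ambiguity
    else:
        predicted = random.choice(INTENTS)   # guessing may miss
    accuracy = 1.0 if predicted == intent else 0.0
    return action, accuracy - costs[action]  # reward = accuracy - summed action costs

def rest_round(assistant, user, cost_vectors, n_dialogues=500, threshold=0.6):
    """One self-play data-collection round plus a ReST-style policy update."""
    kept = []
    for _ in range(n_dialogues):
        costs = random.choice(cost_vectors)  # sample a steering setting
        action, reward = run_episode(assistant, user, costs)
        if reward >= threshold:              # self-training filter
            kept.append(action)
    # "Fine-tune" the toy policy toward whatever behavior earned high reward.
    if kept:
        assistant.clarify_rate = kept.count("clarify") / len(kept)
    return assistant

assistant = rest_round(ToyAssistant(), ToyUserSimulator(),
                       [{"guess": 0.0, "enumerate": 0.1, "clarify": 0.1}])
print(round(assistant.clarify_rate, 2))      # drifts upward: clarifying is cheap here
```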

Results & Findings

Static baseline → ReST‑trained steerable policy:

  • Cost‑penalized accuracy (reward): 0.62 → 0.71 (+14.5%)
  • Pure accuracy (ignoring cost): 0.78 → 0.81 (+3.8%)
  • Average number of clarification turns: 0.0 (always guess) → 0.4 (adjustable)
  • Generalization to unseen cost vectors: 0.55 → 0.68

  • The model reliably shifts its behavior when the cost of clarification is increased (fewer questions) or decreased (more questions).
  • Even with cost values outside the training distribution, performance degrades gracefully, suggesting the policy is reasonably robust.
  • Human‑in‑the‑loop evaluations (small user study) reported higher satisfaction for the steerable assistant because it respected device constraints (e.g., fewer clarifications on voice‑only devices).

Practical Implications

  • Device‑aware assistants – Deploy the same model on a smartwatch (high clarification cost) and a desktop (low cost) simply by swapping the cost vector at runtime (see the sketch after this list).
  • User‑personalized interaction – Let users set a “clarity preference” slider; the backend translates it into cost parameters, instantly adapting the assistant’s behavior.
  • Cost‑sensitive enterprise bots – In high‑throughput support settings, minimizing back‑and‑forth saves time; the policy can be tuned to prioritize speed over exhaustive clarification.
  • Rapid prototyping – Developers can experiment with different trade‑offs without retraining, accelerating A/B testing of conversational strategies.
  • Reduced annotation burden – Since the training data is generated via self‑play, teams can bootstrap clarification policies for new domains (e.g., medical triage, code assistance) without costly human labeling.
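
As a sketch of the “swap the cost vector at runtime” idea from the first two bullets above, the snippet below maps a device profile and a user‑facing clarity slider to the costs fed to an already‑trained policy. The profile names, numbers, blending rule, and the assistant.respond call are all assumptions; the only point taken from the paper is that steering happens through the cost inputs, with no retraining.

```python
# Hypothetical runtime steering: map a device profile or a user-facing
# "clarity preference" slider to the cost vector given to the already-trained
# policy. Profile names, numbers, and the assistant API are assumptions.

DEVICE_PROFILES = {
    "smartwatch": {"guess": 0.0, "enumerate": 0.3, "clarify": 0.4},   # terse UI
    "voice_only": {"guess": 0.0, "enumerate": 0.5, "clarify": 0.3},   # listing is painful
    "desktop":    {"guess": 0.0, "enumerate": 0.1, "clarify": 0.05},  # clarifying is cheap
}

def costs_from_slider(clarity_preference: float) -> dict[str, float]:
    """Map a 0..1 'ask me more questions' slider to a cost vector."""
    p = min(max(clarity_preference, 0.0), 1.0)
    return {
        "guess": 0.0,
        "enumerate": 0.15,
        "clarify": round(0.5 * (1.0 - p), 3),  # higher preference -> cheaper to ask
    }

def answer(assistant, query: str, device: str, clarity_preference: float):
    # Blend device constraints with the user's preference (a design choice here,
    # not something the paper prescribes), then steer the same model via costs.
    device_costs = DEVICE_PROFILES[device]
    user_costs = costs_from_slider(clarity_preference)
    costs = {a: max(device_costs[a], user_costs[a]) for a in device_costs}
    return assistant.respond(query, costs)  # hypothetical call on the trained policy

print(costs_from_slider(0.8))  # {'guess': 0.0, 'enumerate': 0.15, 'clarify': 0.1}
```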

Limitations & Future Work

  • Simulation fidelity – The user agent is a scripted simulator; real‑world user behavior (hesitation, partial answers) may differ, potentially limiting transferability.
  • Scalability of cost dimensions – The current formulation assumes a small, fixed set of actions; extending to richer action spaces (e.g., multi‑modal clarifications) may require more sophisticated cost modeling.
  • Reward design – The linear penalty on action cost is simplistic; future work could explore more nuanced utility functions that capture user satisfaction or latency.
  • Evaluation breadth – Experiments focus on benchmark QA datasets; applying the approach to open‑domain dialogue or multi‑turn task completion remains an open avenue.

Overall, the paper presents a compelling recipe for building flexible, cost‑aware clarification strategies that can be tuned on the fly—a capability that many production AI assistants are eager to adopt.

Authors

  • Jonathan Berant
  • Maximillian Chen
  • Adam Fisch
  • Reza Aghajani
  • Fantine Huot
  • Mirella Lapata
  • Jacob Eisenstein

Paper Information

  • arXiv ID: 2512.04068v1
  • Categories: cs.LG
  • Published: December 3, 2025