[Paper] Eliciting Behaviors in Multi-Turn Conversations
Source: arXiv - 2512.23701v1
Overview
The paper Eliciting Behaviors in Multi‑Turn Conversations examines how to coax large language models (LLMs) into revealing hidden or undesirable behaviors during a back‑and‑forth dialogue. While prior work focused on single‑turn prompts, the authors extend the idea to multi‑turn interactions and show that “online” (adaptive) methods can discover many more failure cases with a modest query budget.
Key Contributions
- Analytical taxonomy of behavior‑elicitation techniques, grouping them into three families: prior‑knowledge only, offline interaction, and online interaction methods.
- Unified multi‑turn formulation that bridges single‑turn and multi‑turn elicitation under a single mathematical framework.
- Comprehensive empirical evaluation of all three families on automatically generated multi‑turn test cases across three benchmark tasks.
- Query‑budget vs. success‑rate analysis, demonstrating that online methods achieve up to 77 % success with only a few thousand model queries, far surpassing static benchmarks.
- Call for dynamic benchmarks that evolve with the model rather than relying on static, pre‑written test suites.
Methodology
- Problem framing – The authors treat behavior elicitation as a search problem: given a target LLM, find a conversation (a sequence of user‑assistant turns) that triggers a specific, often unwanted, response.
- Three method families:
  - Prior‑knowledge only: hand‑crafted prompts derived from domain expertise; no interaction with the model during search.
  - Offline interaction: generate a large pool of candidate prompts, evaluate them once on the model, then pick the best ones. No further adaptation.
  - Online interaction: iteratively query the model, using the feedback from each turn to refine the next prompt (e.g., via reinforcement‑learning‑style updates or Bayesian optimization).
- Generalized multi‑turn formulation – The authors extend the online approach to handle multiple dialogue turns, allowing the system to adapt its strategy after each model response (a minimal code sketch follows this list).
- Benchmark generation – They automatically synthesize multi‑turn test cases for three tasks (e.g., safety violations, factual errors, policy breaches) and run each method family against them.
- Efficiency metrics – Two key numbers are tracked: query budget (total model calls) and success rate (percentage of test cases where the target behavior is successfully elicited).
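The paper's search procedure is not reproduced here, but a minimal sketch of the online, multi‑turn idea may help. Everything below is assumed for illustration: a `query_model` callable standing in for the target LLM's API, a `judge` scorer that rates how strongly a reply exhibits the target behavior, and a naive propose‑and‑refine heuristic in place of the authors' reinforcement‑learning‑ or Bayesian‑style search.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (not from the paper): `query_model` sends a
# conversation to the target LLM and returns its reply; `judge` scores a
# reply in [0, 1] for how strongly it exhibits the target behavior.
Conversation = List[Tuple[str, str]]  # (role, text) turns


def elicit_online(
    query_model: Callable[[Conversation], str],
    judge: Callable[[str], float],
    seed_prompts: List[str],
    max_turns: int = 5,
    query_budget: int = 3000,
    threshold: float = 0.5,
) -> Tuple[Optional[Conversation], int]:
    """Adaptive multi-turn search: choose each next user turn based on the
    model's previous reply, instead of scoring a fixed prompt pool once."""
    queries = 0
    while queries < query_budget:
        convo: Conversation = []
        prompt = random.choice(seed_prompts)
        for _ in range(max_turns):
            convo.append(("user", prompt))
            reply = query_model(convo)
            queries += 1
            convo.append(("assistant", reply))
            if judge(reply) >= threshold:
                return convo, queries  # target behavior elicited
            if queries >= query_budget:
                break
            # Naive online refinement: fold the last reply into the next
            # turn (the paper's methods adapt far more systematically).
            prompt = f"Earlier you said: {reply[:80]!r}. {random.choice(seed_prompts)}"
    return None, queries  # budget exhausted without eliciting the behavior
```

For contrast, a prior‑knowledge‑only baseline would simply submit hand‑written prompts, and an offline method would score one large pre‑generated pool once; only the loop above changes its strategy between turns, which is what the online family buys.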
Results & Findings
| Method family | Avg. success rate* | Queries needed (≈) |
|---|---|---|
| Prior‑knowledge only | 19 % | – (no adaptive queries) |
| Offline interaction | 45 % | ~5 k |
| Online interaction (multi‑turn) | 77 % | ~3 k |
*Success rate is averaged over the three evaluation tasks.
- Online multi‑turn methods consistently outperformed static baselines, even when the latter were tuned on the same tasks.
- The query‑budget curve shows diminishing returns after a few thousand queries, suggesting a sweet spot for practical testing pipelines (a toy illustration of such a curve follows this list).
- Existing static multi‑turn conversation benchmarks often missed failure cases that the online approach uncovered, highlighting a blind spot in current evaluation practices.
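To make the budget‑versus‑success trade‑off concrete, the short helper below tabulates a success‑rate curve from per‑case query counts. The input numbers are invented placeholders, not results from the paper.

```python
from bisect import bisect_right
from typing import List, Optional


def success_curve(queries_to_success: List[Optional[int]],
                  budgets: List[int]) -> List[float]:
    """For each budget, return the fraction of test cases whose target
    behavior was elicited within that many queries (None = never)."""
    solved = sorted(q for q in queries_to_success if q is not None)
    total = len(queries_to_success)
    return [bisect_right(solved, b) / total for b in budgets]


# Invented per-case query counts, purely to illustrate the curve's shape.
cases = [120, 450, None, 900, 2800, 1500, None, 3100, 700, 2400]
for budget, rate in zip([500, 1000, 3000, 5000],
                        success_curve(cases, [500, 1000, 3000, 5000])):
    print(f"budget={budget:>5}  success_rate={rate:.0%}")
```

Printing or plotting the curve at increasing budgets makes the diminishing‑returns point, and hence a practical budget cutoff, easy to spot.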
Practical Implications
- Dynamic testing pipelines: Teams building LLM‑powered chatbots can integrate an online elicitation loop into their CI/CD process to automatically surface hidden bugs before release (see the sketch after this list).
- Safety & compliance audits: Regulators and internal compliance teams can use the multi‑turn framework to probe for policy violations that only emerge after several conversational turns.
- Cost‑effective evaluation: Because the method achieves high success with only a few thousand queries, it remains affordable even for large proprietary models where API calls are expensive.
- Benchmark evolution: Instead of maintaining static test suites, organizations can continuously generate fresh adversarial dialogues, keeping the evaluation relevant as models are updated.
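As one way such a pipeline could look, the sketch below reuses the `elicit_online` loop from the Methodology section and assumes a hypothetical `BehaviorSpec` container (all names invented for illustration): the gate blocks a release when the adaptive loop can still elicit too many undesired behaviors within a fixed per‑behavior budget.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BehaviorSpec:
    """Hypothetical container: one undesired behavior to probe for."""
    name: str
    seed_prompts: List[str]
    judge: Callable[[str], float]


def ci_elicitation_gate(specs: List[BehaviorSpec],
                        query_model,
                        budget_per_spec: int = 3000,
                        max_elicited_fraction: float = 0.05) -> bool:
    """Run the adaptive loop (elicit_online, sketched earlier) against each
    spec; return False, i.e. block the release, if the loop can still
    elicit too large a fraction of the undesired behaviors."""
    elicited = 0
    for spec in specs:
        convo, used = elicit_online(
            query_model=query_model,
            judge=spec.judge,
            seed_prompts=spec.seed_prompts,
            query_budget=budget_per_spec,
        )
        if convo is not None:  # the behavior was elicited within budget
            elicited += 1
            print(f"[elicited] {spec.name}: reproduced in {used} queries")
    return elicited / max(len(specs), 1) <= max_elicited_fraction
```

Thresholds such as `max_elicited_fraction` are policy decisions; the point is only that an adaptive loop, rather than a static prompt list, drives the gate.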
Limitations & Future Work
- The study focuses on three specific tasks; broader domain coverage (e.g., code generation, multilingual dialogue) remains to be validated.
- Query budget constraints: While a few thousand queries are modest, very large models with high per‑query cost may still find this prohibitive for exhaustive testing.
- The online approach relies on feedback signals (e.g., classifier scores) that may be noisy or biased; improving robustness to noisy rewards is an open challenge.
- Future research could explore human‑in‑the‑loop refinements, richer multi‑modal interactions, and formal guarantees about coverage of the behavior space.
Authors
- Jing Huang
- Shujian Zhang
- Lun Wang
- Andrew Hard
- Rajiv Mathews
- John Lambert
Paper Information
- arXiv ID: 2512.23701v1
- Categories: cs.CL, cs.LG
- Published: December 29, 2025