[Paper] Eliciting Behaviors in Multi-Turn Conversations
Source: arXiv - 2512.23701v1
Overview
The paper Eliciting Behaviors in Multi‑Turn Conversations examines how to coax large language models (LLMs) into revealing hidden or undesirable behaviors during a back‑and‑forth dialogue. While prior work focused on single‑turn prompts, the authors extend the idea to multi‑turn interactions and show that “online” (adaptive) methods can discover many more failure cases with a modest query budget.
Key Contributions
- Analytical taxonomy of behavior‑elicitation techniques, grouping them into three families: prior‑knowledge only, offline interaction, and online interaction methods.
- Unified multi‑turn formulation that bridges single‑turn and multi‑turn elicitation under a single mathematical framework.
- Comprehensive empirical evaluation of all three families on automatically generated multi‑turn test cases across three benchmark tasks.
- Query‑budget vs. success‑rate analysis, demonstrating that online methods achieve up to 77 % success with only a few thousand model queries, far surpassing static benchmarks.
- Call for dynamic benchmarks that evolve with the model rather than relying on static, pre‑written test suites.
Methodology
- Problem framing – The authors treat behavior elicitation as a search problem: given a target LLM, find a conversation (a sequence of user‑assistant turns) that triggers a specific, often unwanted, response.
- Three method families:
  - Prior‑knowledge only: hand‑crafted prompts derived from domain expertise; no interaction with the model during search.
  - Offline interaction: generate a large pool of candidate prompts, evaluate them once on the model, then pick the best ones. No further adaptation.
  - Online interaction: iteratively query the model, using the feedback from each turn to refine the next prompt (e.g., via reinforcement‑learning‑style updates or Bayesian optimization).
- Generalized multi‑turn formulation – The authors extend the online approach to handle multiple dialogue turns, allowing the system to adapt its strategy after each model response (a minimal code sketch follows this list).
- Benchmark generation – They automatically synthesize multi‑turn test cases for three tasks (e.g., safety violations, factual errors, policy breaches) and run each method family against them.
- Efficiency metrics – Two key numbers are tracked: query budget (total model calls) and success rate (percentage of test cases where the target behavior is successfully elicited).
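The paper's search procedure is not reproduced here, but a minimal sketch of the online, multi‑turn idea may help. Everything below is assumed for illustration: a `query_model` callable standing in for the target LLM's API, a `judge` scorer that rates how strongly a reply exhibits the target behavior, and a naive propose‑and‑refine heuristic in place of the authors' reinforcement‑learning‑ or Bayesian‑style search.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (not from the paper): `query_model` sends a
# conversation to the target LLM and returns its reply; `judge` scores a
# reply in [0, 1] for how strongly it exhibits the target behavior.
Conversation = List[Tuple[str, str]]  # (role, text) turns


def elicit_online(
    query_model: Callable[[Conversation], str],
    judge: Callable[[str], float],
    seed_prompts: List[str],
    max_turns: int = 5,
    query_budget: int = 3000,
    threshold: float = 0.5,
) -> Tuple[Optional[Conversation], int]:
    """Adaptive multi-turn search: choose each next user turn based on the
    model's previous reply, instead of scoring a fixed prompt pool once."""
    queries = 0
    while queries < query_budget:
        convo: Conversation = []
        prompt = random.choice(seed_prompts)
        for _ in range(max_turns):
            convo.append(("user", prompt))
            reply = query_model(convo)
            queries += 1
            convo.append(("assistant", reply))
            if judge(reply) >= threshold:
                return convo, queries  # target behavior elicited
            if queries >= query_budget:
                break
            # Naive online refinement: fold the last reply into the next
            # turn (the paper's methods adapt far more systematically).
            prompt = f"Earlier you said: {reply[:80]!r}. {random.choice(seed_prompts)}"
    return None, queries  # budget exhausted without eliciting the behavior
```

For contrast, a prior‑knowledge‑only baseline would simply submit hand‑written prompts, and an offline method would score one large pre‑generated pool once; only the loop above changes its strategy between turns, which is what the online family buys.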
Results & Findings
| Method family | Avg. success rate* | Queries needed (≈) |
|---|---|---|
| Prior‑knowledge only | 19 % | – (no adaptive queries) |
| Offline interaction | 45 % | ~5 k |
| Online interaction (multi‑turn) | 77 % | ~3 k |
*Success rate is averaged over the three evaluation tasks.
- Online multi‑turn methods consistently outperformed static baselines, even when the latter were tuned on the same tasks.
- The query‑budget curve shows diminishing returns after a few thousand queries, suggesting a sweet spot for practical testing pipelines (a toy illustration of such a curve follows this list).
- Existing static multi‑turn conversation benchmarks often missed failure cases that the online approach uncovered, highlighting a blind spot in current evaluation practices.
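To make the budget‑versus‑success trade‑off concrete, the short helper below tabulates a success‑rate curve from per‑case query counts. The input numbers are invented placeholders, not results from the paper.

```python
from bisect import bisect_right
from typing import List, Optional


def success_curve(queries_to_success: List[Optional[int]],
                  budgets: List[int]) -> List[float]:
    """For each budget, return the fraction of test cases whose target
    behavior was elicited within that many queries (None = never)."""
    solved = sorted(q for q in queries_to_success if q is not None)
    total = len(queries_to_success)
    return [bisect_right(solved, b) / total for b in budgets]


# Invented per-case query counts, purely to illustrate the curve's shape.
cases = [120, 450, None, 900, 2800, 1500, None, 3100, 700, 2400]
for budget, rate in zip([500, 1000, 3000, 5000],
                        success_curve(cases, [500, 1000, 3000, 5000])):
    print(f"budget={budget:>5}  success_rate={rate:.0%}")
```

Printing or plotting the curve at increasing budgets makes the diminishing‑returns point, and hence a practical budget cutoff, easy to spot.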
Practical Implications
- Dynamic testing pipelines: Teams building LLM‑powered chatbots can integrate an online elicitation loop into their CI/CD process to automatically surface hidden bugs before release (see the sketch after this list).
- Safety & compliance audits: Regulators and internal compliance teams can use the multi‑turn framework to probe for policy violations that only emerge after several conversational turns.
- Cost‑effective evaluation: Because the method achieves high success with only a few thousand queries, it remains affordable even for large proprietary models where API calls are expensive.
- Benchmark evolution: Instead of maintaining static test suites, organizations can continuously generate fresh adversarial dialogues, keeping the evaluation relevant as models are updated.
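As one way such a pipeline could look, the sketch below reuses the `elicit_online` loop from the Methodology section and assumes a hypothetical `BehaviorSpec` container (all names invented for illustration): the gate blocks a release when the adaptive loop can still elicit too many undesired behaviors within a fixed per‑behavior budget.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BehaviorSpec:
    """Hypothetical container: one undesired behavior to probe for."""
    name: str
    seed_prompts: List[str]
    judge: Callable[[str], float]


def ci_elicitation_gate(specs: List[BehaviorSpec],
                        query_model,
                        budget_per_spec: int = 3000,
                        max_elicited_fraction: float = 0.05) -> bool:
    """Run the adaptive loop (elicit_online, sketched earlier) against each
    spec; return False, i.e. block the release, if the loop can still
    elicit too large a fraction of the undesired behaviors."""
    elicited = 0
    for spec in specs:
        convo, used = elicit_online(
            query_model=query_model,
            judge=spec.judge,
            seed_prompts=spec.seed_prompts,
            query_budget=budget_per_spec,
        )
        if convo is not None:  # the behavior was elicited within budget
            elicited += 1
            print(f"[elicited] {spec.name}: reproduced in {used} queries")
    return elicited / max(len(specs), 1) <= max_elicited_fraction
```

Thresholds such as `max_elicited_fraction` are policy decisions; the point is only that an adaptive loop, rather than a static prompt list, drives the gate.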
Limitations & Future Work
- The study focuses on three specific tasks; broader domain coverage (e.g., code generation, multilingual dialogue) remains to be validated.
- Query budget constraints: While a few thousand queries are modest, very large models with high per‑query cost may still find this prohibitive for exhaustive testing.
- The online approach relies on feedback signals (e.g., classifier scores) that may be noisy or biased; improving robustness to noisy rewards is an open challenge.
- Future research could explore human‑in‑the‑loop refinements, richer multi‑modal interactions, and formal guarantees about coverage of the behavior space.
Authors
- Jing Huang
- Shujian Zhang
- Lun Wang
- Andrew Hard
- Rajiv Mathews
- John Lambert
Paper Information
- arXiv ID: 2512.23701v1
- Categories: cs.CL, cs.LG
- Published: December 29, 2025