[Paper] Eliciting Behaviors in Multi-Turn Conversations

Published: December 29, 2025 at 01:57 PM EST
3 min read

Source: arXiv - 2512.23701v1

Overview

The paper Eliciting Behaviors in Multi‑Turn Conversations examines how to coax large language models (LLMs) into revealing hidden or undesirable behaviors during a back‑and‑forth dialogue. While prior work focused on single‑turn prompts, the authors extend the idea to multi‑turn interactions and show that “online” (adaptive) methods can discover many more failure cases with a modest query budget.

Key Contributions

  • Analytical taxonomy of behavior‑elicitation techniques, grouping them into three families: prior‑knowledge only, offline interaction, and online interaction methods.
  • Unified multi‑turn formulation that bridges single‑turn and multi‑turn elicitation under a single mathematical framework (a rough schematic appears after this list).
  • Comprehensive empirical evaluation of all three families on automatically generated multi‑turn test cases across three benchmark tasks.
  • Query‑budget vs. success‑rate analysis, demonstrating that online methods achieve up to 77 % success with only a few thousand model queries, far surpassing static benchmarks.
  • Call for dynamic benchmarks that evolve with the model rather than relying on static, pre‑written test suites.
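
The paper's exact notation is not reproduced here, but as a rough schematic (our own notation, following the search framing described under Methodology below), multi‑turn behavior elicitation can be posed as a budget‑constrained search:

    \max_{\pi}\;\Pr\!\left[\,\exists\, t \le T:\ \phi_b(x_{\le t},\, y_{\le t}) = 1\,\right]
    \quad \text{where} \quad
    x_t \sim \pi(\cdot \mid x_{<t},\, y_{<t}),\;\;
    y_t \sim p_{\mathrm{LLM}}(\cdot \mid x_{\le t},\, y_{<t}),
    \quad \text{subject to} \quad N_{\text{queries}} \le B

Here \pi is the search policy that proposes user turns x_t, y_t are the target model's replies, \phi_b detects the target behavior b, T bounds the number of turns, and B is the query budget; setting T = 1 recovers the single‑turn case.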

Methodology

  1. Problem framing – The authors treat behavior elicitation as a search problem: given a target LLM, find a conversation (a sequence of user‑assistant turns) that triggers a specific, often unwanted, response.
  2. Three method families
    • Prior‑knowledge only: hand‑crafted prompts derived from domain expertise; no interaction with the model during search.
    • Offline interaction: generate a large pool of candidate prompts, evaluate them once on the model, then pick the best ones. No further adaptation.
    • Online interaction: iteratively query the model, using the feedback from each turn to refine the next prompt (e.g., reinforcement‑learning‑style or Bayesian optimization).
  3. Generalized multi‑turn formulation – The authors extend the online approach to handle multiple dialogue turns, allowing the system to adapt its strategy after each model response (a minimal code sketch of such a loop follows this list).
  4. Benchmark generation – They automatically synthesize multi‑turn test cases for three tasks (e.g., safety violations, factual errors, policy breaches) and run each method family against them.
  5. Efficiency metrics – Two key numbers are tracked: query budget (total model calls) and success rate (percentage of test cases where the target behavior is successfully elicited).
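
As a concrete illustration of the online family, the following is a minimal sketch of an adaptive multi‑turn elicitation loop. It is not the paper's implementation; query_model, judge_score, and propose_next_turn are hypothetical stand‑ins for the target LLM, a behavior detector, and a prompt‑refinement heuristic, respectively.

    def elicit_online(query_model, judge_score, propose_next_turn,
                      max_turns=5, query_budget=3000, threshold=0.9):
        """Adaptive (online) search for a conversation that triggers a target behavior.

        Returns (best_score, best_conversation, queries_used). All three callables
        are hypothetical stand-ins, not an API from the paper.
        """
        queries_used = 0
        best_score, best_conv = 0.0, None

        while queries_used < query_budget:
            conversation = []                       # list of (user_turn, model_reply) pairs
            for _ in range(max_turns):
                if queries_used >= query_budget:
                    break
                # Use the feedback gathered so far to craft the next probe.
                user_turn = propose_next_turn(conversation, best_conv, best_score)
                reply = query_model(conversation, user_turn)
                queries_used += 1
                conversation.append((user_turn, reply))

                score = judge_score(reply)          # e.g., a classifier score in [0, 1]
                if score > best_score:
                    best_score, best_conv = score, list(conversation)
                if score >= threshold:              # target behavior elicited
                    return best_score, best_conv, queries_used

        return best_score, best_conv, queries_used  # best attempt within budget

A prior‑knowledge‑only method would skip the feedback loop entirely (fixed, hand‑written prompts), while an offline method would score a large candidate pool once rather than adapting turn by turn.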

Results & Findings

Method family                   | Avg. success rate* | Queries needed (approx.)
Prior‑knowledge only            | 19 %               | – (no adaptive queries)
Offline interaction             | 45 %               | ~5k
Online interaction (multi‑turn) | 77 %               | ~3k

*Success rate is averaged over the three evaluation tasks.

  • Online multi‑turn methods consistently outperformed static baselines, even when the latter were tuned on the same tasks.
  • The query‑budget curve shows diminishing returns after a few thousand queries, suggesting a sweet spot for practical testing pipelines.
  • Existing static multi‑turn conversation benchmarks often missed failure cases that the online approach uncovered, highlighting a blind spot in current evaluation practices.

Practical Implications

  • Dynamic testing pipelines: Teams building LLM‑powered chatbots can integrate an online elicitation loop into their CI/CD process to automatically surface hidden bugs before release (see the sketch after this list).
  • Safety & compliance audits: Regulators and internal compliance teams can use the multi‑turn framework to probe for policy violations that only emerge after several conversational turns.
  • Cost‑effective evaluation: Because the method achieves high success with only a few thousand queries, it remains affordable even for large proprietary models where API calls are expensive.
  • Benchmark evolution: Instead of maintaining static test suites, organizations can continuously generate fresh adversarial dialogues, keeping the evaluation relevant as models are updated.
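
For the CI/CD use case above, one hypothetical way to wire the loop in is as a release‑gate test. The names elicit_online, query_model, judge_score, and propose_next_turn refer to the stand‑ins sketched in the Methodology section, not to any published API.

    # Hypothetical release gate: fail the build if an adversarial dialogue is found
    # within the query budget. Assumes project-specific adapters for the three callables.
    def test_release_gate_no_elicited_violations():
        best_score, dialogue, queries_used = elicit_online(
            query_model, judge_score, propose_next_turn,
            max_turns=5, query_budget=3000, threshold=0.9,
        )
        assert best_score < 0.9, (
            f"Target behavior elicited after {queries_used} queries:\n{dialogue}"
        )

Run with a test runner such as pytest; a failing assertion blocks the release until the surfaced dialogue is triaged.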

Limitations & Future Work

  • The study focuses on three specific tasks; broader domain coverage (e.g., code generation, multilingual dialogue) remains to be validated.
  • Query budget constraints: While a few thousand queries are modest, teams testing very large proprietary models with high per‑query costs may still find this prohibitive for exhaustive testing.
  • The online approach relies on feedback signals (e.g., classifier scores) that may be noisy or biased; improving robustness to noisy rewards is an open challenge.
  • Future research could explore human‑in‑the‑loop refinements, richer multi‑modal interactions, and formal guarantees about coverage of the behavior space.

Authors

  • Jing Huang
  • Shujian Zhang
  • Lun Wang
  • Andrew Hard
  • Rajiv Mathews
  • John Lambert

Paper Information

  • arXiv ID: 2512.23701v1
  • Categories: cs.CL, cs.LG
  • Published: December 29, 2025