[Paper] Confidence Estimation for LLMs in Multi-turn Interactions

Published: January 5, 2026 at 09:58 AM EST
4 min read
Source: arXiv - 2601.02179v1

Overview

The paper “Confidence Estimation for LLMs in Multi‑turn Interactions” tackles a problem that most developers hit when building chat‑based AI assistants: how can we know when the model is actually sure about its answer? While prior work has looked at confidence in single‑question settings, this study is the first to systematically explore confidence through an entire conversation, where context builds up and ambiguity should gradually disappear. The authors propose a new evaluation framework, introduce fresh metrics, and test several confidence‑estimation tricks—finding that the problem is far from solved but offering a promising direction for more trustworthy conversational agents.

Key Contributions

  • First formal benchmark for multi‑turn confidence: defines two core desiderata—per‑turn calibration and monotonicity (confidence should rise as more information is gathered).
  • InfoECE metric: a length‑normalized Expected Calibration Error that accounts for varying dialogue lengths, enabling fair comparison across conversations (a rough code sketch of the evaluation metrics follows this list).
  • Hinter‑Guesser paradigm: a controllable data‑generation pipeline that creates synthetic multi‑turn dialogues with known “ground‑truth” confidence, allowing precise evaluation.
  • Comprehensive empirical study: evaluates a suite of existing confidence‑estimation methods (e.g., temperature scaling, Monte‑Carlo dropout, ensemble logits) on multi‑turn tasks, exposing systematic calibration failures.
  • P(Sufficient) probe: a lightweight logit‑based classifier that predicts whether the model has received enough context to answer correctly, achieving the best calibration/monotonicity among tested methods.
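
To make these evaluation quantities concrete, here is a minimal Python sketch of a binned, per‑dialogue calibration error and a monotonicity ratio. This is an illustration, not the paper's exact InfoECE formulation: the equal‑width binning, the function names, and the unconditional treatment of monotonicity (ignoring whether a turn actually adds information) are all assumptions.

```python
import numpy as np

def per_dialogue_ece(confidences, correct, n_bins=10):
    """Binned calibration error over one dialogue's turns.

    A standard binned ECE computed within a single dialogue, so long
    conversations don't dominate a corpus-level average; this is only
    in the spirit of InfoECE, not the paper's exact definition.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_idx = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of turns in the bin
    return ece

def monotonicity_ratio(confidences):
    """Fraction of consecutive turn pairs where confidence does not drop.

    The paper only requires non-decreasing confidence when the new turn
    adds useful information; this toy version checks every pair.
    """
    c = np.asarray(confidences, dtype=float)
    if len(c) < 2:
        return 1.0
    return float(np.mean(c[1:] >= c[:-1]))

# Toy dialogue: confidence should rise as context accumulates.
conf = [0.40, 0.55, 0.70, 0.90]
corr = [0, 1, 1, 1]
print(per_dialogue_ece(conf, corr), monotonicity_ratio(conf))
```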

Methodology

  1. Problem Formalization – The authors model a dialogue as a sequence of turns $(x_1, y_1), (x_2, y_2), \dots$. For each turn $t$ they compute a confidence score $c_t$ and require:

    • Calibration: the predicted confidence should match the empirical correctness frequency.
    • Monotonicity: $c_{t+1} \ge c_t$ when the new turn adds useful information.
  2. Metrics

    • InfoECE: Extends the classic Expected Calibration Error by normalizing over dialogue length, preventing long conversations from dominating the error.
    • Monotonicity Ratio: proportion of turn pairs where confidence correctly increases (or stays flat) as the conversation progresses.
  3. Dataset Construction – Hinter‑Guesser

    • Hinter: Generates a “hint” (partial context) that may be ambiguous.
    • Guesser: Supplies the missing piece that resolves the ambiguity.
      By stitching together many hinter‑guesser pairs, the authors create synthetic multi‑turn QA sets where the true answer is known and the point at which the model should become confident is controllable.
  4. Baseline Confidence Techniques – Temperature scaling, label smoothing, MC‑dropout, deep ensembles, and a logit‑margin probe.

  5. Proposed Probe – P(Sufficient) – Trains a binary classifier on the model’s final‑layer logits to predict whether the current context is sufficient for a correct answer. The probe’s output is interpreted as a confidence score.
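
The probe in step 5 could, for instance, be approximated with a plain logistic‑regression head over the backbone's last‑layer representation. The sketch below uses random placeholder features and labels; the feature choice, the scikit‑learn classifier, and the training setup are assumptions for illustration, not the authors' exact recipe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# features[i]: last-layer representation of dialogue prefix i (e.g. the final
# token's hidden state); labels[i]: 1 if that context was sufficient for a
# correct answer, else 0. In the paper's setting these labels would come from
# controllable data such as the Hinter-Guesser dialogues.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 768))   # placeholder features
labels = rng.integers(0, 2, size=500)    # placeholder sufficiency labels

probe = LogisticRegression(max_iter=1000)
probe.fit(features, labels)

# At inference time the positive-class probability is read off as the
# per-turn confidence score c_t.
c_t = probe.predict_proba(features[:1])[0, 1]
print(f"P(sufficient) = {c_t:.3f}")
```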

All experiments are run on popular LLM backbones (e.g., LLaMA‑7B, GPT‑3.5) using the Hugging Face 🤗 Transformers library, making the pipeline reproducible for developers.
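
For readers who want to see what those raw signals look like, the snippet below pulls the final‑layer hidden state and next‑token probabilities for a dialogue prefix from a small Hugging Face causal LM. The gpt2 checkpoint and the example prompt are stand‑ins chosen for illustration; the paper works with larger backbones.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper evaluates larger backbones
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

dialogue_prefix = "User: I have a fever and a rash.\nAssistant:"
inputs = tokenizer(dialogue_prefix, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Final-layer hidden state of the last token: a natural feature vector
# for a sufficiency probe.
probe_features = outputs.hidden_states[-1][0, -1]   # shape: (hidden_dim,)

# Next-token distribution: the raw material for logit-based confidence
# scores such as the logit-margin or max-probability baselines.
next_token_probs = torch.softmax(outputs.logits[0, -1], dim=-1)
top_prob, top_id = next_token_probs.max(dim=-1)
print(tokenizer.decode(int(top_id)), float(top_prob), probe_features.shape)
```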

Results & Findings

Method                       InfoECE ↓    Monotonicity ↑
Temperature scaling          0.21         0.48
MC‑dropout (10 samples)      0.18         0.52
Deep ensemble (5 models)     0.15         0.57
Logit‑margin probe           0.13         0.61
P(Sufficient) (proposed)     0.09         0.71

  • Calibration Gap: Even the strongest baselines leave a noticeable calibration error (>10 %).
  • Monotonicity Issue: Many methods produce confidence that fluctuates wildly across turns, violating the intuitive “more info → higher confidence” rule.
  • P(Sufficient) Advantage: By directly learning the “sufficiency” signal from logits, the probe improves both calibration and monotonicity, though it still falls short of perfect reliability.
  • Generalization: The probe transfers reasonably well across domains (medical QA, code assistance) but degrades when the dialogue length exceeds the training distribution.

Overall, the study shows that confidence estimation in dialogue is a harder problem than in isolated QA, and that existing tricks from single‑turn settings do not automatically carry over.

Practical Implications

  • Safety‑critical bots (e.g., autonomous agents, medical triage) can use the InfoECE metric to monitor and flag low‑confidence turns, prompting a human fallback or a clarification request.
  • Human‑in‑the‑loop workflows: Developers can surface the P(Sufficient) confidence score in UI components, letting users see when the model is “ready” to act versus when it still needs more context.
  • Dynamic prompting: A system could automatically ask follow‑up clarification questions until the confidence probe crosses a threshold, reducing hallucinations without hard‑coding a fixed number of turns (see the sketch after this list).
  • Model‑agnostic tooling: Since P(Sufficient) works on raw logits, it can be wrapped around any closed‑source LLM that exposes token probabilities (e.g., OpenAI’s API), enabling quick integration into existing pipelines.
  • Evaluation standards: The InfoECE and monotonicity ratio provide new benchmarks for developers to compare confidence‑aware dialogue models, encouraging more robust testing before deployment.
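
To illustrate the dynamic‑prompting pattern, here is a minimal sketch of such a clarification loop. The threshold, the retry limit, and the three callables (ask_model, estimate_confidence, ask_clarifying_question) are hypothetical placeholders for whatever chat client and confidence probe a system actually uses.

```python
CONFIDENCE_THRESHOLD = 0.8   # tune per application and risk tolerance
MAX_CLARIFICATIONS = 3       # avoid looping forever on hopeless queries

def answer_with_clarification(history, ask_model, estimate_confidence,
                              ask_clarifying_question):
    """Keep requesting clarification until the confidence probe is satisfied.

    All three callables are hypothetical stand-ins:
      ask_model(history) -> draft answer string
      estimate_confidence(history) -> float in [0, 1], e.g. a P(Sufficient)-style probe
      ask_clarifying_question(history) -> history with the user's reply appended
    """
    for _ in range(MAX_CLARIFICATIONS):
        confidence = estimate_confidence(history)
        if confidence >= CONFIDENCE_THRESHOLD:
            return ask_model(history)
        history = ask_clarifying_question(history)
    # Still unsure after several rounds: defer rather than risk hallucinating.
    return "I'm not confident enough to answer this; escalating to a human."
```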

Limitations & Future Work

  • Synthetic bias: The Hinter‑Guesser dataset, while controllable, may not capture the full messiness of real‑world conversations (e.g., user typos, off‑topic digressions).
  • Scalability of the probe: Training P(Sufficient) requires access to intermediate logits, which some commercial APIs hide; future work could explore black‑box approximations.
  • Long‑range dependencies: Confidence degrades for dialogues longer than those seen during training; hierarchical or memory‑augmented probes might alleviate this.
  • Beyond binary sufficiency: Extending the probe to predict why confidence is low (e.g., ambiguity, factual uncertainty) could enable more nuanced recovery strategies.

The paper lays a solid foundation for making conversational LLMs not just smarter, but also more self‑aware—an essential step toward trustworthy AI assistants that developers can safely ship.

Authors

  • Caiqi Zhang
  • Ruihan Yang
  • Xiaochen Zhu
  • Chengzu Li
  • Tiancheng Hu
  • Yijiang River Dong
  • Deqing Yang
  • Nigel Collier

Paper Information

  • arXiv ID: 2601.02179v1
  • Categories: cs.CL
  • Published: January 5, 2026