[Paper] Are LLM Decisions Faithful to Verbal Confidence?

Published: January 12, 2026
4 min read
Source: arXiv - 2601.07767v1

Overview

Large Language Models (LLMs) are getting better at talking about how sure they are about an answer, but it’s still an open question whether that verbal confidence actually guides their behavior. The paper “Are LLM Decisions Faithful to Verbal Confidence?” introduces a new evaluation framework, RiskEval, to probe whether LLMs change their “abstain‑or‑answer” strategy when the cost of a mistake varies. The findings reveal a striking mismatch: even when the optimal policy would be to say “I don’t know,” state‑of‑the‑art models keep answering, exposing a gap between confidence reporting and risk‑aware decision making.

Key Contributions

  • RiskEval framework: a systematic benchmark that couples confidence‑rated answers with configurable error penalties, enabling measurement of cost‑aware abstention behavior.
  • Empirical audit of leading LLMs (e.g., GPT‑4, Claude, Llama‑2, Gemini) showing they are not cost‑sensitive: verbal confidence scores do not translate into strategic abstention.
  • Utility collapse demonstration: under high‑penalty regimes, the mathematically optimal policy is to abstain almost always, yet models continue to answer, causing a steep drop in expected utility.
  • Insight into calibration vs. agency: the work distinguishes “calibrated confidence scores” (the model can estimate its own error probability) from “strategic agency” (the ability to act on that estimate).
  • Open‑source implementation: the authors release the RiskEval code and a suite of prompts, making it easy for the community to reproduce and extend the analysis.

Methodology

  1. Task design – The authors pick a set of knowledge‑heavy question‑answer tasks (e.g., factual trivia, commonsense reasoning). Each question is presented to the LLM with a request to output both an answer and a verbal confidence (e.g., “I’m 80 % sure”).
  2. Penalty schema – For every question, a penalty for a wrong answer is sampled from a predefined distribution (low, medium, high). Correct answers receive a fixed reward (e.g., +1), while wrong answers incur the sampled penalty (e.g., –5, –20, –100).
  3. Decision rule – The model can either answer (using its generated answer) or abstain (output “I don’t know”). If it abstains, it receives a neutral payoff (0).
  4. RiskEval metric – The framework computes the expected utility of each model under each penalty regime, comparing the observed abstention rate to the optimal policy derived from the model’s own confidence scores (i.e., with a +1 reward and penalty p, abstain whenever confidence < p / (1 + p), the point at which answering no longer beats the neutral abstention payoff); a minimal sketch of this rule appears below.
  5. Model suite – Experiments run on several closed‑ and open‑source LLMs, with temperature set to 0 (deterministic) and also with higher sampling to test stochastic behavior.

The whole pipeline is fully scripted, allowing developers to plug in any LLM API and instantly see how “risk‑aware” it is.
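
To make the decision rule concrete, here is a minimal sketch assuming the payoff structure described above (+1 for a correct answer, 0 for abstaining, −penalty for an error). The function and variable names are illustrative and are not taken from the released RiskEval code.

```python
# Minimal sketch of the cost-aware decision rule described in the methodology.
# Assumed payoffs: +1 for a correct answer, 0 for abstaining, -penalty for an error.
# Names are illustrative, not from the released RiskEval implementation.

def optimal_threshold(penalty: float, reward: float = 1.0) -> float:
    """Confidence below which abstaining maximizes expected utility."""
    # Answer iff c * reward - (1 - c) * penalty > 0  =>  c > penalty / (reward + penalty)
    return penalty / (reward + penalty)

def expected_utility(confidence: float, penalty: float, abstain: bool,
                     reward: float = 1.0) -> float:
    """Expected payoff of a single decision, given the model's own confidence."""
    if abstain:
        return 0.0
    return confidence * reward - (1.0 - confidence) * penalty

# Example: an 80%-confident answer under a high penalty of 100
c, p = 0.80, 100.0
print(optimal_threshold(p))                    # ~0.990: abstain unless >99% sure
print(expected_utility(c, p, abstain=False))   # -19.2: answering is deeply negative
print(expected_utility(c, p, abstain=True))    #  0.0: abstaining is optimal
```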

Results & Findings

| Model | Verbal confidence calibration (Brier score) | Abstention rate (high penalty) | Expected utility (high penalty) |
|---|---|---|---|
| GPT‑4 | 0.12 (well‑calibrated) | 2 % | –0.78 (utility collapse) |
| Claude 2 | 0.15 | 1 % | –0.71 |
| Llama‑2‑70B | 0.21 (moderately calibrated) | 0 % | –0.85 |
| Gemini Pro | 0.13 | 3 % | –0.73 |

Key takeaways

  • Confidence is calibrated: most models can accurately estimate the probability of being correct (low Brier scores).
  • Abstention is rare: even when the penalty makes abstaining the optimal move, models answer >97 % of the time.
  • Utility collapse: under extreme penalties, the expected utility becomes negative, meaning the model’s behavior would be harmful in a real‑world risk‑sensitive system (the toy calculation after this list shows how fast the numbers turn negative).
  • No strategic adaptation: changing the penalty does not noticeably shift the model’s willingness to say “I don’t know.”
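
To see the scale of the collapse, here is a toy comparison under the high‑penalty regime (penalty 100), using made‑up confidence values rather than the paper’s data: an “always answer” policy loses heavily, while a policy that abstains below the cost‑aware threshold stays at the neutral payoff.

```python
# Illustrative comparison of "always answer" vs. a confidence-thresholded policy.
# The confidences and correctness flags below are made up, not the paper's data.

REWARD, PENALTY = 1.0, 100.0
THRESHOLD = PENALTY / (REWARD + PENALTY)   # abstain below ~0.99 confidence

# (model's verbal confidence, whether the answer was actually correct)
decisions = [(0.95, True), (0.85, True), (0.80, False), (0.70, True), (0.60, False)]

def realized_utility(answered: bool, correct: bool) -> float:
    if not answered:
        return 0.0
    return REWARD if correct else -PENALTY

always_answer = sum(realized_utility(True, ok) for _, ok in decisions) / len(decisions)
thresholded = sum(realized_utility(c >= THRESHOLD, ok) for c, ok in decisions) / len(decisions)

print(f"always answer:     {always_answer:+.1f}")   # -39.4 per question
print(f"cost-aware policy: {thresholded:+.1f}")     # +0.0 (abstains on all five)
```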

Practical Implications

  1. AI safety & compliance – Industries that must guarantee bounded risk (e.g., finance, healthcare, autonomous systems) cannot rely on LLM‑generated confidence scores alone; an external decision layer is needed to enforce abstention when stakes are high.
  2. Prompt engineering – Simple prompts like “If you’re not sure, say ‘I don’t know’” are insufficient. Developers may need to embed hard constraints (e.g., post‑processing filters that compare confidence to a cost‑aware threshold, as in the gate sketched after this list).
  3. Tooling for risk‑aware agents – The open‑source RiskEval can be integrated into CI pipelines to automatically audit new model releases for cost‑sensitivity before deployment.
  4. User‑facing applications – Chatbots that display confidence percentages should also expose a “skip/abstain” option that is governed by a policy aware of the downstream cost of errors (e.g., legal advice, code generation).
  5. Model fine‑tuning – The gap suggests a new fine‑tuning objective: risk‑aware decision making, where the loss function penalizes answering under high‑penalty conditions proportionally to the model’s own confidence.
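
As one way to implement the external decision layer mentioned in points 1 and 2, the sketch below wraps a hypothetical ask_llm call (assumed to return an answer string and a verbal confidence) and forces abstention whenever the confidence falls below the cost‑aware threshold. This is an illustration, not code from the paper or the RiskEval release.

```python
# Hypothetical post-processing gate: override the model's answer with an
# abstention whenever its own verbal confidence does not justify the downside risk.

from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class GatedAnswer:
    text: str
    confidence: float
    abstained: bool

def risk_gate(ask_llm: Callable[[str], Tuple[str, float]],   # assumed: returns (answer, confidence)
              question: str,
              penalty: float,
              reward: float = 1.0) -> GatedAnswer:
    """Call the model, then enforce abstention below the cost-aware threshold."""
    answer, confidence = ask_llm(question)
    threshold = penalty / (reward + penalty)      # same rule as in the methodology
    if confidence < threshold:
        return GatedAnswer("I don't know.", confidence, abstained=True)
    return GatedAnswer(answer, confidence, abstained=False)

# Toy usage with a stubbed model call
stub = lambda q: ("Paris", 0.85)
print(risk_gate(stub, "Capital of France?", penalty=5.0))     # answers (threshold ~0.83)
print(risk_gate(stub, "Capital of France?", penalty=100.0))   # abstains (threshold ~0.99)
```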

Limitations & Future Work

  • Scope of tasks – The benchmark focuses on factual QA; other domains (code synthesis, multi‑modal reasoning) may exhibit different cost‑sensitivity patterns.
  • Penalty modeling – Penalties are simulated as scalar values; real‑world costs can be multi‑dimensional (legal liability, user trust) and may require richer representations.
  • Static prompting – The study does not explore dynamic prompting strategies (e.g., chain‑of‑thought that explicitly reasons about risk).
  • Model size vs. behavior – While several sizes were tested, the relationship between parameter count and strategic abstention remains under‑explored.

Future research directions include: designing risk‑aware training objectives, extending RiskEval to multi‑step decision problems, and building policy layers that translate calibrated confidence into optimal actions in production pipelines.

Authors

  • Jiawei Wang
  • Yanfei Zhou
  • Siddartha Devic
  • Deqing Fu

Paper Information

  • arXiv ID: 2601.07767v1
  • Categories: cs.LG, cs.CL
  • Published: January 12, 2026
  • PDF: Download PDF