[Paper] Agentic Uncertainty Reveals Agentic Overconfidence

Published: February 6, 2026 at 01:49 PM EST
4 min read
Source: arXiv:2602.06948v1

Overview

The paper Agentic Uncertainty Reveals Agentic Overconfidence investigates whether AI agents can accurately gauge their own chances of success on a given task. By probing agents’ self‑estimated success probabilities at different stages—before they start, while they’re working, and after they finish—the authors uncover a systematic tendency for agents to be overly confident, sometimes by a factor of three.

Surprisingly, the coarse, pre‑execution estimates often separate successful from failing runs better than the detailed post‑execution reviews.

Key Contributions

  • Formal definition of agentic uncertainty – a framework for eliciting an agent’s own probability of success at multiple execution points.
  • Empirical evidence of agentic overconfidence across a range of language‑model‑based agents, including cases where actual success rates are as low as 22 % but predicted success exceeds 70 %.
  • Counter‑intuitive finding: pre‑execution confidence scores (with less information) can provide sharper discrimination between successful and failed attempts than post‑execution scores.
  • Adversarial prompting technique that reframes the confidence query as a “bug‑finding” task, yielding the best calibration among tested methods.
  • Comprehensive benchmark covering several standard AI‑agent tasks (code generation, reasoning, planning) and multiple model families (GPT‑3.5, Claude, Llama‑2).

Methodology

  1. Task Suite – The authors selected a diverse set of benchmark tasks (e.g., solving SAT problems, writing Python functions, planning routes). Each task has a clear binary outcome: success or failure.

  2. Confidence Elicitation – For every task instance, the agent is asked three times to output a probability (p \in [0,1]) that it will succeed:

    • Pre‑execution – before seeing any input or performing any computation.
    • Mid‑execution – after generating an intermediate solution (e.g., a draft code snippet).
    • Post‑execution – after producing the final answer and optionally self‑checking.

    The probability is obtained via a prompting template that asks the model to “rate your confidence on a 0‑100 scale.”

  3. Calibration Metrics – The authors compute standard calibration curves, Expected Calibration Error (ECE), and Brier scores to compare predicted probabilities against actual outcomes.

  4. Adversarial Prompting – To improve calibration, they introduce a “bug‑finding” prompt:

    “Assume your answer may contain errors; how likely is it that a hidden bug exists?”

    This forces the model to adopt a more critical stance.

  5. Statistical Analysis – Paired t-tests and bootstrap confidence intervals assess whether differences between pre‑, mid‑, and post‑execution confidence are statistically significant.
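The calibration metrics from step 3 can be sketched in a few lines of plain Python. This is an illustrative implementation of the standard ECE and Brier-score definitions, not the paper's actual code; the equal-width binning scheme and variable names are assumptions:

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # equal-width bins over [0, 1]
        bins[idx].append((p, o))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)   # mean predicted probability
        acc = sum(o for _, o in b) / len(b)    # empirical success rate
        ece += (len(b) / len(probs)) * abs(conf - acc)
    return ece
```

For intuition: an agent that reports 0.9 confidence on ten tasks but succeeds on only three has an ECE of 0.6, which is the kind of gap the paper's overconfidence finding describes.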

Results & Findings

  • Systematic Overconfidence – Across all models, the average predicted success was 0.58 while the true success rate was 0.34, yielding an ECE of 0.21. The most extreme case: a model succeeded only 22 % of the time but reported a 77 % success probability.

  • Pre‑execution Beats Post‑execution – In 7 out of 9 task families, the pre‑execution confidence scores produced higher Area‑Under‑Curve (AUC) values for distinguishing success vs. failure than post‑execution scores (average AUC: 0.71 vs. 0.66). The advantage was modest but consistent.

  • Adversarial Prompting Improves Calibration – The bug‑finding prompt reduced ECE by ~30 % (from 0.21 to 0.15) and lowered the Brier score, indicating tighter alignment between confidence and reality.

  • Model Size Matters, but Not Linearly – Larger models tended to be slightly better calibrated, yet even the biggest (GPT‑4‑level) exhibited noticeable overconfidence.

  • Mid‑execution Scores Were Noisy – Because the agent sees only a partial solution, its confidence fluctuated wildly, offering little predictive power.
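The AUC comparison behind the pre- versus post-execution finding can be reproduced with the Mann-Whitney formulation of ROC AUC: the probability that a randomly chosen successful run received a higher confidence score than a randomly chosen failed one. A minimal sketch, assuming binary outcomes and scalar confidence scores (the function name is illustrative):

```python
def auc(success_scores, failure_scores):
    """ROC AUC via the Mann-Whitney statistic: the fraction of
    (success, failure) pairs where the success was scored higher,
    with ties counting half."""
    wins = 0.0
    for s in success_scores:
        for f in failure_scores:
            if s > f:
                wins += 1.0
            elif s == f:
                wins += 0.5
    return wins / (len(success_scores) * len(failure_scores))
```

An AUC of 0.5 means the scores carry no signal about success; the paper's reported 0.71 (pre-execution) versus 0.66 (post-execution) means both carry signal, with the earlier, cheaper estimate slightly ahead.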

Practical Implications

  • Risk‑Aware Deployment – Developers building autonomous agents (e.g., code assistants, planning bots) should not trust raw confidence scores out‑of‑the‑box. Incorporating calibrated uncertainty estimates can prevent costly failures in production.

  • Safety Nets & Human‑in‑the‑Loop – Systems can trigger human review when the calibrated confidence falls below a safety threshold, or when the adversarial “bug‑finding” confidence spikes.

  • Prompt Engineering for Better Self‑Assessment – Reframing confidence queries as error‑detection tasks is a cheap, model‑agnostic way to obtain more reliable self‑evaluation without extra training.

  • Benchmarking Standards – The paper’s methodology can become a standard test suite for future LLM‑based agents, encouraging the community to report both performance and calibrated uncertainty.

  • Resource Allocation – Since pre‑execution confidence already provides useful discrimination, developers can decide early whether to allocate compute resources (e.g., run a more expensive verification step) based on a cheap confidence check.
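The human-in-the-loop gating described above might look like the following sketch. The thresholds and function names are hypothetical, tuned per deployment, and the routing policy is one simple possibility rather than anything the paper prescribes:

```python
REVIEW_THRESHOLD = 0.5      # assumed operating point for calibrated confidence
BUG_ALERT_THRESHOLD = 0.6   # assumed trigger for the adversarial bug-finding score

def route_task(calibrated_confidence, bug_likelihood):
    """Decide whether an agent's output ships directly or goes to a human.

    `calibrated_confidence` is the agent's recalibrated success estimate;
    `bug_likelihood` is its answer to the adversarial bug-finding prompt.
    """
    if calibrated_confidence < REVIEW_THRESHOLD or bug_likelihood > BUG_ALERT_THRESHOLD:
        return "human_review"
    return "auto_accept"
```

Either signal alone can escalate a task: low calibrated confidence catches tasks the agent doubts, while a spiking bug-finding score catches tasks the agent is confidently wrong about.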

Limitations & Future Work

Limitations

  • Task scope – The study concentrates on relatively short, well‑defined tasks, leaving it unclear how the findings generalize to open‑ended generation (e.g., long‑form writing).
  • Single‑shot prompting – Only a few prompting templates were examined; richer prompting strategies or few‑shot demonstrations could influence calibration.
  • Model diversity – Although several popular LLM families were evaluated, newer multimodal or instruction‑tuned models were not included.
  • Dynamic environments – Real‑time agents that interact with changing environments (e.g., robotics) may exhibit different uncertainty dynamics.

Future Directions

  1. Integrate uncertainty calibration into the training objective.
  2. Extend the framework to multi‑step planning horizons.
  3. Explore ensemble or Bayesian approaches to further reduce overconfidence.

Authors

  • Jean Kaddour
  • Srijan Patel
  • Gbètondji Dovonon
  • Leo Richter
  • Pasquale Minervini
  • Matt J. Kusner

Paper Information

arXiv ID: 2602.06948v1
Categories: cs.AI, cs.LG
Published: February 6, 2026
