[Paper] Agentic Uncertainty Reveals Agentic Overconfidence
Source: arXiv:2602.06948v1
Overview
The paper Agentic Uncertainty Reveals Agentic Overconfidence investigates whether AI agents can accurately gauge their own chances of success on a given task. By probing agents’ self‑estimated success probabilities at different stages—before they start, while they’re working, and after they finish—the authors uncover a systematic tendency for agents to be overly confident, sometimes by a factor of three.
Surprisingly, the coarse, pre‑execution estimates often separate successful from failing runs better than the detailed post‑execution reviews.
Key Contributions
- Formal definition of agentic uncertainty – a framework for eliciting an agent’s own probability of success at multiple execution points.
- Empirical evidence of agentic overconfidence across a range of language‑model‑based agents, including cases where actual success rates are as low as 22 % but predicted success exceeds 70 %.
- Counter‑intuitive finding: pre‑execution confidence scores (with less information) can provide sharper discrimination between successful and failed attempts than post‑execution scores.
- Adversarial prompting technique that reframes the confidence query as a “bug‑finding” task, yielding the best calibration among tested methods.
- Comprehensive benchmark covering several standard AI‑agent tasks (code generation, reasoning, planning) and multiple model families (GPT‑3.5, Claude, Llama‑2).
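The “bug‑finding” reframing listed above can be sketched as a pair of prompt templates plus a reply parser. Only the quoted bug‑finding question comes from the paper; the surrounding template wording and the helper names (`parse_confidence`, `success_prob_from_bug_prompt`) are illustrative assumptions, not the authors' implementation.

```python
import re

# Illustrative templates; only the quoted bug-finding question is from the
# paper, the rest of the wording is an assumption.
STANDARD_PROMPT = (
    "Rate your confidence that you will succeed at the task below "
    "on a 0-100 scale.\n\nTask: {task}"
)
BUG_FINDING_PROMPT = (
    "Assume your answer may contain errors; how likely is it that a "
    "hidden bug exists? Answer on a 0-100 scale.\n\nTask: {task}"
)

def parse_confidence(reply: str) -> float:
    """Extract the first 0-100 number from a model reply, mapped to [0, 1]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no numeric confidence found in: {reply!r}")
    return min(max(float(match.group()), 0.0), 100.0) / 100.0

def success_prob_from_bug_prompt(reply: str) -> float:
    """The bug-finding prompt elicits a *failure* probability, so invert it."""
    return 1.0 - parse_confidence(reply)
```

Note the inversion step: because the adversarial prompt asks about the chance of a hidden bug, its answer must be flipped before comparing against the standard success‑probability elicitation.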
Methodology
- Task Suite – The authors selected a diverse set of benchmark tasks (e.g., solving SAT problems, writing Python functions, planning routes). Each task has a clear binary outcome: success or failure.
- Confidence Elicitation – For every task instance, the agent is asked to output a probability p ∈ [0, 1] that it will succeed, at three points:
  - Pre‑execution – before seeing any input or performing any computation.
  - Mid‑execution – after generating an intermediate solution (e.g., a draft code snippet).
  - Post‑execution – after producing the final answer and optionally self‑checking.
  Each probability is obtained via a prompting template that asks the model to “rate your confidence on a 0–100 scale.”
- Calibration Metrics – The authors compute standard calibration curves, Expected Calibration Error (ECE), and Brier scores to compare predicted probabilities against actual outcomes.
- Adversarial Prompting – To improve calibration, they introduce a “bug‑finding” prompt: “Assume your answer may contain errors; how likely is it that a hidden bug exists?” This reframing forces the model to adopt a more critical stance.
- Statistical Analysis – Paired t‑tests and bootstrap confidence intervals assess whether differences between pre‑, mid‑, and post‑execution confidence are statistically significant.
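The two calibration metrics named above are standard and easy to reproduce. A minimal sketch, assuming equal‑width confidence bins for the ECE (the paper's exact binning scheme may differ):

```python
import numpy as np

def brier_score(confidences, outcomes):
    """Mean squared error between the predicted success probability
    and the binary outcome (1 = success, 0 = failure)."""
    c, y = np.asarray(confidences, float), np.asarray(outcomes, float)
    return float(np.mean((c - y) ** 2))

def expected_calibration_error(confidences, outcomes, n_bins=10):
    """ECE: bin predictions by confidence, then average the per-bin gap
    between mean confidence and empirical success rate, weighted by bin size."""
    c, y = np.asarray(confidences, float), np.asarray(outcomes, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so that confidence 1.0 is counted.
        mask = (c >= lo) & (c <= hi) if hi == 1.0 else (c >= lo) & (c < hi)
        if mask.any():
            ece += mask.mean() * abs(c[mask].mean() - y[mask].mean())
    return float(ece)
```

For example, an agent that always reports 0.8 confidence but succeeds only a quarter of the time gets an ECE of 0.55, mirroring the overconfidence gap the paper reports.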
Results & Findings
- Systematic Overconfidence – Across all models, the average predicted success probability was 0.58 while the true success rate was 0.34, yielding an ECE of 0.21. In the most extreme case, a model succeeded only 22 % of the time but reported a 77 % success probability.
- Pre‑execution Beats Post‑execution – In 7 out of 9 task families, the pre‑execution confidence scores produced higher Area‑Under‑Curve (AUC) values for distinguishing success vs. failure than post‑execution scores (average AUC: 0.71 vs. 0.66). The advantage was modest but consistent.
- Adversarial Prompting Improves Calibration – The bug‑finding prompt reduced ECE by ~30 % (from 0.21 to 0.15) and lowered the Brier score, indicating tighter alignment between confidence and reality.
- Model Size Matters, but Not Linearly – Larger models tended to be slightly better calibrated, yet even the biggest (GPT‑4‑level) exhibited noticeable overconfidence.
- Mid‑execution Scores Were Noisy – Because the agent sees only a partial solution, its confidence fluctuated wildly, offering little predictive power.
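The AUC used for the pre‑ vs. post‑execution comparison measures how well a confidence score ranks successful runs above failed ones. A minimal implementation of that statistic (the pairwise‑ranking form, equivalent to the Mann–Whitney U formulation):

```python
def auc(confidences, outcomes):
    """Probability that a randomly chosen successful run received a higher
    confidence than a randomly chosen failed run (ties count as half)."""
    pos = [c for c, y in zip(confidences, outcomes) if y == 1]
    neg = [c for c, y in zip(confidences, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 means the scores carry no ranking signal, which is roughly the situation described for the noisy mid‑execution estimates; 0.71 vs. 0.66 is the modest pre‑ over post‑execution edge the paper reports.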
Practical Implications
- Risk‑Aware Deployment – Developers building autonomous agents (e.g., code assistants, planning bots) should not trust raw confidence scores out‑of‑the‑box. Incorporating calibrated uncertainty estimates can prevent costly failures in production.
- Safety Nets & Human‑in‑the‑Loop – Systems can trigger human review when the calibrated confidence falls below a safety threshold, or when the adversarial “bug‑finding” confidence spikes.
- Prompt Engineering for Better Self‑Assessment – Reframing confidence queries as error‑detection tasks is a cheap, model‑agnostic way to obtain more reliable self‑evaluation without extra training.
- Benchmarking Standards – The paper’s methodology can become a standard test suite for future LLM‑based agents, encouraging the community to report both performance and calibrated uncertainty.
- Resource Allocation – Since pre‑execution confidence already provides useful discrimination, developers can decide early whether to allocate compute resources (e.g., run a more expensive verification step) based on a cheap confidence check.
Limitations & Future Work
Limitations
- Task scope – The study concentrates on relatively short, well‑defined tasks, leaving it unclear how the findings generalize to open‑ended generation (e.g., long‑form writing).
- Single‑shot prompting – Only a few prompting templates were examined; richer prompting strategies or few‑shot demonstrations could influence calibration.
- Model diversity – Although several popular LLM families were evaluated, newer multimodal or instruction‑tuned models were not included.
- Dynamic environments – Real‑time agents that interact with changing environments (e.g., robotics) may exhibit different uncertainty dynamics.
Future Directions
- Integrate uncertainty calibration into the training objective.
- Extend the framework to multi‑step planning horizons.
- Explore ensemble or Bayesian approaches to further reduce overconfidence.
Authors
- Jean Kaddour
- Srijan Patel
- Gbètondji Dovonon
- Leo Richter
- Pasquale Minervini
- Matt J. Kusner
Paper Information
| Item | Details |
|---|---|
| arXiv ID | 2602.06948v1 |
| Categories | cs.AI, cs.LG |
| Published | February 6, 2026 |