[Paper] Agentic Uncertainty Reveals Agentic Overconfidence
Source: arXiv:2602.06948v1
Overview
The paper Agentic Uncertainty Reveals Agentic Overconfidence investigates whether AI agents can accurately gauge their own chances of success on a given task. By probing agents’ self‑estimated success probabilities at different stages—before they start, while they’re working, and after they finish—the authors uncover a systematic tendency for agents to be overly confident, sometimes by a factor of three.
Surprisingly, the coarse, pre‑execution estimates often separate successful from failing runs better than the detailed post‑execution reviews.
Key Contributions
- Formal definition of agentic uncertainty – a framework for eliciting an agent’s own probability of success at multiple execution points.
- Empirical evidence of agentic overconfidence across a range of language‑model‑based agents, including cases where actual success rates are as low as 22 % but predicted success exceeds 70 %.
- Counter‑intuitive finding: pre‑execution confidence scores (with less information) can provide sharper discrimination between successful and failed attempts than post‑execution scores.
- Adversarial prompting technique that reframes the confidence query as a “bug‑finding” task, yielding the best calibration among tested methods.
- Comprehensive benchmark covering several standard AI‑agent tasks (code generation, reasoning, planning) and multiple model families (GPT‑3.5, Claude, Llama‑2).
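The “bug‑finding” reframing listed above can be sketched as a pair of prompt templates plus a reply parser. Only the quoted bug‑finding question comes from the paper; the surrounding template wording and the helper names (`parse_confidence`, `success_prob_from_bug_prompt`) are illustrative assumptions, not the authors' implementation.

```python
import re

# Illustrative templates; only the quoted bug-finding question is from the
# paper, the rest of the wording is an assumption.
STANDARD_PROMPT = (
    "Rate your confidence that you will succeed at the task below "
    "on a 0-100 scale.\n\nTask: {task}"
)
BUG_FINDING_PROMPT = (
    "Assume your answer may contain errors; how likely is it that a "
    "hidden bug exists? Answer on a 0-100 scale.\n\nTask: {task}"
)

def parse_confidence(reply: str) -> float:
    """Extract the first 0-100 number from a model reply, mapped to [0, 1]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no numeric confidence found in: {reply!r}")
    return min(max(float(match.group()), 0.0), 100.0) / 100.0

def success_prob_from_bug_prompt(reply: str) -> float:
    """The bug-finding prompt elicits a *failure* probability, so invert it."""
    return 1.0 - parse_confidence(reply)
```

Note the inversion step: because the adversarial prompt asks about the chance of a hidden bug, its answer must be flipped before comparing against the standard success‑probability elicitation.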
Methodology
- Task Suite – The authors selected a diverse set of benchmark tasks (e.g., solving SAT problems, writing Python functions, planning routes). Each task has a clear binary outcome: success or failure.
- Confidence Elicitation – For every task instance, the agent is asked to output a probability p ∈ [0, 1] that it will succeed, at three points:
  - Pre‑execution – before seeing any input or performing any computation.
  - Mid‑execution – after generating an intermediate solution (e.g., a draft code snippet).
  - Post‑execution – after producing the final answer and optionally self‑checking.
  Each probability is obtained via a prompting template that asks the model to “rate your confidence on a 0–100 scale.”
- Calibration Metrics – The authors compute standard calibration curves, Expected Calibration Error (ECE), and Brier scores to compare predicted probabilities against actual outcomes.
- Adversarial Prompting – To improve calibration, they introduce a “bug‑finding” prompt: “Assume your answer may contain errors; how likely is it that a hidden bug exists?” This reframing forces the model to adopt a more critical stance.
- Statistical Analysis – Paired t‑tests and bootstrap confidence intervals assess whether differences between pre‑, mid‑, and post‑execution confidence are statistically significant.
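The two calibration metrics named above are standard and easy to reproduce. A minimal sketch, assuming equal‑width confidence bins for the ECE (the paper's exact binning scheme may differ):

```python
import numpy as np

def brier_score(confidences, outcomes):
    """Mean squared error between the predicted success probability
    and the binary outcome (1 = success, 0 = failure)."""
    c, y = np.asarray(confidences, float), np.asarray(outcomes, float)
    return float(np.mean((c - y) ** 2))

def expected_calibration_error(confidences, outcomes, n_bins=10):
    """ECE: bin predictions by confidence, then average the per-bin gap
    between mean confidence and empirical success rate, weighted by bin size."""
    c, y = np.asarray(confidences, float), np.asarray(outcomes, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so that confidence 1.0 is counted.
        mask = (c >= lo) & (c <= hi) if hi == 1.0 else (c >= lo) & (c < hi)
        if mask.any():
            ece += mask.mean() * abs(c[mask].mean() - y[mask].mean())
    return float(ece)
```

For example, an agent that always reports 0.8 confidence but succeeds only a quarter of the time gets an ECE of 0.55, mirroring the overconfidence gap the paper reports.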
Results & Findings
- Systematic Overconfidence – Across all models, the average predicted success probability was 0.58 while the true success rate was 0.34, yielding an ECE of 0.21. In the most extreme case, a model succeeded only 22 % of the time but reported a 77 % success probability.
- Pre‑execution Beats Post‑execution – In 7 out of 9 task families, the pre‑execution confidence scores produced higher Area‑Under‑Curve (AUC) values for distinguishing success vs. failure than post‑execution scores (average AUC: 0.71 vs. 0.66). The advantage was modest but consistent.
- Adversarial Prompting Improves Calibration – The bug‑finding prompt reduced ECE by ~30 % (from 0.21 to 0.15) and lowered the Brier score, indicating tighter alignment between confidence and reality.
- Model Size Matters, but Not Linearly – Larger models tended to be slightly better calibrated, yet even the biggest (GPT‑4‑level) exhibited noticeable overconfidence.
- Mid‑execution Scores Were Noisy – Because the agent sees only a partial solution, its confidence fluctuated wildly, offering little predictive power.
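The AUC used for the pre‑ vs. post‑execution comparison measures how well a confidence score ranks successful runs above failed ones. A minimal implementation of that statistic (the pairwise‑ranking form, equivalent to the Mann–Whitney U formulation):

```python
def auc(confidences, outcomes):
    """Probability that a randomly chosen successful run received a higher
    confidence than a randomly chosen failed run (ties count as half)."""
    pos = [c for c, y in zip(confidences, outcomes) if y == 1]
    neg = [c for c, y in zip(confidences, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 means the scores carry no ranking signal, which is roughly the situation described for the noisy mid‑execution estimates; 0.71 vs. 0.66 is the modest pre‑ over post‑execution edge the paper reports.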
Practical Implications
- Risk‑Aware Deployment – Developers building autonomous agents (e.g., code assistants, planning bots) should not trust raw confidence scores out‑of‑the‑box. Incorporating calibrated uncertainty estimates can prevent costly failures in production.
- Safety Nets & Human‑in‑the‑Loop – Systems can trigger human review when the calibrated confidence falls below a safety threshold, or when the adversarial “bug‑finding” confidence spikes.
- Prompt Engineering for Better Self‑Assessment – Reframing confidence queries as error‑detection tasks is a cheap, model‑agnostic way to obtain more reliable self‑evaluation without extra training.
- Benchmarking Standards – The paper’s methodology can become a standard test suite for future LLM‑based agents, encouraging the community to report both performance and calibrated uncertainty.
- Resource Allocation – Since pre‑execution confidence already provides useful discrimination, developers can decide early whether to allocate compute resources (e.g., run a more expensive verification step) based on a cheap confidence check.
Limitations & Future Work
Limitations
- Task scope – The study concentrates on relatively short, well‑defined tasks, leaving it unclear how the findings generalize to open‑ended generation (e.g., long‑form writing).
- Single‑shot prompting – Only a few prompting templates were examined; richer prompting strategies or few‑shot demonstrations could influence calibration.
- Model diversity – Although several popular LLM families were evaluated, newer multimodal or instruction‑tuned models were not included.
- Dynamic environments – Real‑time agents that interact with changing environments (e.g., robotics) may exhibit different uncertainty dynamics.
Future Directions
- Integrate uncertainty calibration into the training objective.
- Extend the framework to multi‑step planning horizons.
- Explore ensemble or Bayesian approaches to further reduce overconfidence.
Authors
- Jean Kaddour
- Srijan Patel
- Gbètondji Dovonon
- Leo Richter
- Pasquale Minervini
- Matt J. Kusner
Paper Information
| Item | Details |
|---|---|
| arXiv ID | 2602.06948v1 |
| Categories | cs.AI, cs.LG |
| Published | February 6, 2026 |