A better method for identifying overconfident large language models

Published: March 19, 2026 at 12:00 AM EDT
5 min read

Source: MIT News - AI

Overview

Large language models (LLMs) can generate credible but inaccurate responses, so researchers have developed uncertainty quantification methods to check the reliability of predictions. One popular method involves submitting the same prompt multiple times to see if the model generates the same answer.
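
The resampling approach described above can be sketched in a few lines. This is a minimal illustration, not the researchers' implementation; `query_model` is a hypothetical callable standing in for any sampled LLM call (temperature > 0):

```python
from collections import Counter

def self_consistency(query_model, prompt, n_samples=10):
    """Estimate self-confidence by resampling the same prompt.

    `query_model` is a hypothetical callable that returns one sampled
    answer string per call. Returns the modal answer and the fraction
    of samples that agree with it.
    """
    answers = [query_model(prompt) for _ in range(n_samples)]
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    # Near 1.0 = highly self-consistent; near 1/n_samples = inconsistent.
    return top_answer, top_count / n_samples
```

Note that a score near 1.0 only tells you the model keeps giving the same answer, which is exactly the limitation the article goes on to discuss.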

But this method measures self‑confidence, and even the most impressive LLM might be confidently wrong. Overconfidence can mislead users about the accuracy of a prediction, which might result in devastating consequences in high‑stakes settings like health care or finance.

To address this shortcoming, MIT researchers introduced a new method for measuring a different type of uncertainty that more reliably identifies confident but incorrect LLM responses.

What the researchers did

  • Their method compares a target model’s response to responses from a group of similar LLMs.
  • They found that cross‑model disagreement captures this type of uncertainty more accurately than traditional approaches.
  • By combining this with a measure of LLM self‑consistency, they created a total uncertainty (TU) metric.
  • The TU metric was evaluated on 10 realistic tasks (question‑answering, math reasoning, summarization, translation, etc.) and consistently outperformed other measures at identifying unreliable predictions.

“Self‑consistency is being used in a lot of different approaches for uncertainty quantification, but if your estimate of uncertainty only relies on a single model’s outcome, it is not necessarily trustable. We went back to the beginning to understand the limitations of current approaches and used those as a starting point to design a complementary method that can empirically improve the results,” says Kimia Hamidieh, an EECS graduate student at MIT and lead author of the paper on this technique.

Co‑authors:

  • Veronika Thost – Research scientist, MIT‑IBM Watson AI Lab
  • Walter Gerych – Former MIT postdoc, now assistant professor at Worcester Polytechnic Institute
  • Mikhail Yurochkin – Staff research scientist, MIT‑IBM Watson AI Lab
  • Marzyeh Ghassemi – Associate professor, EECS; member of the Institute of Medical Engineering Sciences and the Laboratory for Information and Decision Systems

Understanding overconfidence

Many popular methods for uncertainty quantification involve:

  1. Asking a model for a confidence score
  2. Testing the consistency of its responses to the same prompt

These methods estimate aleatoric uncertainty, which they approximate through how internally confident a model is in its own prediction.

However, LLMs can be confidently wrong. Research shows that epistemic uncertainty – uncertainty about whether we are using the right model – can be a better indicator of true uncertainty when a model is overconfident.

The MIT team estimates epistemic uncertainty by measuring disagreement across a similar group of LLMs.

“If I ask ChatGPT the same question multiple times and it gives me the same answer over and over again, that doesn’t mean the answer is necessarily correct. If I switch to Claude or Gemini and ask them the same question, and I get a different answer, that is going to give me a sense of the epistemic uncertainty,”
Kimia Hamidieh

Epistemic uncertainty attempts to capture how far a target model diverges from the ideal model for a task. Since an ideal model is unattainable, researchers use surrogates that often rely on faulty assumptions. The MIT team needed a more accurate way to estimate epistemic uncertainty.

An ensemble approach

  1. Ensemble construction – Measure the divergence between the target model and a small ensemble of models with similar size and architecture.
  2. Semantic similarity – Compare the meaning of the responses rather than exact wording; this provides a better estimate of epistemic uncertainty.
  3. Model diversity – Choose models trained by different companies to ensure diverse responses and avoid excessive similarity to the target model.
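
The disagreement step in this approach can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's code: `similarity` is an assumed callable returning a semantic-similarity score in [0, 1] (e.g., cosine similarity between sentence embeddings):

```python
def ensemble_disagreement(target_answer, ensemble_answers, similarity):
    """Epistemic-uncertainty proxy: average semantic distance between
    the target model's answer and answers from an ensemble of
    independently trained models.

    `similarity(a, b)` is a hypothetical function scoring semantic
    similarity in [0, 1], so 1 - similarity is a semantic distance.
    """
    if not ensemble_answers:
        raise ValueError("need at least one ensemble answer")
    distances = [1.0 - similarity(target_answer, ans) for ans in ensemble_answers]
    # 0.0 = all models agree semantically; 1.0 = total disagreement.
    return sum(distances) / len(distances)
```

Comparing meanings rather than exact strings matters here: two models phrasing the same fact differently should count as agreement, not disagreement.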

“We found that the easiest way to satisfy all these properties is to take models that are trained by different companies. We tried many different approaches that were more complex, but this very simple approach ended up working best.”
Kimia Hamidieh

Combining uncertainties

  • Aleatoric uncertainty – Standard self‑consistency measure.
  • Epistemic uncertainty – Cross‑model disagreement measure.

The total uncertainty (TU) metric = aleatoric + epistemic.
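
As a rough sketch of how the two terms might be combined (assuming both are normalized to [0, 1]; the paper's exact formulation may differ):

```python
def total_uncertainty(self_consistency_conf, disagreement):
    """Hypothetical TU combination: an aleatoric term from the target
    model's own inconsistency plus an epistemic term from cross-model
    disagreement, both assumed to lie in [0, 1]."""
    aleatoric = 1.0 - self_consistency_conf  # inconsistency of the target model
    epistemic = disagreement                 # divergence from peer models
    return aleatoric + epistemic
```

A confidently wrong answer scores low on the aleatoric term but high on the epistemic term, so it still receives a high TU, which self-consistency alone would miss.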

“Uncertainty depends on the uncertainty of the given prompt as well as how close our model is to the optimal model. This is why summing up these two uncertainty metrics is going to give us the best estimate.”
Kimia Hamidieh

Benefits of TU

  • Flags confidently wrong outputs (hallucinations) that aleatoric uncertainty alone may miss.
  • Enables reinforcement of confidently correct answers during training, potentially improving performance.
  • Often requires fewer queries than calculating aleatoric uncertainty alone → lower computational cost and energy usage.

Experimental results

TU performance vs. individual metrics, by task type:

  • Factual QA (unique correct answer) – Best: high detection of unreliable predictions
  • Open‑ended tasks (e.g., summarization) – Epistemic component less effective, but TU still outperforms single metrics
  • Math reasoning, translation, etc. – Consistently superior to either aleatoric or epistemic alone

The experiments also showed that epistemic uncertainty shines on tasks with a single correct answer, while it may underperform on more open‑ended tasks.

Future directions

  • Adaptation to open‑ended tasks – Refine the epistemic component to better handle multiple plausible answers.
  • Dynamic ensemble weighting – Use credibility scores to weight each model’s contribution to the disagreement measure.
  • Integration with training pipelines – Leverage TU to selectively reinforce correct predictions and suppress hallucinations during fine‑tuning.
They may also build on this work by exploring other forms of aleatoric uncertainty.

This work is funded, in part, by the MIT‑IBM Watson AI Lab.