Teaching AI models to say “I’m not sure”
Source: MIT News - AI
Confidence is persuasive. In artificial intelligence systems, it is often misleading.
Background
Today’s most capable reasoning models share a trait with the loudest voice in the room: they deliver every answer with the same unshakable certainty, whether they’re right or guessing. Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have traced that overconfidence to a specific flaw in how these models are trained.
RLCR Method
The technique, called RLCR (Reinforcement Learning with Calibration Rewards), trains language models to produce calibrated confidence estimates alongside their answers. Along with each answer, the model outputs a confidence score reflecting how certain it is of that answer.
During training, a single term is added to the reward function: a Brier score, a well‑established measure that penalizes the gap between a model’s stated confidence and its actual accuracy. This reward term:
- Penalizes confidently wrong answers.
- Penalizes unnecessarily uncertain correct answers.
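A minimal sketch of such a Brier‑based reward term (the exact weighting between correctness and calibration in the paper may differ; `rlcr_reward` here is an illustrative composition, not the authors’ code):

```python
def brier_term(confidence: float, correct: bool) -> float:
    """Brier-style calibration term: penalizes the squared gap between
    stated confidence and the actual outcome (1.0 if correct, else 0.0)."""
    outcome = 1.0 if correct else 0.0
    return -((confidence - outcome) ** 2)

def rlcr_reward(confidence: float, correct: bool) -> float:
    """Sketch of a combined reward: correctness plus the Brier term."""
    correctness = 1.0 if correct else 0.0
    return correctness + brier_term(confidence, correct)
```

Note how the term cuts both ways: a confidently wrong answer (`brier_term(0.9, False)` ≈ −0.81) is punished far more than a hesitant wrong one (`brier_term(0.1, False)` ≈ −0.01), and a correct answer earns most reward when stated with high confidence.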
The math backs it up: the team formally proved that optimizing this reward structure yields models that are both accurate and well‑calibrated.
Results
The researchers tested RLCR on a 7‑billion‑parameter model across a range of question‑answering and math benchmarks, including six datasets the model had never seen during training.
Key findings:
- Calibration error was reduced by up to 90% while maintaining or improving accuracy.
- Standard RL training was shown to degrade calibration compared to the base model; RLCR reversed this effect.
- RLCR outperformed post‑hoc approaches that train a separate classifier to assign confidence scores after the fact.
- When generating multiple candidate answers, selecting the one with the highest self‑reported confidence—or weighting votes by confidence in a majority‑voting scheme—improved both accuracy and calibration as compute scaled.
- Including the model’s explicit uncertainty reasoning as input to downstream classifiers improved their performance, especially for smaller models.
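The two test‑time strategies mentioned above, picking the single highest‑confidence answer and weighting majority votes by confidence, can be sketched as follows (the `(answer, confidence)` pair format is an assumption for illustration, not the paper’s interface):

```python
from collections import defaultdict

def best_of_n(candidates):
    """Pick the candidate answer with the highest self-reported confidence.
    candidates: list of (answer, confidence) pairs."""
    return max(candidates, key=lambda pair: pair[1])[0]

def confidence_weighted_vote(candidates):
    """Majority vote where each vote is weighted by stated confidence;
    returns the answer carrying the most total confidence mass."""
    mass = defaultdict(float)
    for answer, confidence in candidates:
        mass[answer] += confidence
    return max(mass, key=mass.get)
```

The two selectors can disagree: given `[("41", 0.3), ("41", 0.4), ("42", 0.6)]`, `best_of_n` returns `"42"` (the single most confident sample), while `confidence_weighted_vote` returns `"41"` (0.7 total mass versus 0.6).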
Implications
Overconfidence has serious consequences when AI systems are deployed in high‑stakes domains such as medicine, law, or finance. A model that says “I’m 95% sure” while being correct only half the time can be more dangerous than a model that simply gives a wrong answer, because users lack a signal to seek a second opinion.
By providing calibrated confidence estimates, RLCR gives users a reliable indicator of when to trust the model’s output and when to seek additional verification.
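Whether those confidence estimates are actually reliable can be checked with a calibration metric such as expected calibration error (ECE): bin predictions by stated confidence, then compare each bin’s average confidence to its actual accuracy. A minimal sketch (the paper’s exact metric may differ):

```python
def expected_calibration_error(preds, n_bins=10):
    """preds: list of (confidence, correct) pairs.
    Averages the |accuracy - mean confidence| gap per confidence bin,
    weighted by how many predictions fall in each bin."""
    bins = [[] for _ in range(n_bins)]
    for confidence, correct in preds:
        idx = min(int(confidence * n_bins), n_bins - 1)
        bins[idx].append((confidence, correct))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1.0 for _, ok in b if ok) / len(b)
        ece += (len(b) / len(preds)) * abs(accuracy - avg_conf)
    return ece
```

A well‑calibrated model scores near zero; the overconfident model described above (95% stated confidence, 50% accuracy) would score around 0.45.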
Authors
- Mehul Damani (MIT PhD student, co‑lead author)
- Isha Puri (MIT PhD student, co‑lead author)
- Stewart Slocum
- Idan Shenfeld
- Leshem Choshen
- Jacob Andreas (senior author)
- Yoon Kim (senior author)
The paper is available on arXiv: https://arxiv.org/abs/2507.16806 and will be presented at the International Conference on Learning Representations later this month.