[Paper] Evaluating and Calibrating LLM Confidence on Questions with Multiple Correct Answers
Source: arXiv - 2602.07842v1
Overview
The paper investigates a hidden flaw in current confidence‑calibration techniques for large language models (LLMs). While most prior work assumes every question has a single correct answer, many real‑world queries admit multiple equally valid answers. The authors show that existing training‑free calibration methods systematically underestimate confidence in such cases, which can make LLM‑driven systems unreliable.
Key Contributions
- MACE benchmark – a new, publicly released dataset of ~12 k factual questions across six domains, each explicitly annotated with its full set of correct answers (answer cardinality ranging from one to many).
- Systematic analysis – evaluation of 15 popular calibration techniques on four LLM families (7 B–72 B parameters) reveals a consistent drop in estimated confidence as answer cardinality grows, even though accuracy improves.
- Semantic Confidence Aggregation (SCA) – a simple, training‑free method that samples several high‑probability completions, embeds them, and aggregates their confidence scores to better reflect the true answer space.
- State‑of‑the‑art calibration – SCA outperforms all baselines on multi‑answer questions while matching or exceeding them on single‑answer queries.
- Open‑source release – code, data, and evaluation scripts are made available for reproducibility and further research.
Methodology
- Benchmark construction (MACE)
- Curated 12 k factual Q&A pairs from six knowledge domains (history, science, geography, etc.).
- Each question is labeled with all correct answers, allowing the authors to count how many valid responses exist (answer cardinality).
- Calibration methods evaluated
- Classic temperature scaling, Dirichlet calibration, ensemble variance, and several recent training‑free techniques (e.g., confidence‑based prompting, self‑consistency).
- Applied to four popular LLM families: LLaMA, Falcon, Mistral, and GPT‑style models, ranging from 7 B to 72 B parameters.
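Among the baselines above, temperature scaling is the simplest: logits are divided by a scalar temperature `T` (fit on a validation set) before the softmax, so `T > 1` flattens the distribution and lowers confidence while `T < 1` sharpens it. A minimal sketch with toy logits (the logit values and temperatures here are illustrative, not from the paper):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def temperature_scale(logits, T):
    # Rescale logits by temperature T before normalizing.
    # T > 1 flattens the distribution (lower top confidence);
    # T < 1 sharpens it (higher top confidence).
    return softmax([x / T for x in logits])

probs_raw = temperature_scale([2.0, 1.0, 0.5], T=1.0)
probs_cool = temperature_scale([2.0, 1.0, 0.5], T=2.0)
# The maximum probability shrinks as T grows, which is how
# temperature scaling corrects over-confident models.
```

In practice `T` is chosen to minimize a calibration objective (e.g., negative log-likelihood) on held-out data; the sketch only shows the rescaling step itself.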
- Semantic Confidence Aggregation (SCA)
- Generate k (e.g., 5–10) top‑probability completions via nucleus sampling.
- Encode each completion with a sentence‑level embedding model (e.g., Sentence‑BERT).
- Cluster embeddings to identify semantically distinct answer groups.
- Aggregate the model’s token‑level probabilities within each cluster, then combine clusters using a weighted average that reflects the proportion of high‑confidence answers.
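The four SCA steps above can be sketched end to end. This is a minimal illustration, not the authors' implementation: the embeddings are toy 2‑D vectors standing in for Sentence‑BERT outputs, the greedy cosine‑threshold clustering and the "mass of the largest cluster" aggregation are simple stand‑ins for the paper's (unspecified here) clustering and weighted‑average choices, and the `0.8` threshold is hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_cluster(embeddings, threshold=0.8):
    # Assign each completion to the first cluster whose anchor
    # (first member) is within the cosine threshold; otherwise
    # start a new semantically distinct answer group.
    clusters = []  # each cluster is a list of completion indices
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            if cosine(embeddings[cluster[0]], emb) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def sca_confidence(probs, embeddings, threshold=0.8):
    # Sum per-completion probabilities within each semantic cluster,
    # then report the largest cluster's mass as the aggregated
    # confidence (one simple aggregation choice).
    clusters = greedy_cluster(embeddings, threshold)
    masses = [sum(probs[i] for i in c) for c in clusters]
    return max(masses)

# Toy example: completions 0 and 1 are paraphrases of one answer
# (similar vectors); completion 2 is a distinct valid answer.
embs = [[1.0, 0.0], [0.95, 0.1], [0.0, 1.0]]
probs = [0.4, 0.3, 0.3]
conf = sca_confidence(probs, embs)  # paraphrase cluster mass: 0.7
```

The point of the aggregation is visible even in this toy case: a per‑completion view would report at most 0.4 confidence, while pooling semantically equivalent completions recovers the 0.7 mass the model actually places on that answer.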
- Evaluation metrics
- Expected Calibration Error (ECE) and Maximum Calibration Error (MCE) across bins of predicted confidence.
- Accuracy and Answer Cardinality‑aware F1 to ensure calibration improvements don’t sacrifice correctness.
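ECE, the headline metric above, bins predictions by confidence and takes the weighted average of the gap between each bin's empirical accuracy and mean confidence. A standard equal‑width‑bin implementation (the toy data at the bottom is illustrative, not from the paper):

```python
def expected_calibration_error(confidences, corrects, n_bins=10):
    # Partition predictions into n_bins equal-width confidence bins
    # over (0, 1], then average |accuracy - mean confidence| per bin,
    # weighted by the fraction of samples in that bin.
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(corrects[i] for i in idx) / len(idx)
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - avg_conf)
    return ece

# Perfectly calibrated toy case: 90% stated confidence, 9/10 correct.
confs = [0.9] * 10
hits = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
ece = expected_calibration_error(confs, hits)  # 0.0 for this data
```

MCE is the same construction with a `max` over bins instead of the weighted sum, reporting the single worst‑calibrated bin.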
Results & Findings
| Model (size) | Calibration method | ECE (single‑answer) | ECE (multi‑answer) | Accuracy ↑ with cardinality |
|---|---|---|---|---|
| LLaMA‑13B | Temperature scaling | 4.2 % | 12.8 % | +6 % |
| Falcon‑40B | Self‑consistency | 3.9 % | 13.5 % | +8 % |
| Mistral‑7B | Dirichlet calib. | 5.1 % | 14.2 % | +5 % |
| GPT‑4‑style | SCA (proposed) | 2.8 % | 4.1 % | +7 % |
Key takeaways
- Accuracy rises as more correct answers become available (the model can hit any of them).
- Confidence drops sharply for the same questions, leading to severe under‑confidence and high ECE.
- SCA reduces ECE by ~60 % on multi‑answer questions while keeping the best calibration on single‑answer tasks.
- The improvement holds across model families and scales, indicating the issue is methodological rather than model‑specific.
Practical Implications
- Decision‑making systems (e.g., medical triage bots, legal assistants) that rely on LLM confidence scores can make overly cautious choices when a question admits several valid answers, potentially discarding useful information.
- Search‑and‑retrieval pipelines that filter LLM outputs by confidence will miss relevant answers for ambiguous queries unless calibrated with a method like SCA.
- Human‑in‑the‑loop workflows (code review, content moderation) can benefit from more trustworthy confidence estimates, reducing unnecessary manual checks.
- API providers can integrate SCA as a lightweight post‑processing step, improving the quality‑of‑service metrics (e.g., “confidence‑aware latency”).
- Dataset creation: The MACE benchmark offers a ready‑made testbed for any team building calibrated LLM services, especially those handling open‑ended or multi‑label tasks.
Limitations & Future Work
- Sampling cost – SCA requires generating multiple completions per query, which adds latency and compute overhead; optimizing the number of samples vs. calibration gain is an open question.
- Domain coverage – MACE focuses on factual domains; calibration behavior on creative or opinion‑based tasks remains unexplored.
- Embedding bias – The clustering step depends on the quality of the sentence encoder; biased embeddings could skew confidence aggregation.
- Scalability to extremely large models – Experiments stop at 72 B parameters; it is unclear whether the same trends hold for trillion‑parameter systems.
- Future directions suggested by the authors include: (1) learning a lightweight aggregation network that mimics SCA without explicit sampling, (2) extending the benchmark to multilingual and multimodal settings, and (3) investigating how fine‑tuning or instruction‑tuning interacts with multi‑answer calibration.
Authors
- Yuhan Wang
- Shiyu Ni
- Zhikai Ding
- Zihang Zhan
- Yuanzi Li
- Keping Bi
Paper Information
- arXiv ID: 2602.07842v1
- Categories: cs.CL
- Published: February 8, 2026