[Paper] Evaluating and Calibrating LLM Confidence on Questions with Multiple Correct Answers
Source: arXiv - 2602.07842v1
Overview
The paper investigates a hidden flaw in current confidence‑calibration techniques for large language models (LLMs). While most prior work assumes every question has a single correct answer, many real‑world queries admit multiple equally valid answers. The authors show that existing training‑free calibration methods systematically underestimate confidence in such cases, which can make LLM‑driven systems unreliable.
Key Contributions
- MACE benchmark – a new, publicly released dataset of ~12 k factual questions across six domains, each explicitly annotated with its full set of correct answers (answer cardinality ranging from one to many).
- Systematic analysis – evaluation of 15 popular calibration techniques on four LLM families (7 B–72 B parameters) reveals a consistent drop in estimated confidence as answer cardinality grows, even though accuracy improves.
- Semantic Confidence Aggregation (SCA) – a simple, training‑free method that samples several high‑probability completions, embeds them, and aggregates their confidence scores to better reflect the true answer space.
- State‑of‑the‑art calibration – SCA outperforms all baselines on multi‑answer questions while matching or exceeding them on single‑answer queries.
- Open‑source release – code, data, and evaluation scripts are made available for reproducibility and further research.
Methodology
- Benchmark construction (MACE)
- Curated 12 k factual Q&A pairs from six knowledge domains (history, science, geography, etc.).
- Each question is labeled with all correct answers, allowing the authors to count how many valid responses exist (answer cardinality).
- Calibration methods evaluated
- Classic temperature scaling, Dirichlet calibration, ensemble variance, and several recent training‑free techniques (e.g., confidence‑based prompting, self‑consistency).
- Applied to four popular LLM families: LLaMA, Falcon, Mistral, and GPT‑style models, ranging from 7 B to 72 B parameters.
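Among the baselines above, temperature scaling is the simplest: logits are divided by a scalar temperature `T` (fit on a validation set) before the softmax, so `T > 1` flattens the distribution and lowers confidence while `T < 1` sharpens it. A minimal sketch with toy logits (the logit values and temperatures here are illustrative, not from the paper):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def temperature_scale(logits, T):
    # Rescale logits by temperature T before normalizing.
    # T > 1 flattens the distribution (lower top confidence);
    # T < 1 sharpens it (higher top confidence).
    return softmax([x / T for x in logits])

probs_raw = temperature_scale([2.0, 1.0, 0.5], T=1.0)
probs_cool = temperature_scale([2.0, 1.0, 0.5], T=2.0)
# The maximum probability shrinks as T grows, which is how
# temperature scaling corrects over-confident models.
```

In practice `T` is chosen to minimize a calibration objective (e.g., negative log-likelihood) on held-out data; the sketch only shows the rescaling step itself.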
- Semantic Confidence Aggregation (SCA)
- Generate k (e.g., 5–10) top‑probability completions via nucleus sampling.
- Encode each completion with a sentence‑level embedding model (e.g., Sentence‑BERT).
- Cluster embeddings to identify semantically distinct answer groups.
- Aggregate the model’s token‑level probabilities within each cluster, then combine clusters using a weighted average that reflects the proportion of high‑confidence answers.
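The four SCA steps above can be sketched end to end. This is a minimal illustration, not the authors' implementation: the embeddings are toy 2‑D vectors standing in for Sentence‑BERT outputs, the greedy cosine‑threshold clustering and the "mass of the largest cluster" aggregation are simple stand‑ins for the paper's (unspecified here) clustering and weighted‑average choices, and the `0.8` threshold is hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_cluster(embeddings, threshold=0.8):
    # Assign each completion to the first cluster whose anchor
    # (first member) is within the cosine threshold; otherwise
    # start a new semantically distinct answer group.
    clusters = []  # each cluster is a list of completion indices
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            if cosine(embeddings[cluster[0]], emb) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def sca_confidence(probs, embeddings, threshold=0.8):
    # Sum per-completion probabilities within each semantic cluster,
    # then report the largest cluster's mass as the aggregated
    # confidence (one simple aggregation choice).
    clusters = greedy_cluster(embeddings, threshold)
    masses = [sum(probs[i] for i in c) for c in clusters]
    return max(masses)

# Toy example: completions 0 and 1 are paraphrases of one answer
# (similar vectors); completion 2 is a distinct valid answer.
embs = [[1.0, 0.0], [0.95, 0.1], [0.0, 1.0]]
probs = [0.4, 0.3, 0.3]
conf = sca_confidence(probs, embs)  # paraphrase cluster mass: 0.7
```

The point of the aggregation is visible even in this toy case: a per‑completion view would report at most 0.4 confidence, while pooling semantically equivalent completions recovers the 0.7 mass the model actually places on that answer.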
- Evaluation metrics
- Expected Calibration Error (ECE) and Maximum Calibration Error (MCE) across bins of predicted confidence.
- Accuracy and Answer Cardinality‑aware F1 to ensure calibration improvements don’t sacrifice correctness.
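ECE, the headline metric above, bins predictions by confidence and takes the weighted average of the gap between each bin's empirical accuracy and mean confidence. A standard equal‑width‑bin implementation (the toy data at the bottom is illustrative, not from the paper):

```python
def expected_calibration_error(confidences, corrects, n_bins=10):
    # Partition predictions into n_bins equal-width confidence bins
    # over (0, 1], then average |accuracy - mean confidence| per bin,
    # weighted by the fraction of samples in that bin.
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(corrects[i] for i in idx) / len(idx)
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - avg_conf)
    return ece

# Perfectly calibrated toy case: 90% stated confidence, 9/10 correct.
confs = [0.9] * 10
hits = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
ece = expected_calibration_error(confs, hits)  # 0.0 for this data
```

MCE is the same construction with a `max` over bins instead of the weighted sum, reporting the single worst‑calibrated bin.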
Results & Findings
| Model (size) | Calibration method | ECE (single‑answer) | ECE (multi‑answer) | Accuracy ↑ with cardinality |
|---|---|---|---|---|
| LLaMA‑13B | Temperature scaling | 4.2 % | 12.8 % | +6 % |
| Falcon‑40B | Self‑consistency | 3.9 % | 13.5 % | +8 % |
| Mistral‑7B | Dirichlet calib. | 5.1 % | 14.2 % | +5 % |
| GPT‑4‑style | SCA (proposed) | 2.8 % | 4.1 % | +7 % |
Key takeaways
- Accuracy rises as more correct answers become available (the model can hit any of them).
- Confidence drops sharply for the same questions, leading to severe under‑confidence and high ECE.
- SCA reduces ECE by ~60 % on multi‑answer questions while keeping the best calibration on single‑answer tasks.
- The improvement holds across model families and scales, indicating the issue is methodological rather than model‑specific.
Practical Implications
- Decision‑making systems (e.g., medical triage bots, legal assistants) that rely on LLM confidence scores can make overly cautious choices when a question admits several valid answers, potentially discarding useful information.
- Search‑and‑retrieval pipelines that filter LLM outputs by confidence will miss relevant answers for ambiguous queries unless calibrated with a method like SCA.
- Human‑in‑the‑loop workflows (code review, content moderation) can benefit from more trustworthy confidence estimates, reducing unnecessary manual checks.
- API providers can integrate SCA as a lightweight post‑processing step, improving the quality‑of‑service metrics (e.g., “confidence‑aware latency”).
- Dataset creation: The MACE benchmark offers a ready‑made testbed for any team building calibrated LLM services, especially those handling open‑ended or multi‑label tasks.
Limitations & Future Work
- Sampling cost – SCA requires generating multiple completions per query, which adds latency and compute overhead; optimizing the number of samples vs. calibration gain is an open question.
- Domain coverage – MACE focuses on factual domains; calibration behavior on creative or opinion‑based tasks remains unexplored.
- Embedding bias – The clustering step depends on the quality of the sentence encoder; biased embeddings could skew confidence aggregation.
- Scalability to extremely large models – Experiments stop at 72 B parameters; it is unclear whether the same trends hold for trillion‑parameter systems.
- Future directions suggested by the authors include: (1) learning a lightweight aggregation network that mimics SCA without explicit sampling, (2) extending the benchmark to multilingual and multimodal settings, and (3) investigating how fine‑tuning or instruction‑tuning interacts with multi‑answer calibration.
Authors
- Yuhan Wang
- Shiyu Ni
- Zhikai Ding
- Zihang Zhan
- Yuanzi Li
- Keping Bi
Paper Information
- arXiv ID: 2602.07842v1
- Categories: cs.CL
- Published: February 8, 2026