[Paper] Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?
Source: arXiv - 2605.28778v1
Overview
Large language models (LLMs) often qualify their statements with phrases like “it is likely” or “I’m pretty sure.” For these epistemic markers to be useful, the confidence they convey must line up with the model’s actual uncertainty. This paper asks: can LLMs reliably map specific linguistic markers to internal confidence levels, and how stable is that mapping across tasks and data distributions?
Key Contributions
- Formal definition of Marker Internal Confidence (MIC): a quantitative measure of the confidence a model implicitly assigns to a given epistemic marker in a particular task.
- Seven stability metrics: tools to assess whether MIC values stay consistent within a dataset, across similar datasets, and across different model families.
- Comprehensive empirical sweep: evaluated 8 popular LLMs (including GPT‑3.5, LLaMA‑2, and Claude) on three downstream tasks (question answering, fact verification, and commonsense reasoning) and multiple data splits.
- Evidence of systematic mis‑calibration: even when models are forced to interpret markers model‑centrically (i.e., using the model’s own learned semantics), they fail to differentiate confidence levels reliably.
- Insight into ranking stability: while absolute MIC values drift, the relative ordering of markers (e.g., “probably” > “possibly”) remains roughly preserved across tasks.
Methodology
- Marker‑conditioned prompting: For each task, the authors generated multiple prompts that forced the model to answer using a pre‑specified epistemic marker (e.g., “It is likely that …”).
- Ground‑truth confidence extraction: The true probability of the answer being correct was estimated via Monte‑Carlo sampling (e.g., multiple answer draws, ensemble voting, or external oracle checks).
- MIC computation: For each marker‑prompt pair,
  $$\text{MIC} = \frac{1}{N}\sum_{i=1}^{N} p_i$$
  where $N$ is the number of examples where that marker was used and $p_i$ is the ground‑truth correctness probability of example $i$.
- Stability analysis: The seven metrics capture (a) intra‑distribution variance (same data, different random seeds), (b) inter‑distribution variance (different but related datasets), and (c) cross‑model variance (different LLM architectures).
- Baseline comparison: Randomly assigned markers and a calibrated soft‑max confidence baseline were used to contextualize the results.
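To make the MIC computation concrete, here is a minimal Python sketch of the averaging step. It is not the authors' code. The helper names (`estimate_correctness`, `compute_mic`), the exact-match scoring rule, and the toy answers are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def estimate_correctness(sampled_answers, gold):
    """Monte-Carlo estimate: fraction of sampled answers equal to the gold answer.

    Exact string match is an assumption; a real setup might use a looser scorer.
    """
    return sum(answer == gold for answer in sampled_answers) / len(sampled_answers)


def compute_mic(records):
    """Average the correctness probabilities of all examples that share a marker.

    records: iterable of (marker, correctness_probability) pairs.
    Returns a dict mapping each marker to its mean correctness probability.
    """
    by_marker = defaultdict(list)
    for marker, probability in records:
        by_marker[marker].append(probability)
    return {marker: mean(values) for marker, values in by_marker.items()}


# Toy data (made up): each example has four sampled answers and one gold answer.
records = [
    ("likely", estimate_correctness(["Paris", "Paris", "Lyon", "Paris"], "Paris")),
    ("likely", estimate_correctness(["4", "5", "4", "5"], "4")),
    ("possibly", estimate_correctness(["Mars", "Venus", "Venus", "Venus"], "Mars")),
]
mic = compute_mic(records)
print(mic)
```

A well-calibrated model would show clearly separated MIC values for markers of different strength; the study reports that they cluster instead.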
Results & Findings
| Model | Task | MIC range across markers | Ranking consistency (Kendall τ) |
|---|---|---|---|
| GPT‑3.5 | QA | 0.58 – 0.62 | 0.71 |
| LLaMA‑2‑13B | Fact verification | 0.55 – 0.57 | 0.68 |
| Claude‑2 | Commonsense | 0.60 – 0.63 | 0.73 |
- Miscalibration persists: The absolute MIC values are tightly clustered (≈ 0.55–0.63) regardless of whether the model says “likely” or “possibly,” indicating the model does not internally adjust its confidence.
- Ranking holds: The order of markers (e.g., “certainly” > “likely” > “possibly”) is stable (τ ≈ 0.7) across tasks, suggesting the model has learned a relative hierarchy of markers even if the numeric confidence is off.
- Cross‑distribution drift: When moving from a news‑article QA set to a biomedical QA set, MICs shift by up to ±0.04, showing poor generalization of marker‑confidence mapping.
- Model size matters little: Scaling from 7B to 70B parameters did not substantially improve MIC stability, hinting that the issue is architectural rather than purely data‑driven.
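The ranking-consistency and drift findings above can be illustrated with a short Python sketch. It computes Kendall's τ (the tau-a variant, without tie correction) between two sets of MIC values, plus the largest per-marker shift. The MIC numbers below are invented for illustration and are not taken from the paper, and `max_drift` is a hypothetical helper rather than one of the paper's seven metrics.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs. Tied pairs count as neither."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


def max_drift(mic_a, mic_b):
    """Largest absolute MIC difference over the markers present in both mappings."""
    shared = mic_a.keys() & mic_b.keys()
    return max(abs(mic_a[m] - mic_b[m]) for m in shared)


# Hypothetical MIC values for one model on two datasets.
markers = ["certainly", "likely", "probably", "possibly"]
mic_news = {"certainly": 0.62, "likely": 0.60, "probably": 0.59, "possibly": 0.58}
mic_bio = {"certainly": 0.60, "likely": 0.56, "probably": 0.57, "possibly": 0.55}

tau = kendall_tau([mic_news[m] for m in markers], [mic_bio[m] for m in markers])
drift = max_drift(mic_news, mic_bio)
print(f"tau={tau:.2f}, max drift={drift:.2f}")
```

In this toy case one pair of markers swaps order between datasets, so τ falls below 1 even though the overall hierarchy is mostly preserved.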
Practical Implications
- User‑facing AI assistants: Relying on LLM‑generated confidence phrases (e.g., “I’m pretty sure”) can give a false sense of reliability. Developers should treat these markers as qualitative cues, not quantitative guarantees.
- Risk‑aware pipelines: For high‑stakes applications (medical triage, legal advice), augment LLM outputs with external calibration layers (e.g., temperature scaling, Bayesian post‑hoc estimators) instead of trusting the model’s own linguistic markers.
- Prompt engineering: Explicitly requesting a numeric confidence score alongside the marker yields more actionable information than marker‑only prompts.
- Evaluation dashboards: The seven MIC stability metrics can be integrated into model monitoring tools to flag when a model’s confidence language diverges from expected behavior across data shifts.
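As one example of an external calibration layer, the sketch below fits a temperature for temperature scaling by grid search on a held-out set. It assumes access to the model's raw answer logits, which API-only models may not expose. The function names, the grid, and the toy validation data are illustrative assumptions and do not come from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y]) for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the grid temperature that minimizes validation NLL."""
    grid = grid or [0.5 + 0.1 * k for k in range(46)]  # 0.5, 0.6, ..., 5.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


# Toy validation set (made up): the model always prefers class 0 with a
# 4-logit margin (about 98% confidence) but is right only 6 times out of 8.
val_logits = [[4.0, 0.0] for _ in range(8)]
val_labels = [0, 0, 0, 0, 0, 0, 1, 1]

T = fit_temperature(val_logits, val_labels)
calibrated = softmax([4.0, 0.0], T)[0]
print(f"T={T:.1f}, calibrated confidence={calibrated:.2f}")
```

A fitted temperature above 1 softens overconfident predictions; here the calibrated confidence moves toward the observed 75% accuracy. A grid search keeps the sketch dependency-free; a gradient-based optimizer would be the more common choice in practice.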
Limitations & Future Work
- Scope of markers: The study focused on a curated set of 8 common epistemic markers; rarer or domain‑specific phrasing may behave differently.
- Task diversity: Only three task families were examined; extending to generation‑heavy tasks (e.g., code synthesis) could reveal new patterns.
- Ground‑truth confidence estimation: Monte‑Carlo sampling approximates true correctness probability but may be noisy for low‑resource domains.
- Alignment interventions: Future research should explore training objectives that directly tie marker usage to calibrated confidence scores (e.g., contrastive loss on marker‑conditioned outputs).
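To give a rough sense of the sampling noise mentioned above, assume (this is an assumption, not a detail from the paper) that each correctness probability is estimated from $K$ independent answer draws. The estimate $\hat{p}$ is then a binomial proportion with standard error

$$\text{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{K}}$$

For example, with $p = 0.6$ and $K = 10$, the standard error is $\sqrt{0.24/10} \approx 0.155$ for a single example. Averaging over many examples reduces the noise in the resulting MIC, but small marker‑to‑marker differences can still be hard to resolve when few examples are available.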
Bottom line: While LLMs have learned a consistent hierarchy of confidence‑expressing words, they remain poorly calibrated in mapping those words to actual certainty. Developers building trustworthy AI systems should not assume that “likely” or “possibly” reflects a model’s true confidence and should instead incorporate robust calibration mechanisms.
Authors
- Gabrielle Kaili‑May Liu
- Arman Cohan
Paper Information
- arXiv ID: 2605.28778v1
- Categories: cs.CL
- Published: May 27, 2026