[Paper] Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency
Source: arXiv - 2601.05905v1
Overview
The paper “Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency” uncovers a hidden flaw in today’s large language models (LLMs): even when a model appears perfectly confident on a single prompt, its answer can crumble as soon as the surrounding context changes slightly. By introducing a structural metric called Neighbor‑Consistency Belief (NCB) and a stress‑testing protocol that perturbs the context, the authors show how to detect and mitigate this brittleness, and they propose a simple training tweak—Structure‑Aware Training (SAT)—that makes LLMs noticeably more robust.
Key Contributions
- Neighbor‑Consistency Belief (NCB): a new, model‑agnostic metric that measures how consistently a model’s answer holds across a conceptual neighborhood of semantically related prompts.
- Cognitive Stress‑Testing Protocol: a systematic way to inject mild contextual interference (paraphrases, distractor sentences, irrelevant facts) and observe answer stability.
- Empirical Validation: extensive experiments on several state‑of‑the‑art LLMs (GPT‑3.5, LLaMA‑2, Claude, etc.) demonstrating that high‑NCB examples retain correctness far better under stress.
- Structure‑Aware Training (SAT): a lightweight fine‑tuning recipe that explicitly optimises for context‑invariant belief structures, cutting long‑tail knowledge brittleness by ~30 % without sacrificing overall accuracy.
- Open‑Source Release: code, data, and evaluation scripts are made publicly available, enabling reproducibility and community‑driven extensions.
Methodology
- Define a Conceptual Neighborhood – For any factual query Q, the authors generate a set of neighboring prompts by (a) paraphrasing the question, (b) adding unrelated but plausible sentences, and (c) swapping synonyms or reordering entities.
- Compute Neighbor‑Consistency Belief (NCB) – Run the LLM on each neighbor prompt, collect the answers, and calculate the proportion of responses that agree (exactly or within a tolerance); a minimal code sketch follows this list. High NCB means the model’s belief is stable across the neighborhood.
- Cognitive Stress‑Testing – Systematically increase the “stress level” of the context (e.g., more distractors, higher lexical variance) and track how answer accuracy degrades. This reveals whether point‑wise confidence metrics like Self‑Consistency are misleading.
- Structure‑Aware Training (SAT) – During fine‑tuning, the loss function is augmented with a consistency regulariser that penalises divergent answers across neighbor prompts (a sketch of one such regulariser appears after the pipeline note below). The model therefore learns a belief representation that is invariant to superficial context changes.
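One natural reading of the NCB computation is majority agreement: over N neighbor answers, NCB(Q) = (size of the largest group of agreeing answers) / N. Below is a minimal Python sketch under that reading. The paraphrase and distractor generators are illustrative stand-ins rather than the paper's exact rules, and `ask` is any prompt-to-answer callable wrapping a black-box LLM.

```python
import random
from collections import Counter

def paraphrase(question: str) -> str:
    # Stand-in paraphraser; in practice this could be template- or LLM-based.
    return f"Put differently: {question}"

def add_distractor(question: str) -> str:
    # Prepend an unrelated but plausible sentence (mild contextual interference).
    distractors = [
        "The weather that year was unusually mild.",
        "In other news, olive oil is rich in monounsaturated fats.",
    ]
    return f"{random.choice(distractors)} {question}"

def build_neighborhood(question: str, n: int = 8) -> list:
    """Generate n semantically related variants of a factual query."""
    ops = [paraphrase, add_distractor]
    return [random.choice(ops)(question) for _ in range(n)]

def ncb(ask, question: str, n: int = 8) -> float:
    """Neighbor-Consistency Belief: the fraction of neighbor answers that
    agree with the majority answer (exact match after normalisation)."""
    answers = [ask(p).strip().lower() for p in build_neighborhood(question, n)]
    _, majority_count = Counter(answers).most_common(1)[0]
    return majority_count / len(answers)
```

A score near 1.0 indicates a stable belief; the paper's robust regime corresponds to NCB > 0.9.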
The pipeline is deliberately simple: it works with any black‑box LLM via API calls, needs only a modest amount of additional data (a few hundred neighbor prompts per fact), and can be plugged into existing evaluation suites.
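For the SAT objective itself, the paper states only that a consistency regulariser penalises divergent answers across neighbor prompts. Here is a minimal sketch of one such regulariser, assuming a PyTorch / Hugging Face-style causal LM; the symmetrised-KL form and the weight `lam` are our own illustrative choices, not the paper's exact recipe.

```python
import torch.nn.functional as F

def sat_loss(model, anchor_batch, neighbor_batch, answer_labels, lam=0.5):
    """Cross-entropy on the anchor prompt plus a consistency term that
    penalises divergence between the answer distributions produced for
    the anchor and for a perturbed neighbor of the same query.
    NOTE: lam and the symmetrised KL are illustrative assumptions."""
    # Next-token logits at the answer position (HF-style causal LM output).
    logits_a = model(**anchor_batch).logits[:, -1, :]
    logits_n = model(**neighbor_batch).logits[:, -1, :]
    ce = F.cross_entropy(logits_a, answer_labels)
    log_p = F.log_softmax(logits_a, dim=-1)
    log_q = F.log_softmax(logits_n, dim=-1)
    # Symmetrised KL: large when the two answer distributions diverge.
    consistency = 0.5 * (
        F.kl_div(log_q, log_p, reduction="batchmean", log_target=True)
        + F.kl_div(log_p, log_q, reduction="batchmean", log_target=True)
    )
    return ce + lam * consistency
```

The cross-entropy term preserves task accuracy while the consistency term pushes the model toward the context-invariant belief representation described above.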
Results & Findings
| Model | Accuracy, no stress | Accuracy, high stress | Accuracy, high stress (high‑NCB subset) | Accuracy, high stress (after SAT) |
|---|---|---|---|---|
| GPT‑3.5‑Turbo | 92 % | 68 % | 84 % | 78 % |
| LLaMA‑2‑13B | 88 % | 61 % | 79 % | 73 % |
| Claude‑Instant | 90 % | 65 % | 82 % | 76 % |
- Self‑Consistency can be deceptive: many queries that achieve 100 % self‑consistency drop below 70 % accuracy when a single distractor sentence is added.
- NCB predicts robustness: examples with NCB > 0.9 retain >80 % accuracy even under the harshest stress level, whereas low‑NCB examples fall below 50 %.
- SAT reduces brittleness: across all models, SAT cuts the long‑tail error rate (cases where the answer flips only under stress) by roughly 30 % while keeping overall zero‑shot performance within 1 % of the baseline.
Practical Implications
- Safer AI assistants: Deployments that need factual reliability (e.g., code generation, medical triage, legal drafting) can use NCB as a quick sanity check before presenting an answer to users.
- Dynamic prompting strategies: Developers can automatically generate neighbor prompts at inference time; if NCB falls below a threshold, the system can request clarification, fall back to a retrieval‑augmented pipeline, or flag the response as uncertain (a minimal gating sketch follows this list).
- Model selection & fine‑tuning: NCB offers a more nuanced benchmark than raw accuracy, helping teams choose models that are not just correct but stable under real‑world conversational noise.
- Cost‑effective robustness: SAT requires only a modest amount of additional fine‑tuning data and can be applied to existing checkpoints, making it attractive for companies that cannot afford massive retraining.
- Tooling integration: The released GitHub repo includes a lightweight Python library that plugs into popular LLM wrappers (OpenAI, Hugging Face Transformers), enabling immediate adoption in CI pipelines or A/B tests.
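As a concrete version of the dynamic-prompting strategy above, a minimal gating wrapper might look as follows. It assumes the `ncb` helper from the Methodology sketch is in scope; the 0.9 default threshold mirrors the paper's high-NCB regime but should be tuned per deployment, and `fallback` is a hypothetical stand-in for whatever recovery path (retrieval-augmented pipeline, clarification request) a system provides.

```python
def answer_with_ncb_gate(ask, question: str, fallback, threshold: float = 0.9):
    """Serve the model's answer only when its neighborhood belief is stable.

    Assumes the ncb() helper sketched in the Methodology section.
    Returns (answer, ncb_score, gated) so callers can log or flag outputs.
    """
    score = ncb(ask, question)
    if score >= threshold:
        return ask(question), score, False
    # Low NCB: defer to the caller-supplied fallback (e.g. a
    # retrieval-augmented pipeline) and mark the response as gated.
    return fallback(question), score, True
```

For example, `answer_with_ncb_gate(my_llm, q, fallback=rag_pipeline.answer)` would serve the raw model answer only when its conceptual neighborhood is consistent.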
Limitations & Future Work
- Neighborhood construction is heuristic: The current method relies on rule‑based paraphrasing and distractor insertion, which may miss more subtle context shifts (e.g., cultural idioms, multimodal cues).
- Scalability to very large corpora: Computing NCB for every query in high‑throughput services could add latency; approximate or cached versions need exploration.
- Domain‑specific nuances: The paper focuses mainly on general‑knowledge facts; extending NCB to highly technical domains (e.g., scientific literature, legal statutes) may require domain‑aware neighbor generation.
- Long‑term belief dynamics: The study evaluates static prompts; future work could examine how NCB evolves across multi‑turn dialogues or over time as models are continuously updated.
Overall, the work provides a practical lens for diagnosing “illusion of confidence” in LLMs and offers concrete tools that developers can start using today to make AI systems more trustworthy.
Authors
- Haoming Xu
- Ningyuan Zhao
- Yunzhi Yao
- Weihong Xu
- Hongru Wang
- Xinle Deng
- Shumin Deng
- Jeff Z. Pan
- Huajun Chen
- Ningyu Zhang
Paper Information
- arXiv ID: 2601.05905v1
- Categories: cs.CL, cs.AI, cs.HC, cs.LG, cs.MA
- Published: January 9, 2026