[Paper] Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency

Published: January 9, 2026 at 11:23 AM EST
4 min read

Source: arXiv - 2601.05905v1

Overview

The paper “Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency” uncovers a hidden flaw in today’s large language models (LLMs): even when a model appears perfectly confident on a single prompt, its answer can crumble as soon as the surrounding context changes slightly. By introducing a structural metric called Neighbor‑Consistency Belief (NCB) and a stress‑testing protocol that perturbs the context, the authors show how to detect and mitigate this brittleness, and they propose a simple training tweak—Structure‑Aware Training (SAT)—that makes LLMs noticeably more robust.

Key Contributions

  • Neighbor‑Consistency Belief (NCB): a new, model‑agnostic metric that measures how consistently a model’s answer holds across a conceptual neighborhood of semantically related prompts.
  • Cognitive Stress‑Testing Protocol: a systematic way to inject mild contextual interference (paraphrases, distractor sentences, irrelevant facts) and observe answer stability.
  • Empirical Validation: extensive experiments on several state‑of‑the‑art LLMs (GPT‑3.5, LLaMA‑2, Claude, etc.) demonstrating that high‑NCB examples retain correctness far better under stress.
  • Structure‑Aware Training (SAT): a lightweight fine‑tuning recipe that explicitly optimises for context‑invariant belief structures, cutting long‑tail knowledge brittleness by ~30 % without sacrificing overall accuracy.
  • Open‑Source Release: code, data, and evaluation scripts are made publicly available, enabling reproducibility and community‑driven extensions.

Methodology

  1. Define a Conceptual Neighborhood – For any factual query Q, the authors generate a set of neighboring prompts by (a) paraphrasing the question, (b) adding unrelated but plausible sentences, and (c) swapping synonyms or ordering of entities.
  2. Compute Neighbor‑Consistency Belief (NCB) – Run the LLM on each neighbor prompt, collect the answers, and calculate the proportion of responses that agree (exactly or within a tolerance). High NCB means the model’s belief is stable across the neighborhood; a minimal sketch of this computation follows the list.
  3. Cognitive Stress‑Testing – Systematically increase the “stress level” of the context (e.g., more distractors, higher lexical variance) and track how answer accuracy degrades. This reveals whether point‑wise confidence metrics like Self‑Consistency are misleading.
  4. Structure‑Aware Training (SAT) – During fine‑tuning, the loss function is augmented with a consistency regulariser that penalises divergent answers across neighbor prompts. The model therefore learns a belief representation that is invariant to superficial context changes (a sketch of such a regulariser appears further below).
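
To make steps 1–2 concrete, here is a minimal, hypothetical sketch of the NCB computation. The toy neighborhood templates, the exact‑match normalisation, and the `ask` callable (any black‑box LLM wrapper) are illustrative assumptions rather than the paper’s actual generation rules or agreement tolerance.

```python
# Hypothetical sketch of Neighbor-Consistency Belief (NCB) for a black-box LLM.
from collections import Counter
from typing import Callable, List


def build_neighborhood(question: str) -> List[str]:
    """Toy conceptual neighborhood: paraphrase templates plus distractor contexts."""
    paraphrases = [
        question,
        f"Could you tell me: {question}",
        f"In one short answer: {question}",
    ]
    distractors = [
        f"Unrelated note: the weather was mild that year. {question}",
        f"{question} (Ignore any other events from around that time.)",
    ]
    return paraphrases + distractors


def ncb(question: str, ask: Callable[[str], str]) -> float:
    """Fraction of neighbor prompts whose normalized answers match the majority answer."""
    answers = [ask(p).strip().lower() for p in build_neighborhood(question)]
    majority_count = Counter(answers).most_common(1)[0][1]
    return majority_count / len(answers)


# Example with a stubbed model: an always-consistent "model" yields NCB = 1.0.
# print(ncb("Who wrote 'On the Origin of Species'?", ask=lambda p: "Charles Darwin"))
```

High NCB (close to 1.0) corresponds to a stable belief across the neighborhood; in practice the agreement check would use the paper’s tolerance rather than exact string matching.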

The pipeline is deliberately simple: it works with any black‑box LLM via API calls, needs only a modest amount of additional data (a few hundred neighbor prompts per fact), and can be plugged into existing evaluation suites.
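
Step 4’s consistency regulariser could look roughly like the sketch below. It assumes a Hugging Face‑style causal LM, uses next‑token logits as a stand‑in for the answer distribution, and averages a KL penalty over neighbors; the weight `lam` and these modelling choices are assumptions, not the paper’s exact recipe.

```python
# Hedged sketch of a SAT-style consistency-regularised loss (assumptions noted above).
import torch
import torch.nn.functional as F


def sat_loss(model, tokenizer, query, neighbors, target_ids, lam=0.5):
    """Cross-entropy on the original query plus a KL consistency penalty over neighbors."""

    def answer_logits(prompt):
        inputs = tokenizer(prompt, return_tensors="pt")
        return model(**inputs).logits[:, -1, :]  # next-token logits as an answer proxy

    base_logits = answer_logits(query)             # shape: [1, vocab_size]
    ce = F.cross_entropy(base_logits, target_ids)  # standard answer loss on the query

    base_logp = F.log_softmax(base_logits, dim=-1)
    consistency = sum(
        F.kl_div(F.log_softmax(answer_logits(n), dim=-1),  # neighbor answer distribution
                 base_logp, log_target=True, reduction="batchmean")
        for n in neighbors
    ) / len(neighbors)

    return ce + lam * consistency  # penalise answers that diverge across the neighborhood


# Usage (hypothetical): loss = sat_loss(model, tok, q, build_neighborhood(q),
#                                       target_ids=torch.tensor([answer_token_id]))
```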

Results & Findings

| Model | Baseline Accuracy (no stress) | Accuracy under high stress | NCB‑High Subset Accuracy (stress) | SAT‑Improved Accuracy (stress) |
| --- | --- | --- | --- | --- |
| GPT‑3.5‑Turbo | 92 % | 68 % | 84 % | 78 % |
| LLaMA‑2‑13B | 88 % | 61 % | 79 % | 73 % |
| Claude‑Instant | 90 % | 65 % | 82 % | 76 % |

  • Self‑Consistency can be deceptive: many queries that achieve 100 % self‑consistency drop below 70 % when a single distractor sentence is added.
  • NCB predicts robustness: examples with NCB > 0.9 retain >80 % accuracy even under the harshest stress level, whereas low‑NCB examples fall below 50 %.
  • SAT reduces brittleness: across all models, SAT cuts the long‑tail error rate (cases where the answer flips only under stress) by roughly 30 % while keeping overall zero‑shot performance within 1 % of the baseline.

Practical Implications

  • Safer AI assistants: Deployments that need factual reliability (e.g., code generation, medical triage, legal drafting) can use NCB as a quick sanity check before presenting an answer to users.
  • Dynamic prompting strategies: Developers can automatically generate neighbor prompts at inference time; if NCB falls below a threshold, the system can request clarification, fall back to a retrieval‑augmented pipeline, or flag the response as uncertain (see the sketch after this list).
  • Model selection & fine‑tuning: NCB offers a more nuanced benchmark than raw accuracy, helping teams choose models that are not just correct but stable under real‑world conversational noise.
  • Cost‑effective robustness: SAT requires only a modest amount of additional fine‑tuning data and can be applied to existing checkpoints, making it attractive for companies that cannot afford massive retraining.
  • Tooling integration: The released GitHub repo includes a lightweight Python library that plugs into popular LLM wrappers (OpenAI, Hugging Face Transformers), enabling immediate adoption in CI pipelines or A/B tests.
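
As an illustration of the dynamic‑prompting idea above, a hypothetical inference‑time guard might gate answers on NCB and route low‑scoring queries to a retrieval fallback. The threshold, the `retrieve_and_answer` helper, and the return format are invented for this example; it reuses the `ncb` sketch from the Methodology section.

```python
# Hypothetical NCB-gated answering: answer directly only when the belief is stable.
NCB_THRESHOLD = 0.9  # illustrative cut-off, matching the high-NCB regime reported above


def answer_with_ncb_guard(question, ask, retrieve_and_answer):
    score = ncb(question, ask)  # consistency across neighbor prompts
    if score >= NCB_THRESHOLD:
        return {"answer": ask(question), "ncb": score}
    # Low NCB: fall back to a retrieval-augmented pipeline and flag the uncertainty.
    return {"answer": retrieve_and_answer(question), "ncb": score, "flag": "low-NCB"}
```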

Limitations & Future Work

  • Neighborhood construction is heuristic: The current method relies on rule‑based paraphrasing and distractor insertion, which may miss more subtle context shifts (e.g., cultural idioms, multimodal cues).
  • Scalability to very large corpora: Computing NCB for every query in high‑throughput services could add latency; approximate or cached versions need exploration.
  • Domain‑specific nuances: The paper focuses mainly on general‑knowledge facts; extending NCB to highly technical domains (e.g., scientific literature, legal statutes) may require domain‑aware neighbor generation.
  • Long‑term belief dynamics: The study evaluates static prompts; future work could examine how NCB evolves across multi‑turn dialogues or over time as models are continuously updated.

Overall, the work provides a practical lens for diagnosing “illusion of confidence” in LLMs and offers concrete tools that developers can start using today to make AI systems more trustworthy.

Authors

  • Haoming Xu
  • Ningyuan Zhao
  • Yunzhi Yao
  • Weihong Xu
  • Hongru Wang
  • Xinle Deng
  • Shumin Deng
  • Jeff Z. Pan
  • Huajun Chen
  • Ningyu Zhang

Paper Information

  • arXiv ID: 2601.05905v1
  • Categories: cs.CL, cs.AI, cs.HC, cs.LG, cs.MA
  • Published: January 9, 2026
  • PDF: Download PDF