[Paper] Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments
Source: arXiv - 2605.03971v1
Overview
Large Language Models (LLMs) can generate impressively fluent text, but they sometimes “hallucinate” – i.e., produce statements that sound plausible yet are factually wrong. Detecting these hallucinations is crucial for any production‑grade AI system. The new LaaB framework (Logical Consistency‑as‑a‑Bridge) shows how to combine two complementary signals – the model’s internal uncertainty and its own self‑judgment – into a single, more reliable detector.
Key Contributions
- Dual‑view detection: Introduces a unified architecture that jointly leverages neural uncertainty features and symbolic self‑judgments (e.g., “Is this answer correct?”).
- Meta‑judgment mapping: Proposes a “meta‑judgment” step that translates the symbolic label back into the feature space, creating a logical bridge between the two views.
- Mutual learning scheme: Implements a bidirectional consistency loss that forces the response side and the meta‑judgment side to agree (or intentionally disagree) based on the semantics of the self‑judgment.
- Broad empirical validation: Evaluates LaaB on four public hallucination benchmarks with four different LLM backbones and compares it against eight strong baselines, which it consistently outperforms.
- Model‑agnostic design: The framework does not depend on a particular LLM architecture, making it straightforward to plug into existing LLM pipelines.
Methodology
- Generate response & self‑judgment – For a given query, the LLM first produces an answer, then is prompted to evaluate its own answer (e.g., “Is the answer correct? Yes/No”).
- Extract neural features – The hidden states from the answer generation step are fed into a lightweight classifier that predicts hallucination probability (the response‑view).
- Create meta‑judgment – The self‑judgment label (“Yes”/“No”) is encoded as a symbolic token and passed through a small embedding layer to produce a meta‑judgment feature vector that lives in the same space as the neural features.
- Logical bridge & consistency loss – Because a “Yes” self‑judgment should align with a non‑hallucinated answer, LaaB enforces that the response‑view and meta‑judgment vectors are either identical (for “Yes”) or opposite (for “No”). This is realized with a contrastive loss that pulls matching pairs together and pushes mismatched pairs apart.
- Joint training – The response classifier and the meta‑judgment encoder are trained together, allowing each to improve the other via the consistency signal.
- Inference – At test time, the final hallucination score is a weighted blend of the response‑view probability and the meta‑judgment consistency score.
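The consistency constraint and the inference‑time blend described in the steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the cosine‑based form of the loss, the function names (`cosine`, `consistency_loss`, `blended_score`), and the weight `alpha` are all hypothetical, and plain lists stand in for learned feature vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consistency_loss(response_vec, meta_vec, self_judgment):
    """Assumed loss form: 0 when the two views are aligned for "Yes"
    or exactly opposed for "No"; grows up to 2 otherwise."""
    sign = 1.0 if self_judgment == "Yes" else -1.0
    return 1.0 - sign * cosine(response_vec, meta_vec)


def blended_score(p_response, p_meta, alpha=0.5):
    """Weighted blend of two hallucination scores in [0, 1].

    alpha is a hypothetical mixing weight; the summary does not state
    how the weights are chosen.
    """
    return alpha * p_response + (1.0 - alpha) * p_meta


if __name__ == "__main__":
    print(consistency_loss([1.0, 0.0], [1.0, 0.0], "Yes"))
    print(consistency_loss([1.0, 0.0], [1.0, 0.0], "No"))
    print(blended_score(0.8, 0.4, alpha=0.5))
```

With this form, aligned vectors under a "Yes" judgment give a loss of 0, while the same vectors under a "No" judgment give the maximum loss of 2, which matches the "identical or opposite" rule stated above.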
Results & Findings
| Dataset / Model | Baseline Avg. F1 | LaaB F1 (↑) |
|---|---|---|
| TruthfulQA (GPT‑3.5) | 71.2 | 78.9 (+7.7) |
| HaluEval (LLaMA‑2) | 68.5 | 76.3 (+7.8) |
| WikiFact (Claude) | 73.0 | 80.5 (+7.5) |
| OpenFact (Mistral) | 69.8 | 77.1 (+7.3) |
- LaaB consistently beats the best single‑view detectors (uncertainty‑only or self‑judgment‑only) by 5–9 absolute F1 points.
- Ablation studies show that removing the meta‑judgment bridge drops performance by about 6%, confirming its central role.
- The mutual learning loss improves calibration: predicted probabilities align more closely with actual hallucination rates, reducing over‑confident false positives.
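To make the calibration claim concrete, the sketch below computes a binned calibration error in the style of expected calibration error (ECE): predictions are grouped into equal‑width bins, and each bin's mean predicted hallucination probability is compared with the observed hallucination rate. The summary does not say which calibration metric the paper reports, so the choice of metric, the function name, and the default of 10 bins are assumptions.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned calibration error for binary hallucination scores.

    probs:  predicted hallucination probabilities in [0, 1]
    labels: 1 if the response was actually a hallucination, else 0
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue  # empty bins contribute nothing
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        observed_rate = sum(y for _, y in bucket) / len(bucket)
        # Weight each bin's gap by the share of samples it holds.
        ece += (len(bucket) / total) * abs(mean_prob - observed_rate)
    return ece
```

A lower value means the predicted probabilities track the observed hallucination rates more closely. A detector that assigns probability 1.0 to responses that are all correct would score the worst possible value of 1.0.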
Practical Implications
- Safer AI assistants: Integrating LaaB into chatbots or code‑generation tools can flag dubious answers before they reach users, enabling fallback strategies (e.g., ask for clarification or cite sources).
- Content moderation pipelines: Automated fact‑checking services can use LaaB as a pre‑filter to prioritize human review of high‑risk outputs.
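- Model‑agnostic deployment: Because LaaB only requires a short self‑judgment prompt and a lightweight classifier, it can be added on top of an existing LLM without retraining the base model. The response‑view classifier does need the hidden states from answer generation, so deployment assumes those are accessible, which hosted APIs may not allow.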
- Reduced hallucination cost: Early detection means fewer expensive post‑hoc verification steps (like external knowledge retrieval), saving compute and latency in production systems.
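The fallback and pre‑filter ideas above amount to routing each response by its detector score. A minimal sketch, assuming the detector emits a hallucination score in [0, 1]: the function name `route_response`, the action labels, and both threshold values are hypothetical and would need tuning on held‑out data.

```python
def route_response(hallucination_score, review_threshold=0.5, block_threshold=0.8):
    """Map a detector score in [0, 1] to an action.

    Both thresholds are illustrative placeholders, not values from the paper.
    """
    if not 0.0 <= hallucination_score <= 1.0:
        raise ValueError("hallucination_score must be in [0, 1]")
    if hallucination_score >= block_threshold:
        return "fallback"      # e.g. ask for clarification or cite sources
    if hallucination_score >= review_threshold:
        return "human_review"  # queue for prioritized human review
    return "deliver"           # low risk: pass the answer through unchanged
```

Keeping two thresholds separates the expensive path (fallback or external verification) from the cheaper one (queueing for review), so only the highest‑risk outputs pay the extra latency.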
Limitations & Future Work
- Dependence on self‑judgment quality: If the LLM’s self‑assessment is itself unreliable (e.g., on highly specialized domains), the bridge may propagate errors.
- Prompt sensitivity: The phrasing of the self‑judgment prompt can affect the label distribution; more robust prompt engineering is needed.
- Scalability to multi‑turn dialogs: Current experiments focus on single‑turn Q&A; extending LaaB to maintain logical consistency across conversational histories remains open.
- Broader symbolic signals: Future work could incorporate additional symbolic cues (e.g., citation checks, logical entailment) to further enrich the bridge.
Authors
- Hao Mi
- Qiang Sheng
- Shaofei Wang
- Beizhe Hu
- Yifan Sun
- Zhengjia Wang
- Hengqi Zeng
- Yang Li
- Danding Wang
- Juan Cao
Paper Information
- arXiv ID: 2605.03971v1
- Categories: cs.CL
- Published: May 5, 2026