[Paper] The First Token Knows: Single-Decode Confidence for Hallucination Detection

Published: May 6, 2026 at 01:34 PM EDT

Source: arXiv - 2605.05166v1

Overview

Detecting hallucinations (cases where a language model fabricates facts) usually relies on generating many answer candidates and checking how much they agree. Mina Gabriel’s paper shows that the first content‑bearing token of a single greedy decode gives a comparable, and on average slightly better, confidence signal, so unreliable answers can be flagged at a fraction of the inference cost.

Key Contributions

First‑token confidence metric (ϕ₁ₙₜ): Derived from the normalized entropy of the probability distribution over the top‑K logits at the first meaningful token, requiring only one forward pass.
Empirical validation: Across three 7‑8 B instruction‑tuned models and two short‑answer QA benchmarks, ϕ₁ₙₜ attains a mean AUROC of 0.820, surpassing both semantic self‑consistency (0.793) and surface‑form self‑consistency (0.791).
Correlation analysis: Shows moderate‑to‑strong correlation between ϕ₁ₙₜ and multi‑sample semantic agreement, indicating that the first‑token distribution already captures much of the uncertainty.
Baseline recommendation: Proposes reporting ϕ₁ₙₜ as a low‑cost baseline before resorting to expensive sampling‑based uncertainty estimators.

Methodology

Single greedy decode: For each question, the model generates an answer using greedy decoding (no sampling).
Identify first content token: Skip any leading punctuation or stop‑words; the first token that carries semantic weight is selected.
Compute confidence:
- Extract the logits for the top‑K candidate tokens at that position.
- Normalize them into a probability distribution.
- Calculate entropy; lower entropy (i.e., a peaked distribution) yields higher confidence.
- Normalize entropy to obtain ϕ₁ₙₜ ∈ [0,1].
Evaluation: Compare ϕ₁ₙₜ against ground‑truth hallucination labels using AUROC. Baselines include:
- Surface‑form self‑consistency: Agreement among multiple sampled answers measured by exact string overlap.
- Semantic self‑consistency: Agreement measured after clustering answers via a natural‑language‑inference model.
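
The confidence computation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the stop-word list, the function names `first_content_index` and `first_token_confidence`, the default K of 10, and the choice of 1 − H / log K as the final score are all assumptions made for this example.

```python
import math

# Hypothetical filter list; the paper's exact rule for "content" tokens is not given here.
STOP_WORDS = {"the", "a", "an", "of", "is", "it"}

def first_content_index(tokens):
    """Index of the first token that is neither punctuation/whitespace nor a stop-word."""
    for i, tok in enumerate(tokens):
        word = tok.strip().lower()
        if word and any(ch.isalnum() for ch in word) and word not in STOP_WORDS:
            return i
    return None  # no content token found

def first_token_confidence(logits, k=10):
    """Assumed score: 1 minus the normalized entropy of the softmax over the top-k logits."""
    top = sorted(logits, reverse=True)[:k]
    if len(top) < 2:
        return 1.0  # a single candidate carries no uncertainty
    m = max(top)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in top]
    z = sum(exps)
    probs = [e / z for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Dividing by log(k) maps the entropy into [0, 1]
    return 1.0 - entropy / math.log(len(top))
```

In use, one would take the decoded tokens of the greedy answer, call `first_content_index` to locate the position, and pass the vocabulary logits from that decoding step to `first_token_confidence`. A peaked distribution gives a score near 1, a flat one a score near 0.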

Results & Findings

Metric	AUROC (mean)
ϕ₁ₙₜ (first‑token confidence)	0.820
Semantic self‑consistency	0.793
Surface‑form self‑consistency	0.791
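
For readers unfamiliar with the metric, AUROC here can be read as the probability that a randomly chosen correct answer receives a higher confidence score than a randomly chosen hallucinated one. The sketch below computes it by direct pairwise comparison. The function name `auroc` and the label convention (1 = correct, 0 = hallucinated) are assumptions for illustration, and the example data in the note that follows is made up, not taken from the paper.

```python
def auroc(scores, labels):
    """AUROC as P(score_pos > score_neg), counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

On the toy inputs `scores = [0.9, 0.4, 0.7, 0.2]` and `labels = [1, 1, 0, 0]`, three of the four positive/negative pairs are ordered correctly, so `auroc` returns 0.75. The quadratic loop is fine for small evaluation sets; a rank-based formulation would be preferable for large ones.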

Cost advantage: ϕ₁ₙₜ requires a single forward pass, whereas the self‑consistency baselines need 10‑30 sampled decodes plus an NLI model for the semantic version.
Signal overlap: A subsumption test reveals that most cases flagged by semantic agreement are already captured by ϕ₁ₙₜ; combining both yields only a marginal AUROC bump (~0.02).
Robustness: The advantage holds across the evaluated 7‑8 B models and both benchmark datasets, suggesting the finding is not specific to one model or dataset. The narrow size range means this says little about scaling.
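
To make the cost comparison concrete, here is a rough sketch of a surface‑form self‑consistency score. Every answer in `sampled_answers` stands for one full sampled decode, which is where the extra cost comes from. The normalization rule and the use of the majority share as the agreement measure are assumptions; the paper's exact definition of string overlap is not reproduced here.

```python
from collections import Counter

def normalize(answer):
    """Lowercase and strip surrounding whitespace and punctuation for exact matching."""
    return answer.strip().strip(".,!?\"'").lower()

def surface_form_consistency(sampled_answers):
    """Fraction of sampled answers that match the most common normalized string."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(a) for a in sampled_answers)
    return counts.most_common(1)[0][1] / len(sampled_answers)
```

For example, the four samples `["Paris", "paris.", "Lyon", " Paris "]` give a score of 0.75, since three of them normalize to the same string.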

Practical Implications

Fast hallucination screening: Deploy ϕ₁ₙₜ as a lightweight “confidence check” before returning an answer in production APIs, saving compute and latency.
Resource‑constrained environments: Edge devices or low‑budget inference servers can still obtain uncertainty estimates without the overhead of sampling or auxiliary NLI models.
Pipeline simplification: Teams can replace multi‑sample consistency modules with a single‑pass confidence score, reducing engineering complexity and maintenance.
Hybrid systems: For high‑stakes queries (e.g., medical or legal), combine ϕ₁ₙₜ with a fallback sampling‑based check only when the first‑token confidence falls below a threshold, achieving a good trade‑off between speed and safety.
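
The hybrid idea can be sketched as a simple gate. This is only an illustration: `screen_answer`, `run_sampling_check`, and both threshold values are hypothetical and would need tuning on held-out data for any real deployment.

```python
def screen_answer(first_token_conf, run_sampling_check,
                  conf_threshold=0.7, agree_threshold=0.6):
    """Return 'accept' or 'flag'; the expensive check runs only when confidence is low.

    first_token_conf   -- first-token confidence score in [0, 1]
    run_sampling_check -- zero-argument callable returning an agreement score in [0, 1]
    """
    if first_token_conf >= conf_threshold:
        return "accept"  # cheap path: no sampling needed
    agreement = run_sampling_check()  # costly path: multiple sampled decodes
    return "accept" if agreement >= agree_threshold else "flag"
```

Passing the fallback as a callable means the sampling cost is paid only for low-confidence answers, which is the trade-off between speed and safety that the hybrid setup is after.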

Limitations & Future Work

Scope limited to short‑answer factual QA: The study does not evaluate longer generation tasks (e.g., summarization, code generation) where the first token may be less informative.
Model size range: Experiments focus on 7‑8 B instruction‑tuned models; it remains unclear how the metric scales to much larger LLMs or smaller distilled models.
Tokenization effects: Different tokenizers could shift where the “first content token” appears, potentially affecting confidence calculations.
Future directions:
- Extend ϕ₁ₙₜ to multi‑turn dialogues and open‑ended generation.
- Investigate adaptive K‑selection or entropy smoothing to improve robustness across tokenizers.
- Explore integration with calibration techniques to turn ϕ₁ₙₜ into well‑calibrated probability estimates.

Bottom line: If you need a quick, cheap sanity check on whether a model’s answer might be hallucinated, start with the entropy of its first meaningful token. In the settings tested it is often enough, and it saves you the cost of sampling dozens of alternative answers.

Authors

Mina Gabriel

Paper Information

arXiv ID: 2605.05166v1
Categories: cs.CL, cs.AI
Published: May 6, 2026
