[Paper] Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction
Source: arXiv - 2605.05134v1
Overview
Large Language Models (LLMs) are great at producing fluent text, but they often slip into “hallucinations” – statements that sound plausible yet are factually wrong. The paper Low‑Cost Black‑Box Detection of LLM Hallucinations via Dynamical System Prediction introduces a way to spot these errors without the heavy compute or external knowledge bases that most existing detectors need. By treating an LLM as a black‑box dynamical system and applying concepts from Koopman operator theory, the authors report detection performance that matches or exceeds sampling‑based baselines while needing only a single forward pass.
Key Contributions
- Black‑box dynamical‑system view: Re‑frames LLM output sequences as trajectories in a high‑dimensional latent state space, sidestepping the need to peek inside the model.
- Koopman‑based transition modeling: Learns linear operators that approximate the evolution of factual vs. hallucinated response trajectories, enabling a cheap prediction‑error score.
- Differential residual score: Computes the mismatch between observed token embeddings and the two regime‑specific Koopman predictions, yielding a robust hallucination indicator.
- Preference‑aware calibration: Introduces a lightweight, demonstration‑driven threshold‑tuning step that lets users bias the detector toward higher precision or recall depending on domain risk.
- Empirical validation: Shows competitive or superior performance on three benchmark datasets while cutting inference cost by up to 70 % compared to sampling‑based detectors.
Methodology
- Embedding the response: Each token (or sub‑sentence) generated by the LLM is passed through a separate, fixed embedding model (e.g., a sentence‑transformer) to obtain a high‑dimensional vector.
- Trajectory construction: The sequence of vectors forms a time‑ordered trajectory \(\{x_t\}\) that is treated as an observable output of an underlying hidden state system.
- Koopman operator fitting: Using a modest set of labeled examples (factual vs. hallucinated), the authors fit two linear operators \(K_{\text{fact}}\) and \(K_{\text{hall}}\), each of which best predicts the next embedding within its own regime:
\[ \hat{x}_{t+1} = K\,x_t \]
Separate operators capture the distinct dynamics of truthful and untruthful generation regimes.
- Residual scoring: For a new LLM response, the method computes the prediction error under each operator:
\[ r_{\text{fact}} = \|x_{t+1} - K_{\text{fact}}x_t\|,\quad r_{\text{hall}} = \|x_{t+1} - K_{\text{hall}}x_t\| \]
The differential residual \(s = r_{\text{fact}} - r_{\text{hall}}\) serves as the hallucination score: positive values mean the hallucination operator predicts the trajectory better than the factual one, indicating a higher likelihood of hallucination.
- Calibration layer: A small validation set (e.g., 50–100 examples) is used to pick a decision threshold that respects a user‑specified trade‑off (e.g., prioritize precision for medical advice). This step is inexpensive and can be re‑run when domain requirements shift.
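The fitting and scoring steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the ridge term, the averaging of residuals over time steps, the function names, and the toy 2‑D trajectories standing in for real embeddings are all assumptions. The score here is oriented so that a positive value means the hallucination operator fits the trajectory better.

```python
import numpy as np

def fit_operator(trajectories, ridge=1e-3):
    """Ridge least-squares fit of K so that x_{t+1} ~ K x_t on all trajectories."""
    X = np.concatenate([t[:-1] for t in trajectories])  # states x_t, one per row
    Y = np.concatenate([t[1:] for t in trajectories])   # successors x_{t+1}
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y).T                # solves for K^T, then transposes

def mean_residual(K, traj):
    """Average one-step prediction error ||x_{t+1} - K x_t|| along a trajectory."""
    return float(np.mean(np.linalg.norm(traj[1:] - traj[:-1] @ K.T, axis=1)))

def hallucination_score(K_fact, K_hall, traj):
    """Positive when the hallucination operator predicts the trajectory better."""
    return mean_residual(K_fact, traj) - mean_residual(K_hall, traj)

# Toy stand-in for embedding trajectories (assumption): two noisy 2-D linear systems.
def rotation(theta, scale=0.95):
    c, s = np.cos(theta), np.sin(theta)
    return scale * np.array([[c, -s], [s, c]])

def simulate(K, rng, steps=12, noise=0.01):
    x = rng.normal(size=K.shape[0])
    x = x / np.linalg.norm(x)                           # unit-norm starting state
    traj = [x]
    for _ in range(steps):
        x = K @ x + noise * rng.normal(size=x.shape)
        traj.append(x)
    return np.array(traj)

rng = np.random.default_rng(0)
true_fact, true_hall = rotation(0.3), rotation(1.2)
K_fact = fit_operator([simulate(true_fact, rng) for _ in range(20)])
K_hall = fit_operator([simulate(true_hall, rng) for _ in range(20)])

score_hall = hallucination_score(K_fact, K_hall, simulate(true_hall, rng))
score_fact = hallucination_score(K_fact, K_hall, simulate(true_fact, rng))
print(score_hall > 0, score_fact < 0)
```

In a real pipeline, each trajectory would be the array of embeddings for one labeled response, with one row per token or sub‑sentence.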
Results & Findings
| Benchmark | Metric (F1) | Baseline (sampling) | Proposed method |
|---|---|---|---|
| FactBench (news) | 0.84 | 0.78 | 0.86 |
| MedHall (clinical notes) | 0.79 | 0.71 | 0.81 |
| CodeHall (programming Q&A) | 0.82 | 0.75 | 0.84 |
- Resource savings: Average inference time per query dropped from ~120 ms (5‑sample consistency check) to ~35 ms, a ~70 % reduction.
- Robustness to model size: The detector works across LLMs ranging from 7 B to 175 B parameters with only minor performance variance.
- Calibration impact: Adjusting the threshold for high‑precision mode raised precision from 0.78 to 0.92 while only modestly lowering recall (0.68 → 0.62), demonstrating practical control over risk tolerance.
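The precision/recall trade‑off described above can be illustrated with a simple threshold search over a labeled validation set. This is a sketch under stated assumptions rather than the paper's calibration procedure: the function name `calibrate_threshold`, the "lowest threshold that meets a precision target" rule, and the eight validation scores are all made up for illustration.

```python
import numpy as np

def calibrate_threshold(scores, labels, min_precision):
    """Return (threshold, precision, recall) for the lowest threshold whose
    validation precision meets the target; None if no threshold does.
    labels: 1 = hallucinated, 0 = factual. A response is flagged when score >= threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    positives = max(int(labels.sum()), 1)
    for thr in np.unique(scores):                 # sorted ascending
        flagged = scores >= thr                   # never empty: thr is one of the scores
        tp = int(np.sum(flagged & (labels == 1)))
        precision = tp / int(flagged.sum())
        if precision >= min_precision:
            # Recall can only fall as the threshold rises, so the first hit
            # is the highest-recall threshold that satisfies the target.
            return float(thr), precision, tp / positives
    return None

# Hypothetical validation scores (higher = more likely hallucinated).
val_scores = [-0.8, -0.5, -0.1, 0.0, 0.2, 0.3, 0.6, 0.9]
val_labels = [0, 0, 1, 0, 1, 0, 1, 1]

balanced = calibrate_threshold(val_scores, val_labels, min_precision=0.7)
high_precision = calibrate_threshold(val_scores, val_labels, min_precision=0.9)
print(balanced, high_precision)
```

Raising the precision target pushes the threshold up and trades away recall, which is the same qualitative pattern as the high‑precision mode reported above.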
Practical Implications
- Plug‑and‑play safety layer: Because the method only needs the LLM’s output and a separate embedding model, it can be wrapped around any existing API (OpenAI, Anthropic, etc.) without retraining the LLM.
- Low‑cost monitoring for production: SaaS platforms that serve millions of queries can add hallucination detection with negligible additional GPU load, preserving latency budgets.
- Domain‑specific risk management: The calibration step lets teams in regulated fields (healthcare, finance, legal) set stricter thresholds, aligning detection behavior with compliance requirements.
- Developer tooling: IDE extensions or CI pipelines that automatically flag potentially hallucinated code snippets or documentation can integrate this detector to improve code‑review quality.
- Open‑source friendliness: The approach relies on publicly available embedding models and simple linear algebra, making it easy to reproduce and extend in community projects.
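To make the plug‑and‑play idea concrete, the sketch below wraps an arbitrary text generator with a detector. Everything here is hypothetical glue code: `make_guarded_generator`, the `generate`, `embed`, and `score` callables, the naive period‑based sentence split, and the `min_segments` guard are illustrative assumptions, and the stand‑in lambdas exist only so the example runs without an API key or an embedding model.

```python
import numpy as np

def make_guarded_generator(generate, embed, score, threshold, min_segments=3):
    """Wrap a text generator so each response also carries a hallucination flag."""
    def guarded(prompt):
        text = generate(prompt)
        # Naive sentence split; a real deployment would use a proper segmenter.
        segments = [s.strip() for s in text.split(".") if s.strip()]
        if len(segments) < min_segments:
            # Too few transitions to score: report "unknown" rather than guess.
            return {"text": text, "score": None, "flagged": None}
        s = float(score(np.asarray(embed(segments), dtype=float)))
        return {"text": text, "score": s, "flagged": s >= threshold}
    return guarded

# Stand-in callables (assumptions), not a real LLM, embedder, or detector.
fake_generate = lambda prompt: "Yes." if prompt == "short" else "A. B is long. C is longer still."
fake_embed = lambda segs: [[len(s), s.count(" ")] for s in segs]
fake_score = lambda traj: traj[-1, 0] - traj[0, 0]

guarded = make_guarded_generator(fake_generate, fake_embed, fake_score, threshold=0.0)
long_result = guarded("long")
short_result = guarded("short")
print(long_result["flagged"], short_result["flagged"])
```

In practice `generate` would call the hosted LLM, `embed` a fixed sentence‑embedding model, and `score` a fitted differential‑residual detector; the wrapper itself never touches the LLM's weights.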
Limitations & Future Work
- Embedding dependence: The quality of the detection hinges on the chosen embedding model; poor semantic representations could blur the distinction between factual and hallucinated dynamics.
- Limited to observable trajectories: Extremely short responses (e.g., single‑word answers) yield too few embedding transitions for reliable Koopman fitting or residual scoring, reducing effectiveness in those cases.
- Calibration data requirement: While modest, the need for a labeled demonstration set means the detector must be re‑calibrated when moving to a new domain with different hallucination patterns.
- Future directions: The authors suggest exploring nonlinear Koopman extensions (e.g., kernel‑based operators) to capture richer dynamics, and integrating lightweight retrieval signals to further boost detection on edge‑case factual queries.
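As a rough illustration of why a nonlinear extension might help, the sketch below compares a plain linear fit with a fit in a lifted space of hand‑picked observables. It uses an explicit dictionary (in the style of extended dynamic mode decomposition) rather than the kernel‑based operators the authors mention, and the toy system, the dictionary `lift`, and all parameter values are assumptions chosen so that the lifted dynamics are exactly linear.

```python
import numpy as np

def lift(X):
    """Hypothetical dictionary of observables: [x1, x2, x1^2]."""
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2])

def fit_operator(X, Y):
    """Least-squares K with Y ~ X K^T (rows are samples)."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W.T

def step(x, a=0.9, b=0.5, c=0.3):
    """Toy nonlinear map: the second coordinate depends on x1 squared."""
    return np.array([a * x[0], b * x[1] + c * x[0] ** 2])

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))      # sampled states x_t
Y = np.array([step(x) for x in X])         # their successors x_{t+1}

K_lin = fit_operator(X, Y)                 # linear operator on the raw state
K_lift = fit_operator(lift(X), lift(Y))    # linear operator on lifted observables

err_lin = float(np.mean(np.linalg.norm(Y - X @ K_lin.T, axis=1)))
# Predict in the lifted space, then read the state back from the first two coordinates.
err_lift = float(np.mean(np.linalg.norm(Y - (lift(X) @ K_lift.T)[:, :2], axis=1)))
print(err_lin, err_lift)
```

The raw linear fit cannot represent the quadratic term, so its one‑step error stays well above zero, while the lifted fit reproduces the toy dynamics up to floating‑point error. Whether a comparable gain appears on real embedding trajectories is an open question that this toy example does not answer.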
Authors
- Dan Wilson
- Mohamed Akrout
Paper Information
- arXiv ID: 2605.05134v1
- Categories: cs.LG, math.DS
- Published: May 6, 2026