[Paper] Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction

Published: May 6, 2026 at 01:07 PM EDT
Source: arXiv - 2605.05134v1

Overview

Large Language Models (LLMs) are great at producing fluent text, but they often slip into “hallucinations”: statements that sound plausible yet are factually wrong. The paper Low‑Cost Black‑Box Detection of LLM Hallucinations via Dynamical System Prediction introduces a way to spot these errors without the heavy compute or external knowledge bases that most existing detectors need. By treating an LLM as a black‑box dynamical system and applying concepts from Koopman operator theory, the authors report detection quality competitive with or better than sampling‑based detectors, using a single forward pass rather than repeated sampling.

Key Contributions

  • Black‑box dynamical‑system view: Re‑frames LLM output sequences as trajectories in a high‑dimensional latent state space, sidestepping the need to peek inside the model.
  • Koopman‑based transition modeling: Learns linear operators that approximate the evolution of factual vs. hallucinated response trajectories, enabling a cheap prediction‑error score.
  • Differential residual score: Computes the mismatch between observed token embeddings and the two regime‑specific Koopman predictions, yielding a robust hallucination indicator.
  • Preference‑aware calibration: Introduces a lightweight, demonstration‑driven threshold‑tuning step that lets users bias the detector toward higher precision or recall depending on domain risk.
  • Empirical validation: Shows competitive or superior performance on three benchmark datasets while cutting inference cost by up to 70 % compared to sampling‑based detectors.
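
For reference, one standard way to fit a linear transition operator from trajectory data is a regularized least-squares regression in the style of dynamic mode decomposition. This summary does not say which estimator the authors use, so the formulation below is an assumption; the snapshot matrices \(X\) and \(X'\) and the ridge parameter \(\lambda\) are illustrative notation, not symbols from the paper.

```latex
\[
X = [\,x_1, \dots, x_{T-1}\,], \qquad X' = [\,x_2, \dots, x_T\,],
\]
\[
K^\star = \arg\min_{K} \; \|X' - K X\|_F^2 + \lambda \|K\|_F^2
        = X' X^\top \left( X X^\top + \lambda I \right)^{-1}.
\]
```

Fitting this once on factual trajectories and once on hallucinated ones would give the two regime-specific operators described above.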

Methodology

  1. Embedding the response: Each token (or sub‑sentence) generated by the LLM is passed through a separate, fixed embedding model (e.g., a sentence‑transformer) to obtain a high‑dimensional vector.
  2. Trajectory construction: The sequence of vectors forms a time‑ordered trajectory \(\{x_t\}\) that is treated as an observable output of an underlying hidden state system.
  3. Koopman operator fitting: Using a modest set of labeled examples (factual vs. hallucinated), the authors fit two linear operators \(K_{\text{fact}}\) and \(K_{\text{hall}}\) that best predict the next embedding:
    \[ \hat{x}_{t+1} = K\,x_t \]
    Separate operators capture the distinct dynamics of truthful and untruthful generation regimes.
  4. Residual scoring: For a new LLM response, the method computes the prediction error under each operator:
    \[ r_{\text{fact}} = \|x_{t+1} - K_{\text{fact}}x_t\|,\quad r_{\text{hall}} = \|x_{t+1} - K_{\text{hall}}x_t\| \]
    The differential residual \(s = r_{\text{fact}} - r_{\text{hall}}\) serves as the hallucination score: a positive value means the hallucination operator predicts the trajectory better than the factual one, which indicates a higher likelihood of hallucination.
  5. Calibration layer: A small validation set (e.g., 50–100 examples) is used to pick a decision threshold that respects a user‑specified trade‑off (e.g., prioritize precision for medical advice). This step is inexpensive and can be re‑run when domain requirements shift.
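
The fitting and scoring steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`fit_operator`, `mean_residual`, `hallucination_score`, `simulate`) are hypothetical, the ridge term and the averaging of per-step residuals over a trajectory are choices made here, and synthetic linear systems stand in for real embedding trajectories. The score is computed as the factual residual minus the hallucination residual, a sign convention chosen so that larger values mean the hallucination operator fits better.

```python
import numpy as np

def fit_operator(trajectories, ridge=1e-3):
    """Ridge least-squares fit of K such that x_{t+1} ~ K @ x_t (assumed estimator)."""
    # Each trajectory is an array of shape (T, d) with T >= 2.
    X = np.concatenate([traj[:-1] for traj in trajectories], axis=0)  # current states
    Y = np.concatenate([traj[1:] for traj in trajectories], axis=0)   # next states
    d = X.shape[1]
    # Rows are states, so Y ~ X @ K.T; solve the regularized normal equations for K.T.
    K_T = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ Y)
    return K_T.T

def mean_residual(K, traj):
    """Average one-step prediction error ||x_{t+1} - K x_t|| along a trajectory."""
    pred = traj[:-1] @ K.T
    return float(np.mean(np.linalg.norm(traj[1:] - pred, axis=1)))

def hallucination_score(K_fact, K_hall, traj):
    """r_fact - r_hall: larger when the hallucination operator predicts better."""
    return mean_residual(K_fact, traj) - mean_residual(K_hall, traj)

def simulate(A, rng, steps=20, noise=0.01):
    """Synthetic stand-in for an embedding trajectory: a noisy linear system."""
    x = rng.normal(size=A.shape[0])
    traj = [x]
    for _ in range(steps - 1):
        x = A @ x + noise * rng.normal(size=x.shape)
        traj.append(x)
    return np.stack(traj)

rng = np.random.default_rng(0)
d = 8
# Two different (hypothetical) regimes, each a scaled random orthogonal map.
A_fact = 0.98 * np.linalg.qr(rng.normal(size=(d, d)))[0]
A_hall = 0.98 * np.linalg.qr(rng.normal(size=(d, d)))[0]

K_fact = fit_operator([simulate(A_fact, rng) for _ in range(50)])
K_hall = fit_operator([simulate(A_hall, rng) for _ in range(50)])

score_fact = hallucination_score(K_fact, K_hall, simulate(A_fact, rng))
score_hall = hallucination_score(K_fact, K_hall, simulate(A_hall, rng))
print(f"factual-regime score: {score_fact:.3f}, hallucinated-regime score: {score_hall:.3f}")
```

In practice the trajectories would come from the fixed embedding model in step 1 rather than from `simulate`, and real embeddings are unlikely to separate as cleanly as these synthetic regimes do.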

Results & Findings

Benchmark                     Metric (F1)   Baseline (sampling)   Proposed method
FactBench (news)              0.84          0.78                  0.86
MedHall (clinical notes)      0.79          0.71                  0.81
CodeHall (programming Q&A)    0.82          0.75                  0.84

  • Resource savings: Average inference time per query dropped from ~120 ms (5‑sample consistency check) to ~35 ms, a ~70 % reduction.
  • Robustness to model size: The detector works across LLMs ranging from 7 B to 175 B parameters with only minor performance variance.
  • Calibration impact: Adjusting the threshold for high‑precision mode raised precision from 0.78 to 0.92 while only modestly lowering recall (0.68 → 0.62), demonstrating practical control over risk tolerance.
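
The precision/recall trade-off in the calibration step can be illustrated with a simple threshold search over a small labeled validation set. This is a sketch under assumptions: the function name `pick_threshold`, the rule "highest recall subject to a minimum precision", and the toy scores are all hypothetical, and scores are assumed to be oriented so that larger means more likely hallucinated.

```python
import numpy as np

def pick_threshold(scores, labels, min_precision=0.9):
    """Return (threshold, precision, recall) with the best recall among
    thresholds whose precision meets the target, or None if none does."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)   # True = hallucinated
    if not labels.any():
        raise ValueError("validation set needs at least one hallucinated example")
    best = None
    for thr in np.unique(scores):             # candidate thresholds, ascending
        flagged = scores >= thr               # never empty: thr is one of the scores
        tp = np.sum(flagged & labels)
        precision = tp / flagged.sum()
        recall = tp / labels.sum()
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (float(thr), float(precision), float(recall))
    return best

# Toy validation set (made-up numbers, for illustration only).
val_scores = [-0.9, -0.5, -0.2, 0.1, 0.3, 0.4, 0.6, 0.8]
val_labels = [0, 0, 0, 1, 0, 1, 1, 1]

strict = pick_threshold(val_scores, val_labels, min_precision=0.9)
lenient = pick_threshold(val_scores, val_labels, min_precision=0.75)
print("high-precision mode:", strict)
print("balanced mode:", lenient)
```

On this toy data, raising the precision target moves the threshold up and gives up some recall, the same qualitative trade-off reported above.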

Practical Implications

  • Plug‑and‑play safety layer: Because the method only needs the LLM’s output and a separate embedding model, it can be wrapped around any existing API (OpenAI, Anthropic, etc.) without retraining the LLM.
  • Low‑cost monitoring for production: SaaS platforms that serve millions of queries can add hallucination detection with negligible additional GPU load, preserving latency budgets.
  • Domain‑specific risk management: The calibration step lets teams in regulated fields (healthcare, finance, legal) set stricter thresholds, aligning detection behavior with compliance requirements.
  • Developer tooling: IDE extensions or CI pipelines that automatically flag potentially hallucinated code snippets or documentation can integrate this detector to improve code‑review quality.
  • Open‑source friendliness: The approach relies on publicly available embedding models and simple linear algebra, making it easy to reproduce and extend in community projects.
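
As a rough picture of the plug-and-play idea, the detector can sit behind any text-generation call as a thin wrapper. Everything here is hypothetical scaffolding: `guarded_generate`, `GuardedResponse`, and the three callables are illustrative names, no real vendor API is used, and splitting on periods is a naive stand-in for proper segmentation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

@dataclass
class GuardedResponse:
    text: str
    score: Optional[float]    # None when the response is too short to score
    flagged: Optional[bool]   # None when no decision could be made

def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],                        # any LLM call: prompt -> text
    embed_segments: Callable[[List[str]], Sequence],       # fixed embedding model
    score_trajectory: Callable[[Sequence], float],         # e.g. a differential residual
    threshold: float,
) -> GuardedResponse:
    text = generate(prompt)
    # Naive segmentation (assumption): split on periods and drop empty pieces.
    segments = [s.strip() for s in text.split(".") if s.strip()]
    if len(segments) < 2:
        # A single segment has no transition to score.
        return GuardedResponse(text, None, None)
    score = float(score_trajectory(embed_segments(segments)))
    return GuardedResponse(text, score, score > threshold)

# Stub components so the sketch runs without any external service.
def fake_generate(prompt: str) -> str:
    return "Paris is the capital of France. It lies on the Seine."

def fake_embed(segments: List[str]) -> List[List[float]]:
    return [[float(len(s)), 1.0] for s in segments]

def fake_score(trajectory: Sequence) -> float:
    return 0.3

result = guarded_generate("Tell me about Paris", fake_generate, fake_embed, fake_score, threshold=0.0)
short = guarded_generate("Is it sunny?", lambda p: "Yes", fake_embed, fake_score, threshold=0.0)
print(result)
print(short)
```

Because the wrapper only sees the generated text, the same code path would apply regardless of which LLM sits behind `generate`.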

Limitations & Future Work

  • Embedding dependence: The quality of the detection hinges on the chosen embedding model; poor semantic representations could blur the distinction between factual and hallucinated dynamics.
  • Limited to observable trajectories: Extremely short responses (e.g., single‑word answers) provide too few transitions for a reliable residual score; a response that embeds to a single vector has no transition to score at all, which reduces effectiveness in those cases.
  • Calibration data requirement: While modest, the need for a labeled demonstration set means the detector must be re‑calibrated when moving to a new domain with different hallucination patterns.
  • Future directions: The authors suggest exploring nonlinear Koopman extensions (e.g., kernel‑based operators) to capture richer dynamics, and integrating lightweight retrieval signals to further boost detection on edge‑case factual queries.
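
To give a sense of what a nonlinear extension might look like, the sketch below lifts states through random Fourier features (a common approximation to a Gaussian kernel) and fits a linear operator in the lifted space. This is not the authors' proposal: the technique is swapped in for illustration, all function names are hypothetical, and two toy nonlinear maps stand in for embedding dynamics.

```python
import numpy as np

def make_lift(dim, n_features=200, bandwidth=1.0, seed=0):
    """Random Fourier feature map approximating a Gaussian kernel."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / bandwidth, size=(n_features, dim))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    def lift(X):
        return np.sqrt(2.0 / n_features) * np.cos(X @ W.T + b)
    return lift

def fit_lifted_operator(trajectories, lift, ridge=1e-3):
    """Ridge fit of K with lift(x_{t+1}) ~ K @ lift(x_t)."""
    P = np.concatenate([lift(t[:-1]) for t in trajectories])
    Q = np.concatenate([lift(t[1:]) for t in trajectories])
    m = P.shape[1]
    return np.linalg.solve(P.T @ P + ridge * np.eye(m), P.T @ Q).T

def lifted_residual(K, lift, traj):
    """Mean one-step prediction error measured in the lifted space."""
    err = lift(traj[1:]) - lift(traj[:-1]) @ K.T
    return float(np.mean(np.linalg.norm(err, axis=1)))

def rollout(step, rng, dim=2, steps=20, noise=0.01):
    """Toy nonlinear trajectory standing in for an embedding sequence."""
    x = rng.uniform(-1.0, 1.0, size=dim)
    traj = [x]
    for _ in range(steps - 1):
        x = step(x) + noise * rng.normal(size=dim)
        traj.append(x)
    return np.stack(traj)

rng = np.random.default_rng(1)
lift = make_lift(dim=2)
regime_a = lambda x: np.sin(2.0 * x)   # hypothetical regime the operator is fit on
regime_b = lambda x: np.cos(3.0 * x)   # hypothetical different regime

K_a = fit_lifted_operator([rollout(regime_a, rng) for _ in range(50)], lift)
r_same = lifted_residual(K_a, lift, rollout(regime_a, rng))
r_other = lifted_residual(K_a, lift, rollout(regime_b, rng))
print(f"residual on same regime: {r_same:.3f}, on other regime: {r_other:.3f}")
```

The residual-based scoring logic is unchanged; only the space in which the linear operator acts is different.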

Authors

  • Dan Wilson
  • Mohamed Akrout

Paper Information

  • arXiv ID: 2605.05134v1
  • Categories: cs.LG, math.DS
  • Published: May 6, 2026