[Paper] Internal Representations as Indicators of Hallucinations in Agent Tool Selection
Source: arXiv - 2601.05214v1
Overview
The paper tackles a subtle but critical failure mode of large language model (LLM) agents: selecting the wrong tool, emitting malformed parameters, or “bypassing” a tool altogether. While LLMs can call APIs, run shells, or query databases, they sometimes generate outputs that look plausible yet never actually invoke the intended external service. The authors propose a lightweight, real‑time detection framework that reads the model’s own internal hidden states during the same forward pass used for generation, flagging these hallucinations before they cause downstream damage.
Key Contributions
- In‑situ hallucination detector: Leverages intermediate token‑level representations (attention weights, hidden activations) to predict tool‑calling errors without extra model passes or external validators.
- Unified detection of three error types: (1) wrong tool selection, (2) malformed or missing parameters, and (3) tool‑bypass (simulation instead of a real call); the illustrative example after this list shows all three.
- Domain‑agnostic evaluation: Tested on reasoning benchmarks spanning code execution, web search, and data‑retrieval tasks, achieving up to 86.4 % accuracy in real‑time settings.
- Minimal overhead: The detector adds < 5 ms latency on a typical 7B‑parameter LLM, preserving the low‑latency guarantees needed for production agents.
- Open‑source reference implementation: Includes a plug‑and‑play wrapper for popular LLM inference libraries (e.g., HuggingFace Transformers, vLLM).
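To make the three error types concrete, here is a small illustrative example; the user request, tool names, and argument shapes below are invented for illustration and do not come from the paper.

```python
# Illustrative only: the three hallucination categories the detector targets,
# shown as hypothetical agent outputs for a single user request.
user_request = "What is the weather in Paris right now?"

examples = {
    # Correct behaviour: the intended tool with well-formed arguments.
    "correct": {"tool": "weather_api", "args": {"city": "Paris"}},
    # (1) Wrong tool selection: a plausible but unintended tool is chosen.
    "wrong_tool": {"tool": "web_search", "args": {"query": "Paris"}},
    # (2) Malformed or missing parameters: the right tool, broken arguments.
    "malformed_params": {"tool": "weather_api", "args": {"citty": None}},
    # (3) Tool bypass: the model fabricates an answer instead of calling anything.
    "tool_bypass": "It is currently 21 °C and sunny in Paris.",
}
```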
Methodology
- Instrument the forward pass – While the LLM generates the next token, the framework extracts a small set of hidden vectors (e.g., the last‑layer hidden state and the attention scores for the “tool‑call” token).
- Feature construction – These vectors are combined into a feature vector and fed to a lightweight classifier (a 2‑layer MLP) trained on a labeled dataset of correct vs. hallucinated tool calls; a minimal sketch follows this list.
- Binary decision – The classifier outputs a confidence score; if it exceeds a configurable threshold, the system aborts the generation, logs the event, and optionally falls back to a safe default (e.g., ask the user for clarification or invoke a verification service).
- Training data – The authors created a synthetic corpus where the same prompt is paired with both a correct tool call and a deliberately corrupted version (wrong tool, missing args, or simulated output). This yields a balanced training set without requiring expensive human annotation.
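The first three steps can be summarized in a minimal sketch, assuming the feature vector concatenates the tool‑call token’s last‑layer hidden state with a pooled attention summary and feeds it to the 2‑layer MLP described above; class names, dimensions, and the pooling choice are illustrative, not the authors’ released code.

```python
import torch
import torch.nn as nn

class HallucinationHead(nn.Module):
    """2-layer MLP that maps extracted features to a hallucination confidence."""
    def __init__(self, feature_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Confidence in [0, 1] that the pending tool call is hallucinated.
        return torch.sigmoid(self.mlp(features)).squeeze(-1)


def should_abort(hidden_state: torch.Tensor,
                 attn_summary: torch.Tensor,
                 head: HallucinationHead,
                 threshold: float = 0.5) -> bool:
    """Binary decision: abort the tool call if the score exceeds the threshold."""
    features = torch.cat([hidden_state, attn_summary], dim=-1)
    score = head(features.unsqueeze(0)).item()
    return score > threshold
```

In this framing, the head would be trained on the balanced synthetic corpus of correct vs. corrupted tool calls described in the last step.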
The entire pipeline runs in a single forward pass, meaning the detector does not require a second inference or a separate verification model.
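Because the needed signals are produced by the generation pass anyway, they can be captured with standard inference options rather than a second run. A minimal sketch using HuggingFace Transformers, assuming a locally hosted causal LM; the model name and the choice of which position to read are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # illustrative; any local causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Call the weather tool for Paris."
inputs = tok(prompt, return_tensors="pt")

# Ask generate() to return the hidden states it computes anyway, so the
# detector reuses the same forward pass instead of running a second one.
out = model.generate(
    **inputs,
    max_new_tokens=64,
    return_dict_in_generate=True,
    output_hidden_states=True,
)

# out.hidden_states holds one tuple per generated token, each containing one
# tensor per layer. The last layer's state at the final prompt position is the
# representation that predicted the first new token; here it stands in for the
# "tool-call token" representation fed to the detection head.
first_step = out.hidden_states[0]         # per-layer tensors for step 0
tool_call_hidden = first_step[-1][0, -1]  # last layer, batch 0, last position
```

Attention scores can be captured the same way with `output_attentions=True` and pooled into the attention summary used by the head above.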
Results & Findings
| Task Domain | Detection Accuracy | Precision (Hallucination) | Recall (Hallucination) |
|---|---|---|---|
| Code execution (Python REPL) | 84.1 % | 0.88 | 0.79 |
| Web search API | 86.4 % | 0.91 | 0.82 |
| Database query tool | 81.7 % | 0.85 | 0.78 |
| Mixed‑domain benchmark | 83.2 % | 0.87 | 0.80 |
- Parameter‑level errors (e.g., malformed JSON) were detected with the highest recall (> 85 %).
- Tool‑bypass cases (where the model “pretended” to run a tool) were the hardest, but still achieved > 80 % precision.
- Adding the detector increased end‑to‑end latency by 3–5 ms on a GPU‑accelerated inference server, well within typical SLA windows for interactive agents.
Practical Implications
- Safer production agents – Real‑time detection lets you reject or quarantine suspicious tool calls before they hit external services, protecting API keys, rate limits, and audit logs (see the guard sketch after this list).
- Reduced debugging time – Developers can instrument their agents with the detector and get immediate alerts when hallucinations occur, cutting down on post‑mortem analysis.
- Cost savings – By avoiding unnecessary external calls (especially to paid APIs), organizations can lower operational expenses.
- Compliance & audit – The framework can be wired into existing security pipelines to enforce “no‑bypass” policies, ensuring every action is traceable.
- Plug‑and‑play integration – Because the detector works on hidden states, it can be added to any Transformer‑based LLM without retraining the base model. Note that it needs access to internal activations, so it is most directly applicable to self‑hosted, open‑weight models (e.g., LLaMA‑2); hosted API models such as GPT‑4 or Claude do not expose the required hidden states.
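The “reject or quarantine” pattern from the first bullet can be wrapped around whatever dispatcher an agent already uses. A minimal sketch, where `detector_score` and the tool registry are hypothetical placeholders rather than the paper’s API:

```python
from typing import Any, Callable, Dict

def guarded_dispatch(
    tool_name: str,
    args: Dict[str, Any],
    detector_score: Callable[[], float],  # returns the score computed during generation
    tools: Dict[str, Callable[..., Any]],
    threshold: float = 0.5,
) -> Dict[str, Any]:
    """Only forward a tool call to the real service if the detector does not flag it."""
    score = detector_score()
    if score > threshold:
        # Quarantine: log and fall back instead of touching the external service.
        return {"status": "rejected", "reason": "suspected hallucinated tool call", "score": score}
    if tool_name not in tools:
        return {"status": "rejected", "reason": f"unknown tool: {tool_name}"}
    return {"status": "ok", "result": tools[tool_name](**args)}
```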
Limitations & Future Work
- Model‑specific tuning – The classifier was trained on a handful of model sizes (7B–13B). Transferability to much larger or fundamentally different architectures (e.g., decoder‑only vs. encoder‑decoder) may require additional fine‑tuning.
- Synthetic training bias – While synthetic hallucinations cover many patterns, they may not capture rare, real‑world edge cases that emerge in production.
- Threshold sensitivity – Choosing the detection threshold trades off false positives (unnecessary aborts) against missed hallucinations; adaptive thresholds per domain are an open research direction (a simple calibration sketch follows this list).
- Extending beyond tool calls – The authors plan to explore whether the same internal‑representation signals can flag other LLM failures, such as factual inaccuracies or policy violations.
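The threshold trade‑off above can be handled with a simple per‑domain calibration; the sketch below is an illustration, not the authors’ procedure, and picks the lowest threshold whose false‑positive rate on a validation split stays within a budget:

```python
import numpy as np

def calibrate_threshold(scores: np.ndarray, labels: np.ndarray,
                        max_false_positive_rate: float = 0.05) -> float:
    """Lowest threshold whose false-positive rate stays under the budget.

    scores: detector confidences on a validation split.
    labels: 1 = hallucinated tool call, 0 = correct tool call.
    """
    negatives = scores[labels == 0]
    for t in np.unique(scores):  # candidate thresholds in ascending order
        fpr = float((negatives > t).mean()) if negatives.size else 0.0
        if fpr <= max_false_positive_rate:
            return float(t)  # lowest admissible threshold keeps recall highest
    return float(scores.max())  # fall back to the strictest threshold
```

Running this per task domain (code execution, web search, database queries) would yield the adaptive, domain‑specific thresholds the authors mention as an open direction.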
Overall, the paper offers a pragmatic, low‑overhead solution that brings the reliability of LLM‑driven agents a step closer to production‑grade standards.
Authors
- Kait Healy
- Bharathi Srinivasan
- Visakh Madathil
- Jing Wu
Paper Information
- arXiv ID: 2601.05214v1
- Categories: cs.AI
- Published: January 8, 2026