[Paper] Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits

Published: December 23, 2025 at 01:21 PM EST
4 min read
Source: arXiv - 2512.20578v1

Overview

The paper Can LLMs Predict Their Own Failures? Self‑Awareness via Internal Circuits investigates whether a frozen large language model (LLM) can “look inside” its own computation to spot when it is about to make a mistake. The authors introduce Gnosis, a tiny add‑on that reads hidden‑state and attention signals during generation and predicts the correctness of the output with almost no extra cost.

Key Contributions

  • Gnosis architecture: a lightweight (~5 M parameters) module that extracts fixed‑size descriptors from an LLM’s internal tensors (hidden states, attention maps) without modifying the base model.
  • Self‑verification without external judges: Gnosis predicts correctness directly from the model’s own dynamics, avoiding costly multi‑sample consistency checks or separate verification models.
  • Broad empirical coverage: evaluated on math reasoning, open‑domain QA, and academic knowledge tasks across 1.7 B – 20 B parameter frozen backbones.
  • Superior accuracy & calibration: consistently beats strong internal baselines and even large external judges in both raw prediction accuracy and confidence alignment.
  • Zero‑shot early‑failure detection: can flag a failing generation after only a partial token sequence, enabling compute‑aware control (e.g., early termination or model switching).

Methodology

  1. Signal collection – While the LLM generates each token, Gnosis passively records a small set of internal activations:
    • the final hidden vector of the current token,
    • a pooled summary of the attention‑weight matrix for that step.
  2. Compression – These raw tensors are projected through a tiny feed‑forward network into a fixed‑budget “descriptor” (e.g., a 128‑dim vector). The compression is designed to be length‑agnostic, so the descriptor size does not grow with the sequence.
  3. Prediction head – A lightweight classifier (binary or calibrated confidence output) consumes the descriptor and predicts whether the upcoming token (or the whole completed answer) will be correct.
  4. Training – Gnosis is trained on a held‑out validation set where the ground‑truth correctness is known (e.g., math problem solutions). Importantly, the base LLM remains frozen; only Gnosis’s parameters are updated.
  5. Inference – At test time, Gnosis runs alongside the frozen LLM, adding only a few milliseconds per token and a negligible memory footprint.
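To make these steps concrete, below is a minimal, illustrative PyTorch sketch of a Gnosis‑style probe and its training step. The layer sizes, the mean‑pooled attention summary, and the 128‑dimensional descriptor are assumptions drawn from the description above, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class GnosisProbe(nn.Module):
    """Illustrative Gnosis-style probe (sizes are assumptions, not the paper's exact design).

    Compresses per-step internal signals from a frozen LLM into a fixed-size
    descriptor and predicts the probability that the answer will be correct.
    """

    def __init__(self, hidden_dim: int, num_heads: int, descriptor_dim: int = 128):
        super().__init__()
        # Project the current token's final hidden state plus a pooled attention
        # summary into a fixed-budget descriptor that does not grow with length.
        self.compress = nn.Sequential(
            nn.Linear(hidden_dim + num_heads, descriptor_dim),
            nn.GELU(),
            nn.Linear(descriptor_dim, descriptor_dim),
        )
        # Lightweight prediction head: correctness probability in [0, 1].
        self.head = nn.Linear(descriptor_dim, 1)

    def forward(self, hidden_state: torch.Tensor, attn_weights: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_dim)         -- last-layer vector of the current token
        # attn_weights: (batch, num_heads, seq_len) -- attention row for the current token
        attn_summary = attn_weights.mean(dim=-1)    # pool over keys -> (batch, num_heads)
        descriptor = self.compress(torch.cat([hidden_state, attn_summary], dim=-1))
        return torch.sigmoid(self.head(descriptor)).squeeze(-1)

# Training sketch: the base LLM stays frozen; only the probe is updated against
# binary correctness labels from a held-out labeled set (e.g., graded math answers).
probe = GnosisProbe(hidden_dim=4096, num_heads=32)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-4)
loss_fn = nn.BCELoss()

def train_step(hidden_state, attn_weights, correct_labels):
    optimizer.zero_grad()
    loss = loss_fn(probe(hidden_state, attn_weights), correct_labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a Hugging Face Transformers setup, for example, the hidden_state and attn_weights inputs could be taken from the tensors returned when calling the frozen model with output_hidden_states=True and output_attentions=True.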

Results & Findings

| Benchmark | Model Size | Gnosis Accuracy | External Judge Accuracy |
|---|---|---|---|
| GSM‑8K (math) | 7 B | 78 % | 65 % |
| Natural Questions | 13 B | 71 % | 62 % |
| Academic QA (SciFact) | 20 B | 74 % | 68 % |
  • Calibration: Gnosis’s confidence scores exhibit lower Expected Calibration Error (ECE) than baselines, meaning its probability estimates are more trustworthy.
  • Early detection: When evaluated after only the first 30 % of a generation, Gnosis still predicts failure with >70 % accuracy, enabling dynamic compute decisions.
  • Parameter efficiency: Adding ~5 M parameters to a 20 B model translates to <0.03 % overhead (roughly 5 M / 20 B ≈ 0.025 %), yet yields a >10 % boost in failure‑prediction performance over the best internal baseline.

Practical Implications

  • Compute‑aware generation: Systems can abort a hopeless answer early, switch to a larger model, or request clarification, saving GPU cycles and latency (see the sketch after this list).
  • Safety & reliability layers: Deployments (e.g., code assistants, medical QA) can attach Gnosis as a “self‑monitor” that flags potentially hallucinated outputs before they reach users.
  • Improved user experience: Front‑ends can surface confidence scores or warnings derived from Gnosis, helping developers build more transparent AI assistants.
  • Zero‑cost integration: Since Gnosis works with frozen backbones, existing production models can be retrofitted without retraining the massive LLM, making adoption feasible for SaaS providers.
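As a concrete illustration of the compute‑aware generation point above, the sketch below shows one way a serving loop could consume a Gnosis‑style score to abort or escalate early. The callable interfaces, the 0.3 confidence threshold, and the 30 % check point are hypothetical choices, not prescribed by the paper.

```python
from typing import Callable, List, Tuple
import torch

def generate_with_monitor(
    step_fn: Callable[[List[int]], Tuple[int, torch.Tensor, torch.Tensor]],  # hypothetical: next token id + internal signals
    probe_score: Callable[[torch.Tensor, torch.Tensor], float],              # hypothetical: Gnosis-style correctness probability
    fallback_fn: Callable[[], List[int]],                                    # hypothetical: e.g., rerun the prompt on a larger model
    max_tokens: int = 512,
    eos_id: int = 2,
    check_fraction: float = 0.3,  # check after ~30% of the budget, mirroring the early-detection result
    threshold: float = 0.3,       # assumed cutoff; in practice tuned on held-out data
) -> List[int]:
    """Generate token-by-token and escalate if an early failure is predicted."""
    tokens: List[int] = []
    check_at = int(check_fraction * max_tokens)
    for step in range(max_tokens):
        token, hidden_state, attn_weights = step_fn(tokens)
        tokens.append(token)
        if step == check_at and probe_score(hidden_state, attn_weights) < threshold:
            # Likely failure: stop spending compute here and hand off instead.
            return fallback_fn()
        if token == eos_id:
            break
    return tokens
```

The same hook could instead surface the score to the front‑end as a confidence warning rather than triggering a model switch, which is the transparency use case noted above.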

Limitations & Future Work

  • Training data dependence: Gnosis needs a labeled correctness dataset for each task domain; its zero‑shot ability is limited to detecting failures, not to learning new task semantics.
  • Scope of signals: The current design only taps hidden states and attention weights; other internal cues (e.g., feed‑forward activations, gradient‑based signals) might further improve detection.
  • Generalization to multimodal models: The study focuses on text‑only LLMs; extending Gnosis to vision‑language or audio models remains an open question.
  • Robustness to adversarial prompting: The authors note that crafted prompts could potentially manipulate internal patterns, a direction for future robustness research.

Bottom line: Gnosis demonstrates that LLMs already encode useful “self‑knowledge” in their internal dynamics, and a tiny, model‑agnostic add‑on can unlock this for practical, low‑overhead reliability checks. This opens a promising path toward more trustworthy AI systems without the heavy compute penalties of external verification pipelines.

Authors

  • Amirhosein Ghasemabadi
  • Di Niu

Paper Information

  • arXiv ID: 2512.20578v1
  • Categories: cs.CL
  • Published: December 23, 2025
