[Paper] Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits

Published: December 23, 2025 at 01:21 PM EST
4 min read
Source: arXiv - 2512.20578v1

Overview

The paper Can LLMs Predict Their Own Failures? Self‑Awareness via Internal Circuits investigates whether a frozen large language model (LLM) can “look inside” its own computation to spot when it is about to make a mistake. The authors introduce Gnosis, a tiny add‑on that reads hidden‑state and attention signals during generation and predicts the correctness of the output with almost no extra cost.

Key Contributions

  • Gnosis architecture: a lightweight (~5 M parameters) module that extracts fixed‑size descriptors from an LLM’s internal tensors (hidden states, attention maps) without modifying the base model.
  • Self‑verification without external judges: Gnosis predicts correctness directly from the model’s own dynamics, avoiding costly multi‑sample consistency checks or separate verification models.
  • Broad empirical coverage: evaluated on math reasoning, open‑domain QA, and academic knowledge tasks across 1.7 B – 20 B parameter frozen backbones.
  • Superior accuracy & calibration: consistently beats strong internal baselines and even large external judges in both raw prediction accuracy and confidence alignment.
  • Zero‑shot early‑failure detection: can flag a failing generation after only a partial token sequence, enabling compute‑aware control (e.g., early termination or model switching).

Methodology

  1. Signal collection – While the LLM generates each token, Gnosis passively records a small set of internal activations:
    • the final hidden vector of the current token,
    • a pooled summary of the attention‑weight matrix for that step.
  2. Compression – These raw tensors are projected through a tiny feed‑forward network into a fixed‑budget “descriptor” (e.g., a 128‑dim vector). The compression is designed to be length‑agnostic, so the descriptor size does not grow with the sequence.
  3. Prediction head – A lightweight classifier (binary or calibrated confidence output) consumes the descriptor and predicts whether the upcoming token (or the whole completed answer) will be correct.
  4. Training – Gnosis is trained on a held‑out validation set where the ground‑truth correctness is known (e.g., math problem solutions). Importantly, the base LLM remains frozen; only Gnosis’s parameters are updated.
  5. Inference – At test time, Gnosis runs alongside the frozen LLM, adding only a few milliseconds per token and a negligible memory footprint.
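To make these steps concrete, below is a minimal, illustrative PyTorch sketch of a Gnosis‑style probe and its training step. The layer sizes, the mean‑pooled attention summary, and the 128‑dimensional descriptor are assumptions drawn from the description above, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class GnosisProbe(nn.Module):
    """Illustrative Gnosis-style probe (sizes are assumptions, not the paper's exact design).

    Compresses per-step internal signals from a frozen LLM into a fixed-size
    descriptor and predicts the probability that the answer will be correct.
    """

    def __init__(self, hidden_dim: int, num_heads: int, descriptor_dim: int = 128):
        super().__init__()
        # Project the current token's final hidden state plus a pooled attention
        # summary into a fixed-budget descriptor that does not grow with length.
        self.compress = nn.Sequential(
            nn.Linear(hidden_dim + num_heads, descriptor_dim),
            nn.GELU(),
            nn.Linear(descriptor_dim, descriptor_dim),
        )
        # Lightweight prediction head: correctness probability in [0, 1].
        self.head = nn.Linear(descriptor_dim, 1)

    def forward(self, hidden_state: torch.Tensor, attn_weights: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_dim)         -- last-layer vector of the current token
        # attn_weights: (batch, num_heads, seq_len) -- attention row for the current token
        attn_summary = attn_weights.mean(dim=-1)    # pool over keys -> (batch, num_heads)
        descriptor = self.compress(torch.cat([hidden_state, attn_summary], dim=-1))
        return torch.sigmoid(self.head(descriptor)).squeeze(-1)

# Training sketch: the base LLM stays frozen; only the probe is updated against
# binary correctness labels from a held-out labeled set (e.g., graded math answers).
probe = GnosisProbe(hidden_dim=4096, num_heads=32)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-4)
loss_fn = nn.BCELoss()

def train_step(hidden_state, attn_weights, correct_labels):
    optimizer.zero_grad()
    loss = loss_fn(probe(hidden_state, attn_weights), correct_labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a Hugging Face Transformers setup, for example, the hidden_state and attn_weights inputs could be taken from the tensors returned when calling the frozen model with output_hidden_states=True and output_attentions=True.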

Results & Findings

| Benchmark | Model Size | Gnosis Accuracy | External Judge Accuracy |
|---|---|---|---|
| GSM‑8K (math) | 7 B | 78 % | 65 % |
| Natural Questions | 13 B | 71 % | 62 % |
| Academic QA (SciFact) | 20 B | 74 % | 68 % |
  • Calibration: Gnosis’s confidence scores exhibit lower Expected Calibration Error (ECE) than baselines, meaning its probability estimates are more trustworthy.
  • Early detection: When evaluated after only the first 30 % of a generation, Gnosis still predicts failure with >70 % accuracy, enabling dynamic compute decisions.
  • Parameter efficiency: Adding ~5 M parameters to a 20 B model translates to <0.03 % overhead (roughly 5 M / 20 B ≈ 0.025 %), yet yields a >10 % boost in failure‑prediction performance over the best internal baseline.

Practical Implications

  • Compute‑aware generation: Systems can abort a hopeless answer early, switch to a larger model, or request clarification, saving GPU cycles and latency (see the sketch after this list).
  • Safety & reliability layers: Deployments (e.g., code assistants, medical QA) can attach Gnosis as a “self‑monitor” that flags potentially hallucinated outputs before they reach users.
  • Improved user experience: Front‑ends can surface confidence scores or warnings derived from Gnosis, helping developers build more transparent AI assistants.
  • Zero‑cost integration: Since Gnosis works with frozen backbones, existing production models can be retrofitted without retraining the massive LLM, making adoption feasible for SaaS providers.
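As a concrete illustration of the compute‑aware generation point above, the sketch below shows one way a serving loop could consume a Gnosis‑style score to abort or escalate early. The callable interfaces, the 0.3 confidence threshold, and the 30 % check point are hypothetical choices, not prescribed by the paper.

```python
from typing import Callable, List, Tuple
import torch

def generate_with_monitor(
    step_fn: Callable[[List[int]], Tuple[int, torch.Tensor, torch.Tensor]],  # hypothetical: next token id + internal signals
    probe_score: Callable[[torch.Tensor, torch.Tensor], float],              # hypothetical: Gnosis-style correctness probability
    fallback_fn: Callable[[], List[int]],                                    # hypothetical: e.g., rerun the prompt on a larger model
    max_tokens: int = 512,
    eos_id: int = 2,
    check_fraction: float = 0.3,  # check after ~30% of the budget, mirroring the early-detection result
    threshold: float = 0.3,       # assumed cutoff; in practice tuned on held-out data
) -> List[int]:
    """Generate token-by-token and escalate if an early failure is predicted."""
    tokens: List[int] = []
    check_at = int(check_fraction * max_tokens)
    for step in range(max_tokens):
        token, hidden_state, attn_weights = step_fn(tokens)
        tokens.append(token)
        if step == check_at and probe_score(hidden_state, attn_weights) < threshold:
            # Likely failure: stop spending compute here and hand off instead.
            return fallback_fn()
        if token == eos_id:
            break
    return tokens
```

The same hook could instead surface the score to the front‑end as a confidence warning rather than triggering a model switch, which is the transparency use case noted above.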

Limitations & Future Work

  • Training data dependence: Gnosis needs a labeled correctness dataset for each task domain; its zero‑shot ability is limited to detecting failures, not to learning new task semantics.
  • Scope of signals: The current design only taps hidden states and attention weights; other internal cues (e.g., feed‑forward activations, gradient‑based signals) might further improve detection.
  • Generalization to multimodal models: The study focuses on text‑only LLMs; extending Gnosis to vision‑language or audio models remains an open question.
  • Robustness to adversarial prompting: The authors note that crafted prompts could potentially manipulate internal patterns, a direction for future robustness research.

Bottom line: Gnosis demonstrates that LLMs already encode useful “self‑knowledge” in their internal dynamics, and a tiny, model‑agnostic add‑on can unlock this for practical, low‑overhead reliability checks. This opens a promising path toward more trustworthy AI systems without the heavy compute penalties of external verification pipelines.

Authors

  • Amirhosein Ghasemabadi
  • Di Niu

Paper Information

  • arXiv ID: 2512.20578v1
  • Categories: cs.CL
  • Published: December 23, 2025
