[Paper] Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits
Source: arXiv - 2512.20578v1
Overview
The paper Can LLMs Predict Their Own Failures? Self‑Awareness via Internal Circuits investigates whether a frozen large language model (LLM) can “look inside” its own computation to spot when it is about to make a mistake. The authors introduce Gnosis, a tiny add‑on that reads hidden‑state and attention signals during generation and predicts the correctness of the output with almost no extra cost.
Key Contributions
- Gnosis architecture: a lightweight (~5 M parameters) module that extracts fixed‑size descriptors from an LLM’s internal tensors (hidden states, attention maps) without modifying the base model.
- Self‑verification without external judges: Gnosis predicts correctness directly from the model’s own dynamics, avoiding costly multi‑sample consistency checks or separate verification models.
- Broad empirical coverage: evaluated on math reasoning, open‑domain QA, and academic knowledge tasks across 1.7 B – 20 B parameter frozen backbones.
- Superior accuracy & calibration: consistently beats strong internal baselines and even large external judges in both raw prediction accuracy and confidence alignment.
- Zero‑shot early‑failure detection: can flag a failing generation after only a partial token sequence, enabling compute‑aware control (e.g., early termination or model switching).
Methodology
- Signal collection – While the LLM generates each token, Gnosis passively records a small set of internal activations:
  - the final hidden vector of the current token,
  - a pooled summary of the attention‑weight matrix for that step.
- Compression – These raw tensors are projected through a tiny feed‑forward network into a fixed‑budget “descriptor” (e.g., a 128‑dim vector). The compression is designed to be length‑agnostic, so the descriptor size does not grow with the sequence.
- Prediction head – A lightweight classifier (binary or calibrated confidence output) consumes the descriptor and predicts whether the upcoming token (or the whole completed answer) will be correct; a code sketch of this architecture follows the list.
- Training – Gnosis is trained on a held‑out validation set where the ground‑truth correctness is known (e.g., math problem solutions); a minimal training‑loop sketch also follows. Importantly, the base LLM remains frozen; only Gnosis’s parameters are updated.
- Inference – At test time, Gnosis runs alongside the frozen LLM, adding only a few milliseconds per token and a negligible memory footprint.
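The sketch below illustrates these steps as a small PyTorch module. The class name `GnosisProbe`, the 128‑dimensional descriptor, and mean‑pooling of the attention weights are illustrative assumptions, not the paper’s exact design; the point is that a few small projections over per‑step activations yield a fixed‑size descriptor and a correctness probability.

```python
# Hypothetical sketch of a Gnosis-style probe. The class name, the 128-dim
# descriptor, and the pooling choices are illustrative assumptions, not the
# authors' exact architecture.
import torch
import torch.nn as nn


class GnosisProbe(nn.Module):
    """Reads a frozen LLM's per-step hidden state and attention row and
    predicts a correctness probability from a fixed-size descriptor."""

    def __init__(self, hidden_dim: int, num_heads: int, descriptor_dim: int = 128):
        super().__init__()
        # Compress the final hidden vector of the current token.
        self.hidden_proj = nn.Linear(hidden_dim, descriptor_dim)
        # Compress a pooled summary of the attention weights (one value per head).
        self.attn_proj = nn.Linear(num_heads, descriptor_dim)
        # Lightweight prediction head over the fused, length-agnostic descriptor.
        self.head = nn.Sequential(
            nn.Linear(2 * descriptor_dim, descriptor_dim),
            nn.GELU(),
            nn.Linear(descriptor_dim, 1),
        )

    def forward(self, hidden_state: torch.Tensor, attn_weights: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_dim), last-layer vector for the current token
        # attn_weights: (batch, num_heads, seq_len), the current token's attention row
        attn_summary = attn_weights.mean(dim=-1)  # pool over sequence length
        descriptor = torch.cat(
            [self.hidden_proj(hidden_state), self.attn_proj(attn_summary)], dim=-1
        )
        return torch.sigmoid(self.head(descriptor))  # P(output will be correct)
```

Pooling the attention row over the sequence dimension is what keeps the descriptor length‑agnostic: it stays the same size whether the context holds fifty tokens or five thousand.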
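Training then reduces to binary classification while the backbone stays frozen. Below is a minimal sketch of such a loop, assuming per‑step activations have been precomputed on a labeled held‑out set; the dimensions, optimizer choice, and synthetic stand‑in data are assumptions, and `GnosisProbe` refers to the hypothetical class above.

```python
# Hypothetical training loop: only the probe is optimized against binary
# correctness labels gathered on a held-out set; the backbone stays frozen.
# Dimensions and the synthetic stand-in data are illustrative assumptions.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

HIDDEN_DIM, NUM_HEADS, SEQ_LEN, N = 4096, 32, 64, 256

# Stand-in for precomputed per-answer activations and correctness labels.
dataset = TensorDataset(
    torch.randn(N, HIDDEN_DIM),           # final hidden vectors
    torch.rand(N, NUM_HEADS, SEQ_LEN),    # attention rows to be pooled by the probe
    torch.randint(0, 2, (N,)).float(),    # 1 = answer was correct
)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

probe = GnosisProbe(hidden_dim=HIDDEN_DIM, num_heads=NUM_HEADS)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-4)

for hidden_state, attn_weights, is_correct in loader:
    prob_correct = probe(hidden_state, attn_weights).squeeze(-1)
    loss = F.binary_cross_entropy(prob_correct, is_correct)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```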
Results & Findings
| Benchmark | Model Size | Gnosis Accuracy | External Judge Accuracy |
|---|---|---|---|
| GSM‑8K (math) | 7 B | 78 % | 65 % |
| Natural Questions | 13 B | 71 % | 62 % |
| Academic QA (SciFact) | 20 B | 74 % | 68 % |
- Calibration: Gnosis’s confidence scores exhibit lower Expected Calibration Error (ECE) than the baselines, meaning its probability estimates are more trustworthy (a reference ECE computation appears after this list).
- Early detection: When evaluated after only the first 30 % of a generation, Gnosis still predicts failure with >70 % accuracy, enabling dynamic compute decisions.
- Parameter efficiency: Adding ~5 M parameters to a 20 B model translates to <0.03 % overhead, yet yields a >10 % boost in failure‑prediction performance over the best internal baseline.
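For reference, Expected Calibration Error bins predictions by confidence and averages the gap between each bin’s mean confidence and its empirical accuracy, weighted by bin size. The function below is a standard ECE computation, not code from the paper:

```python
import numpy as np

def expected_calibration_error(confidences, labels, n_bins: int = 10) -> float:
    """ECE: bin-size-weighted gap between predicted confidence and accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    labels = np.asarray(labels, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap   # weight the gap by the fraction of samples in the bin
    return ece

# Example: well-calibrated scores yield a small ECE.
print(expected_calibration_error([0.9, 0.8, 0.2, 0.6], [1, 1, 0, 1]))
```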
Practical Implications
- Compute‑aware generation: Systems can abort a hopeless answer early, switch to a larger model, or request clarification, saving GPU cycles and latency (a control‑loop sketch follows this list).
- Safety & reliability layers: Deployments (e.g., code assistants, medical QA) can attach Gnosis as a “self‑monitor” that flags potentially hallucinated outputs before they reach users.
- Improved user experience: Front‑ends can surface confidence scores or warnings derived from Gnosis, helping developers build more transparent AI assistants.
- Zero‑cost integration: Since Gnosis works with frozen backbones, existing production models can be retrofitted without retraining the massive LLM, making adoption feasible for SaaS providers.
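As a concrete illustration of the compute‑aware generation point above, a controller can poll the probe at periodic checkpoints during decoding and stop (or hand off to a larger model) once the predicted probability of correctness drops below a threshold. In the sketch below, `generate_step` and `probe_score` are hypothetical placeholders rather than APIs from the paper:

```python
from typing import Callable, List, Tuple

def generate_with_early_abort(
    generate_step: Callable[[List[int]], int],   # hypothetical: returns the next token id
    probe_score: Callable[[List[int]], float],   # hypothetical: probe's P(answer correct)
    max_tokens: int = 512,
    check_every: int = 32,
    abort_threshold: float = 0.2,
) -> Tuple[List[int], bool]:
    """Decode token by token; stop early when the probe's predicted probability
    of correctness falls below the threshold so the caller can escalate."""
    tokens: List[int] = []
    for step in range(1, max_tokens + 1):
        tokens.append(generate_step(tokens))
        if step % check_every == 0 and probe_score(tokens) < abort_threshold:
            return tokens, False   # flagged as a likely failure
    return tokens, True            # completed without an abort signal
```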
Limitations & Future Work
- Training data dependence: Gnosis needs a labeled correctness dataset for each task domain; its zero‑shot ability covers detecting failures, not learning new task semantics.
- Scope of signals: The current design only taps hidden states and attention weights; other internal cues (e.g., feed‑forward activations, gradient‑based signals) might further improve detection.
- Generalization to multimodal models: The study focuses on text‑only LLMs; extending Gnosis to vision‑language or audio models remains an open question.
- Robustness to adversarial prompting: The authors note that crafted prompts could potentially manipulate internal patterns, a direction for future robustness research.
Bottom line: Gnosis demonstrates that LLMs already encode useful “self‑knowledge” in their internal dynamics, and a tiny, model‑agnostic add‑on can unlock this for practical, low‑overhead reliability checks. This opens a promising path toward more trustworthy AI systems without the heavy compute penalties of external verification pipelines.
Authors
- Amirhosein Ghasemabadi
- Di Niu
Paper Information
- arXiv ID: 2512.20578v1
- Categories: cs.CL
- Published: December 23, 2025