[Paper] Latent Debate: A Surrogate Framework for Interpreting LLM Thinking
Source: arXiv - 2512.01909v1
Overview
The paper Latent Debate: A Surrogate Framework for Interpreting LLM Thinking proposes a new way to peek inside a single large language model (LLM) while it is answering a question. Instead of running multiple models or prompting the same model to argue with itself, the authors extract implicit “supporting” and “attacking” signals that naturally emerge inside the model’s hidden layers. This “latent debate” gives a structured, human‑readable surrogate that mirrors the LLM’s decision‑making and even flags when the model is likely to hallucinate.
Key Contributions
- Latent Debate Concept – Introduces a model‑agnostic framework that treats the hidden activations of a single LLM as an internal debate between arguments for and against a prediction.
- Symbolic Instantiation – Provides a concrete implementation for True/False tasks, mapping activation patterns to explicit support/attack scores.
- Faithful Surrogate Model – Shows that the surrogate’s predictions align closely (≈ 95 % agreement) with the original LLM, confirming it captures the core reasoning process.
- Hallucination Detector – Demonstrates that debate‑pattern features (e.g., high mid‑layer conflict) serve as a strong baseline for spotting hallucinated outputs.
- Empirical Correlation Analysis – Reveals systematic links between debate intensity at different layers and the likelihood of hallucination, offering a diagnostic lens for model behavior.
Methodology
- Conceptual Layer – Treat each hidden layer as a collection of “arguments.” Positive‑valued neurons are interpreted as support for the predicted answer, while negative‑valued neurons act as attack signals.
- Symbolic Approximation – For a binary (True/False) task, the authors define a simple scoring function (a minimal code sketch follows this list):
  \[ \text{Score} = \sum_{l}\big(\underbrace{\sum_{i\in\text{support}_l} h_{i}^{(l)}}_{\text{support}} - \underbrace{\sum_{j\in\text{attack}_l} h_{j}^{(l)}}_{\text{attack}}\big) \]
  where \(h_{i}^{(l)}\) is the activation of neuron \(i\) at layer \(l\), and \(\text{support}_l\), \(\text{attack}_l\) index that layer’s support and attack neurons. The sign of the final score yields the surrogate’s prediction.
- Training‑Free Extraction – No extra fine‑tuning is required; the surrogate is built directly from the forward pass of the original LLM.
- Evaluation Pipeline – The authors compare surrogate predictions against the LLM on benchmark True/False datasets, and they compute debate‑pattern statistics (e.g., variance of support vs. attack across layers) to train a lightweight classifier for hallucination detection.
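To make the symbolic instantiation concrete, here is a minimal sketch of how the per‑layer support/attack aggregation could be computed, assuming one hidden‑state vector per layer and pre‑identified index sets of support and attack neurons (how those sets are identified is not reproduced here; all function and variable names are illustrative, not taken from the paper).

```python
import numpy as np

def latent_debate_score(hidden_states, support_idx, attack_idx):
    """Aggregate per-layer support and attack activations into a single score.

    hidden_states : list of 1-D arrays, one activation vector per layer.
    support_idx, attack_idx : dicts mapping layer index -> list of neuron
        indices treated as supporting / attacking the candidate answer.
    Returns the total score and the per-layer (support, attack) pairs.
    """
    score = 0.0
    per_layer = []
    for layer, h in enumerate(hidden_states):
        support = float(h[support_idx.get(layer, [])].sum())
        attack = float(h[attack_idx.get(layer, [])].sum())
        per_layer.append((support, attack))
        score += support - attack
    return score, per_layer

def surrogate_predict(hidden_states, support_idx, attack_idx):
    """The sign of the aggregate score gives the surrogate's True/False answer."""
    score, _ = latent_debate_score(hidden_states, support_idx, attack_idx)
    return score > 0
```

The per‑layer (support, attack) pairs collected here are the raw material for the debate‑pattern statistics used in the evaluation pipeline.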
Results & Findings
| Metric | LLM (baseline) | Latent Debate Surrogate |
|---|---|---|
| Accuracy on True/False tasks | 88 % | 86 % |
| Prediction agreement with LLM | — | 95 % |
| Hallucination detection F1 | 0.61 | 0.78 |
- High Fidelity – The surrogate reproduces the LLM’s answer in 95 % of cases, confirming that the support/attack decomposition captures most of the decision signal.
- Hallucination Signals – Samples with a peak of attack‑dominant activations in middle layers are 2.3× more likely to be hallucinated.
- Layer‑wise Insights – Early layers tend to show balanced support/attack (low conflict), while middle layers often exhibit the strongest internal disagreement, which correlates with uncertainty and hallucination risk.
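As an illustration of how such layer‑wise debate patterns could feed a lightweight hallucination detector, the sketch below turns the per‑layer (support, attack) pairs into a small feature vector and fits a standard logistic‑regression classifier. The specific features and classifier choice are assumptions for illustration; the paper’s exact feature set is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def debate_pattern_features(per_layer):
    """Summarise one latent debate as a small feature vector.

    per_layer : list of (support, attack) pairs, one per layer,
        e.g. as returned by latent_debate_score above.
    """
    sup = np.array([s for s, _ in per_layer])
    att = np.array([a for _, a in per_layer])
    n = len(per_layer)
    mid = slice(n // 3, 2 * n // 3)              # rough "middle layers" band
    return np.array([
        np.minimum(sup, att)[mid].mean(),        # mid-layer conflict intensity
        (sup - att).var(),                       # variance of the support/attack margin
        float((att > sup).mean()),               # fraction of attack-dominant layers
    ])

# Hypothetical usage: `patterns` holds per-layer pairs for many answers,
# `is_hallucinated` holds the corresponding binary labels.
# X = np.stack([debate_pattern_features(p) for p in patterns])
# clf = LogisticRegression().fit(X, is_hallucinated)
# risk = clf.predict_proba(X)[:, 1]
```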
Practical Implications
- Debugging LLMs – Developers can visualize latent debates to understand why a model chose a particular answer, making it easier to spot reasoning flaws.
- Safety Filters – The debate‑pattern features can be integrated into production pipelines as a lightweight, model‑agnostic hallucination detector, reducing the need for expensive external verification models.
- Model‑agnostic Auditing – Since the framework works on any transformer‑style LLM without fine‑tuning, it can be applied to closed‑source APIs (e.g., via activation‑logging hooks) for compliance and audit trails; a minimal hook sketch follows this list.
- Guiding Model Design – Insights about which layers tend to harbor conflict could inform architecture tweaks (e.g., adding regularization to middle layers) to mitigate hallucinations.
- Explainable AI Interfaces – The support/attack scores can be exposed to end‑users (e.g., “70 % of the model’s reasoning supports ‘True’”) to increase trust in AI‑assisted decision making.
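For the auditing and user‑facing scenarios above, activations first have to be captured during the forward pass. The following is a minimal sketch, assuming a locally hosted PyTorch transformer whose layer modules are accessible; the helper names and the simple support‑share summary are illustrative assumptions, not the paper’s tooling.

```python
import torch

def attach_activation_logger(layer_modules):
    """Register forward hooks that record each monitored layer's hidden states.

    layer_modules : iterable of submodules (e.g. transformer blocks) to monitor.
    Returns the shared log plus the hook handles so they can be removed later.
    """
    log, handles = [], []
    for module in layer_modules:
        def hook(mod, inputs, output, _log=log):
            hidden = output[0] if isinstance(output, tuple) else output
            _log.append(hidden.detach().cpu())
        handles.append(module.register_forward_hook(hook))
    return log, handles

def support_share(per_layer):
    """User-facing summary: share of total evidence supporting the answer,
    e.g. "70% of the model's reasoning supports 'True'"."""
    support = sum(s for s, _ in per_layer)
    attack = sum(a for _, a in per_layer)
    total = support + attack
    return support / total if total else 0.5
```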
Limitations & Future Work
- Task Scope – The current symbolic instantiation is limited to binary True/False tasks; extending to multi‑class or open‑ended generation remains an open challenge.
- Interpretability Approximation – Mapping neuron activations to “support” or “attack” is a heuristic; it may not capture more nuanced reasoning patterns (e.g., compositional logic).
- Scalability – While activation extraction is cheap, visualizing debates for very large models (hundreds of layers) can become cumbersome without dimensionality reduction techniques.
- Future Directions – The authors suggest (i) learning richer latent‑argument representations, (ii) applying the framework to chain‑of‑thought prompting, and (iii) integrating latent debate signals into training objectives to proactively reduce hallucinations.
Authors
- Lihu Chen
- Xiang Yin
- Francesca Toni
Paper Information
- arXiv ID: 2512.01909v1
- Categories: cs.CL
- Published: December 1, 2025