[Paper] Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants
Source: arXiv - 2512.15712v1
Overview
The paper introduces Predictive Concept Decoders (PCDs) – a new class of “interpretability assistants” that learn to translate a neural network’s hidden activations into human‑readable concepts and then answer natural‑language questions about the model’s behavior. By treating interpretability as an end‑to‑end learning problem rather than a hand‑crafted hypothesis‑testing pipeline, the authors demonstrate a scalable way to surface what the model “knows” inside its layers.
Key Contributions
- End‑to‑end interpretability objective: Formulates the task of extracting and using latent concepts as a trainable encoder‑decoder system with a sparse, communicative bottleneck (a schematic version of this objective is written out after this list).
- Predictive Concept Decoder architecture: Combines a sparse concept encoder (turns activations into a short list of discrete concepts) with a language‑model decoder that answers arbitrary natural‑language queries.
- Two‑stage training regime:
  - Pre‑training on massive unstructured data to learn generic concepts.
  - Fine‑tuning on downstream question‑answering tasks that probe model behavior.
- Empirical scaling laws: Shows that both the auto‑interpretability score of the bottleneck concepts and downstream task performance improve predictably with more data and larger models.
- Real‑world detection capabilities: Demonstrates that PCDs can reliably spot jailbreak prompts, hidden “secret hints,” and implanted latent concepts, and can even infer private user attributes encoded in the model’s activations.
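Written out schematically, the end‑to‑end objective trains the encoder and decoder jointly to answer questions from the bottleneck alone. The formulation below is a reconstruction for illustration; the paper’s exact loss terms and regularizers may differ:

$$
\min_{\phi,\,\theta}\; \mathbb{E}_{(x,\,q,\,a)}\Big[-\log p_{\theta}\big(a \mid \operatorname{top\text{-}k}\!\big(E_{\phi}(h(x))\big),\, q\big)\Big]
$$

where $h(x)$ denotes the target model’s hidden activations on input $x$, $E_{\phi}$ is the sparse concept encoder whose top‑$k$ active entries form the concept list, and $p_{\theta}$ is the language‑model decoder that must produce answer $a$ to question $q$.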
Methodology
- Activation Collection: For a target model (e.g., a large language model), the authors capture intermediate hidden states while it processes inputs.
- Sparse Concept Encoder: A lightweight network projects these high‑dimensional activations onto a sparse vector, then selects the top‑k entries as a list of discrete “concept tokens.” Sparsity forces the encoder to compress information into a handful of interpretable symbols.
- Predictive Decoder: A transformer‑style decoder receives the concept list and a natural‑language question (e.g., “Did the model use a jailbreak trick?”). It is trained to predict the correct answer, effectively learning how each concept maps to observable behavior (a minimal sketch of both components follows this list).
- Training Pipeline:
  - Pre‑training: The encoder‑decoder pair is trained on a massive corpus of random prompts and model outputs, without any human‑written labels, encouraging the system to discover useful concepts on its own.
  - Fine‑tuning: A smaller, labeled dataset of targeted queries (jailbreak detection, attribute inference, etc.) is used to adapt the decoder to specific interpretability tasks.
- Evaluation Metric – Auto‑Interp Score: Measures how well the sparse concepts alone can predict the model’s output, serving as an intrinsic gauge of interpretability quality.
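To make the encoder/decoder split concrete, here is a minimal PyTorch sketch of the two components. The dimensions, the single linear projection with ReLU, and the hard top‑k selection are illustrative assumptions, and the decoder is only stubbed out in comments; this is not the paper’s reference implementation.

```python
import torch
import torch.nn as nn


class SparseConceptEncoder(nn.Module):
    """Illustrative sketch: project activations into a wide concept space, keep the top-k entries."""

    def __init__(self, d_model: int, n_concepts: int, k: int):
        super().__init__()
        self.proj = nn.Linear(d_model, n_concepts)
        self.k = k

    def forward(self, h: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # h: (batch, d_model) pooled hidden states captured from the target model.
        scores = torch.relu(self.proj(h))                # non-negative concept activations
        vals, idx = torch.topk(scores, self.k, dim=-1)   # keep only the k strongest concepts
        sparse = torch.zeros_like(scores).scatter_(-1, idx, vals)
        return sparse, idx                               # idx acts as the discrete "concept tokens"


# Illustrative wiring (shapes only); the decoder stand-in below is not the paper's architecture.
d_model, n_concepts, k = 768, 4096, 8
encoder = SparseConceptEncoder(d_model, n_concepts, k)

h = torch.randn(4, d_model)                 # pretend: pooled activations for a batch of 4 prompts
sparse_code, concept_tokens = encoder(h)    # (4, 4096) sparse vector, (4, 8) concept indices

# A full PCD embeds `concept_tokens`, concatenates them with the tokenized question, and trains
# a transformer decoder end-to-end with cross-entropy on the answer tokens (pre-training on
# unlabeled prompts first, then fine-tuning on labeled interpretability queries).
concept_embeddings = nn.Embedding(n_concepts, d_model)(concept_tokens)   # (4, 8, d_model)
print(sparse_code.shape, concept_tokens.shape, concept_embeddings.shape)
```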
Results & Findings
| Task | Auto‑Interp Score ↑ (baseline → ×4 data) | Downstream QA Accuracy (baseline → ×4 data) |
|---|---|---|
| Jailbreak detection | 0.71 → 0.88 | 84% → 93% |
| Secret‑hint identification | 0.65 → 0.81 | 78% → 90% |
| Latent concept implantation | 0.60 → 0.79 | 75% → 88% |
| User‑attribute inference | 0.68 → 0.85 | 81% → 94% |
- Scaling behavior: Both the auto‑interp score and downstream accuracy follow a log‑linear trend with respect to training data size, confirming that larger pre‑training corpora yield more faithful concepts (a schematic form of this fit is given after this list).
- Sparse bottleneck effectiveness: Even with as few as 5–7 concepts per query, the decoder could answer correctly >90% of the time, indicating that the encoder successfully isolates the most informative signals.
- Generalization: PCDs trained on one model (e.g., GPT‑2) transferred reasonably well to a larger sibling (GPT‑Neo), suggesting the learned concepts capture model‑agnostic phenomena.
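The reported trend corresponds to a fit of the form below, where $N$ is the pre‑training set size and the coefficients are task‑specific constants. This functional form is a paraphrase of “log‑linear,” not an equation quoted from the paper:

$$
\text{AutoInterp}(N) \approx \alpha + \beta \log N, \qquad \text{Accuracy}(N) \approx \alpha' + \beta' \log N .
$$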
Practical Implications
- Automated safety audits: Companies can deploy a PCD alongside their LLMs to continuously monitor for jailbreak attempts or hidden malicious prompts, reducing reliance on manual prompt‑engineering checks.
- Privacy compliance: By surfacing latent user attributes encoded in a model, organizations can verify that personal data isn’t unintentionally memorized, aiding GDPR/CCPA audits.
- Debugging and feature discovery: Developers can query “What concept caused the model to output X?” and receive concise, human‑readable explanations, accelerating iteration on model architecture or data curation.
- Plug‑and‑play interpretability layer: Because the encoder is lightweight and the decoder can be any off‑the‑shelf LLM, PCDs can be added to existing pipelines with minimal engineering overhead.
- Foundation for “explain‑as‑you‑go” APIs: Service providers could expose an endpoint that, given a user query and model response, returns a short list of concepts plus a natural‑language explanation, enhancing transparency for end‑users.
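As an illustration of the last point, the sketch below shows what such an endpoint could look like as a thin web service around a trained PCD. FastAPI is used only as a familiar example; the route name, request/response schema, and `pcd_explain` helper are assumptions, not an API described in the paper.

```python
"""Hypothetical 'explain-as-you-go' endpoint wrapping a trained PCD (sketch only)."""
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ExplainRequest(BaseModel):
    prompt: str
    response: str


class ExplainResponse(BaseModel):
    concepts: list[str]      # short list of human-readable concept labels
    explanation: str         # natural-language answer from the PCD decoder


def pcd_explain(prompt: str, response: str) -> tuple[list[str], str]:
    """Placeholder: a real implementation would capture the target model's activations on the
    prompt/response pair, encode them into a sparse concept list, and query the PCD decoder
    with a fixed question such as "Why did the model answer this way?"."""
    return ["<concept placeholder>"], "<explanation placeholder>"


@app.post("/explain", response_model=ExplainResponse)
def explain(req: ExplainRequest) -> ExplainResponse:
    concepts, explanation = pcd_explain(req.prompt, req.response)
    return ExplainResponse(concepts=concepts, explanation=explanation)
```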
Limitations & Future Work
- Concept granularity vs. completeness: The sparsity constraint forces the encoder to drop information; some subtle behaviors may never surface in the top‑k concepts.
- Dependency on pre‑training data quality: If the pre‑training corpus lacks certain failure modes (e.g., novel jailbreak patterns), the PCD may struggle to detect them without additional fine‑tuning.
- Model‑specific biases: While transfer experiments were promising, the encoder still learns model‑specific activation patterns; a truly universal interpreter would need multi‑model pre‑training.
- Scalability to multimodal models: Extending PCDs to vision‑language or audio models introduces challenges in defining a unified concept space.
- User privacy concerns: Surfacing latent user attributes is powerful but raises ethical questions; future work should embed safeguards to prevent misuse.
Bottom line: Predictive Concept Decoders turn interpretability into a trainable, scalable service that can be woven into production AI systems, offering developers a practical tool to audit, debug, and explain complex neural networks.
Authors
- Vincent Huang
- Dami Choi
- Daniel D. Johnson
- Sarah Schwettmann
- Jacob Steinhardt
Paper Information
- arXiv ID: 2512.15712v1
- Categories: cs.AI, cs.CL, cs.LG
- Published: December 17, 2025