[Paper] Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants
Source: arXiv - 2512.15712v1
Overview
The paper introduces Predictive Concept Decoders (PCDs) – a new class of “interpretability assistants” that learn to translate a neural network’s hidden activations into human‑readable concepts and then answer natural‑language questions about the model’s behavior. By treating interpretability as an end‑to‑end learning problem rather than a hand‑crafted hypothesis‑testing pipeline, the authors demonstrate a scalable way to surface what the model “knows” inside its layers.
Key Contributions
- End‑to‑end interpretability objective: Formulates the task of extracting and using latent concepts as a trainable encoder‑decoder system with a sparse, communicative bottleneck (a schematic version of this objective is written out after this list).
- Predictive Concept Decoder architecture: Combines a sparse concept encoder (turns activations into a short list of discrete concepts) with a language‑model decoder that answers arbitrary natural‑language queries.
- Two‑stage training regime:
  - Pre‑training on massive unstructured data to learn generic concepts.
  - Fine‑tuning on downstream question‑answering tasks that probe model behavior.
- Empirical scaling laws: Shows that both the auto‑interpretability score of the bottleneck concepts and downstream task performance improve predictably with more data and larger models.
- Real‑world detection capabilities: Demonstrates that PCDs can reliably spot jailbreak prompts, hidden “secret hints,” and implanted latent concepts, and can even infer private user attributes encoded in the model’s activations.
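Written out schematically, the end‑to‑end objective trains the encoder and decoder jointly to answer questions from the bottleneck alone. The formulation below is a reconstruction for illustration; the paper’s exact loss terms and regularizers may differ:

$$
\min_{\phi,\,\theta}\; \mathbb{E}_{(x,\,q,\,a)}\Big[-\log p_{\theta}\big(a \mid \operatorname{top\text{-}k}\!\big(E_{\phi}(h(x))\big),\, q\big)\Big]
$$

where $h(x)$ denotes the target model’s hidden activations on input $x$, $E_{\phi}$ is the sparse concept encoder whose top‑$k$ active entries form the concept list, and $p_{\theta}$ is the language‑model decoder that must produce answer $a$ to question $q$.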
Methodology
- Activation Collection: For a target model (e.g., a large language model), the authors capture intermediate hidden states while it processes inputs.
- Sparse Concept Encoder: A lightweight network projects these high‑dimensional activations onto a sparse vector, then selects the top‑k entries as a list of discrete “concept tokens.” Sparsity forces the encoder to compress information into a handful of interpretable symbols.
- Predictive Decoder: A transformer‑style decoder receives the concept list and a natural‑language question (e.g., “Did the model use a jailbreak trick?”). It is trained to predict the correct answer, effectively learning how each concept maps to observable behavior (a minimal sketch of both components follows this list).
- Training Pipeline:
  - Pre‑training: The encoder‑decoder pair is trained on a massive corpus of random prompts and model outputs, without any human‑written labels, encouraging the system to discover useful concepts on its own.
  - Fine‑tuning: A smaller, labeled dataset of targeted queries (jailbreak detection, attribute inference, etc.) is used to adapt the decoder to specific interpretability tasks.
- Evaluation Metric – Auto‑Interp Score: Measures how well the sparse concepts alone can predict the model’s output, serving as an intrinsic gauge of interpretability quality.
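To make the encoder/decoder split concrete, here is a minimal PyTorch sketch of the two components. The dimensions, the single linear projection with ReLU, and the hard top‑k selection are illustrative assumptions, and the decoder is only stubbed out in comments; this is not the paper’s reference implementation.

```python
import torch
import torch.nn as nn


class SparseConceptEncoder(nn.Module):
    """Illustrative sketch: project activations into a wide concept space, keep the top-k entries."""

    def __init__(self, d_model: int, n_concepts: int, k: int):
        super().__init__()
        self.proj = nn.Linear(d_model, n_concepts)
        self.k = k

    def forward(self, h: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # h: (batch, d_model) pooled hidden states captured from the target model.
        scores = torch.relu(self.proj(h))                # non-negative concept activations
        vals, idx = torch.topk(scores, self.k, dim=-1)   # keep only the k strongest concepts
        sparse = torch.zeros_like(scores).scatter_(-1, idx, vals)
        return sparse, idx                               # idx acts as the discrete "concept tokens"


# Illustrative wiring (shapes only); the decoder stand-in below is not the paper's architecture.
d_model, n_concepts, k = 768, 4096, 8
encoder = SparseConceptEncoder(d_model, n_concepts, k)

h = torch.randn(4, d_model)                 # pretend: pooled activations for a batch of 4 prompts
sparse_code, concept_tokens = encoder(h)    # (4, 4096) sparse vector, (4, 8) concept indices

# A full PCD embeds `concept_tokens`, concatenates them with the tokenized question, and trains
# a transformer decoder end-to-end with cross-entropy on the answer tokens (pre-training on
# unlabeled prompts first, then fine-tuning on labeled interpretability queries).
concept_embeddings = nn.Embedding(n_concepts, d_model)(concept_tokens)   # (4, 8, d_model)
print(sparse_code.shape, concept_tokens.shape, concept_embeddings.shape)
```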
Results & Findings
| Task | Auto‑Interp Score ↑ (baseline → ×4 data) | Downstream QA Accuracy (baseline → ×4 data) |
|---|---|---|
| Jailbreak detection | 0.71 → 0.88 | 84% → 93% |
| Secret‑hint identification | 0.65 → 0.81 | 78% → 90% |
| Latent concept implantation | 0.60 → 0.79 | 75% → 88% |
| User‑attribute inference | 0.68 → 0.85 | 81% → 94% |
- Scaling behavior: Both the auto‑interp score and downstream accuracy follow a log‑linear trend with respect to training data size, confirming that larger pre‑training corpora yield more faithful concepts (a schematic form of this fit is given after this list).
- Sparse bottleneck effectiveness: Even with as few as 5–7 concepts per query, the decoder could answer correctly >90% of the time, indicating that the encoder successfully isolates the most informative signals.
- Generalization: PCDs trained on one model (e.g., GPT‑2) transferred reasonably well to a larger sibling (GPT‑Neo), suggesting the learned concepts capture model‑agnostic phenomena.
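The reported trend corresponds to a fit of the form below, where $N$ is the pre‑training set size and the coefficients are task‑specific constants. This functional form is a paraphrase of “log‑linear,” not an equation quoted from the paper:

$$
\text{AutoInterp}(N) \approx \alpha + \beta \log N, \qquad \text{Accuracy}(N) \approx \alpha' + \beta' \log N .
$$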
Practical Implications
- Automated safety audits: Companies can deploy a PCD alongside their LLMs to continuously monitor for jailbreak attempts or hidden malicious prompts, reducing reliance on manual prompt‑engineering checks.
- Privacy compliance: By surfacing latent user attributes encoded in a model, organizations can verify that personal data isn’t unintentionally memorized, aiding GDPR/CCPA audits.
- Debugging and feature discovery: Developers can query “What concept caused the model to output X?” and receive concise, human‑readable explanations, accelerating iteration on model architecture or data curation.
- Plug‑and‑play interpretability layer: Because the encoder is lightweight and the decoder can be any off‑the‑shelf LLM, PCDs can be added to existing pipelines with minimal engineering overhead.
- Foundation for “explain‑as‑you‑go” APIs: Service providers could expose an endpoint that, given a user query and model response, returns a short list of concepts plus a natural‑language explanation, enhancing transparency for end‑users.
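As an illustration of the last point, the sketch below shows what such an endpoint could look like as a thin web service around a trained PCD. FastAPI is used only as a familiar example; the route name, request/response schema, and `pcd_explain` helper are assumptions, not an API described in the paper.

```python
"""Hypothetical 'explain-as-you-go' endpoint wrapping a trained PCD (sketch only)."""
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ExplainRequest(BaseModel):
    prompt: str
    response: str


class ExplainResponse(BaseModel):
    concepts: list[str]      # short list of human-readable concept labels
    explanation: str         # natural-language answer from the PCD decoder


def pcd_explain(prompt: str, response: str) -> tuple[list[str], str]:
    """Placeholder: a real implementation would capture the target model's activations on the
    prompt/response pair, encode them into a sparse concept list, and query the PCD decoder
    with a fixed question such as "Why did the model answer this way?"."""
    return ["<concept placeholder>"], "<explanation placeholder>"


@app.post("/explain", response_model=ExplainResponse)
def explain(req: ExplainRequest) -> ExplainResponse:
    concepts, explanation = pcd_explain(req.prompt, req.response)
    return ExplainResponse(concepts=concepts, explanation=explanation)
```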
Limitations & Future Work
- Concept granularity vs. completeness: The sparsity constraint forces the encoder to drop information; some subtle behaviors may never surface in the top‑k concepts.
- Dependency on pre‑training data quality: If the pre‑training corpus lacks certain failure modes (e.g., novel jailbreak patterns), the PCD may struggle to detect them without additional fine‑tuning.
- Model‑specific biases: While transfer experiments were promising, the encoder still learns model‑specific activation patterns; a truly universal interpreter would need multi‑model pre‑training.
- Scalability to multimodal models: Extending PCDs to vision‑language or audio models introduces challenges in defining a unified concept space.
- User privacy concerns: Surfacing latent user attributes is powerful but raises ethical questions; future work should embed safeguards to prevent misuse.
Bottom line: Predictive Concept Decoders turn interpretability into a trainable, scalable service that can be woven into production AI systems, offering developers a practical tool to audit, debug, and explain complex neural networks.
Authors
- Vincent Huang
- Dami Choi
- Daniel D. Johnson
- Sarah Schwettmann
- Jacob Steinhardt
Paper Information
- arXiv ID: 2512.15712v1
- Categories: cs.AI, cs.CL, cs.LG
- Published: December 17, 2025