[Paper] Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants

Published: December 17, 2025 at 01:59 PM EST
4 min read

Source: arXiv - 2512.15712v1

Overview

The paper introduces Predictive Concept Decoders (PCDs) – a new class of “interpretability assistants” that learn to translate a neural network’s hidden activations into human‑readable concepts and then answer natural‑language questions about the model’s behavior. By treating interpretability as an end‑to‑end learning problem rather than a hand‑crafted hypothesis‑testing pipeline, the authors demonstrate a scalable way to surface what the model “knows” inside its layers.

Key Contributions

  • End‑to‑end interpretability objective: Formulates the task of extracting and using latent concepts as a trainable encoder‑decoder system with a sparse, communicative bottleneck.
  • Predictive Concept Decoder architecture: Combines a sparse concept encoder (turns activations into a short list of discrete concepts) with a language‑model decoder that answers arbitrary natural‑language queries.
  • Two‑stage training regime:
    1. Pre‑training on massive unstructured data to learn generic concepts.
    2. Fine‑tuning on downstream question‑answering tasks that probe model behavior.
  • Empirical scaling laws: Shows that both the auto‑interpretability score of the bottleneck concepts and downstream task performance improve predictably with more data and larger models.
  • Real‑world detection capabilities: Demonstrates that PCDs can reliably spot jailbreak prompts, hidden “secret hints,” implanted latent concepts, and even infer private user attributes encoded in the model.

Methodology

  1. Activation Collection: For a target model (e.g., a large language model), the authors capture intermediate hidden states while it processes inputs.
  2. Sparse Concept Encoder: A lightweight network projects these high‑dimensional activations onto a sparse vector, then selects the top‑k entries as a list of discrete “concept tokens.” Sparsity forces the encoder to compress information into a handful of interpretable symbols.
  3. Predictive Decoder: A transformer‑style decoder receives the concept list and a natural‑language question (e.g., “Did the model use a jailbreak trick?”). It is trained to predict the correct answer, effectively learning how each concept maps to observable behavior (a minimal sketch of steps 2–3 appears after this list).
  4. Training Pipeline:
    • Pre‑training: The encoder‑decoder pair is trained on a massive corpus of random prompts and model outputs, without any human‑written labels, encouraging the system to discover useful concepts on its own.
    • Fine‑tuning: A smaller, labeled dataset of targeted queries (jailbreak detection, attribute inference, etc.) is used to adapt the decoder to specific interpretability tasks.
  5. Evaluation Metric – Auto‑Interp Score: Measures how well the sparse concepts alone can predict the model’s output, serving as an intrinsic gauge of interpretability quality.
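
To make steps 2 and 3 concrete, here is a minimal PyTorch‑style sketch. The sizes (d_model, n_concepts, k) and the toy classifier standing in for the language‑model decoder are assumptions for illustration; the paper’s actual architecture, decoder, and training objective are not reproduced here.

```python
# Minimal sketch of the PCD forward pass (steps 2-3 above).
# Dimensions, class names, and the toy decoder are illustrative assumptions,
# not the paper's exact architecture or hyperparameters.
import torch
import torch.nn as nn


class SparseConceptEncoder(nn.Module):
    """Projects hidden activations onto a concept vocabulary and keeps the top-k."""

    def __init__(self, d_model: int, n_concepts: int, k: int):
        super().__init__()
        self.proj = nn.Linear(d_model, n_concepts)
        self.k = k

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # activations: (batch, d_model) hidden states captured from the target model
        scores = self.proj(activations)            # (batch, n_concepts)
        topk = torch.topk(scores, self.k, dim=-1)  # sparse bottleneck
        # Note: the discrete indices block gradients; a real end-to-end setup
        # needs some differentiable relaxation or straight-through trick.
        return topk.indices                        # discrete "concept tokens"


class ToyPredictiveDecoder(nn.Module):
    """Stand-in for the language-model decoder.

    In the paper this role is played by an LM conditioned on the concept list
    plus a natural-language question; here a tiny classifier over answer
    options keeps the sketch self-contained.
    """

    def __init__(self, n_concepts: int, d_embed: int, n_answers: int):
        super().__init__()
        self.concept_embed = nn.Embedding(n_concepts, d_embed)
        self.head = nn.Linear(d_embed, n_answers)

    def forward(self, concept_tokens: torch.Tensor) -> torch.Tensor:
        # concept_tokens: (batch, k) indices produced by the encoder
        pooled = self.concept_embed(concept_tokens).mean(dim=1)
        return self.head(pooled)                   # logits over candidate answers


if __name__ == "__main__":
    d_model, n_concepts, k = 768, 4096, 6          # assumed sizes
    encoder = SparseConceptEncoder(d_model, n_concepts, k)
    decoder = ToyPredictiveDecoder(n_concepts, d_embed=64, n_answers=2)

    hidden = torch.randn(4, d_model)   # placeholder for step 1's captured activations
    concepts = encoder(hidden)         # step 2: sparse concept tokens, shape (4, 6)
    logits = decoder(concepts)         # step 3: answer a yes/no query, shape (4, 2)
    print(concepts.shape, logits.shape)
```

The point of the sketch is the shape of the interface: high‑dimensional activations in, a handful of discrete concept tokens through the bottleneck, and an answer to a behavioral query out.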

Results & Findings

| Task | Auto‑Interp ↑ (with more data) | Downstream QA Accuracy |
|---|---|---|
| Jailbreak detection | 0.71 → 0.88 (×4 data) | 84% → 93% |
| Secret‑hint identification | 0.65 → 0.81 | 78% → 90% |
| Latent concept implantation | 0.60 → 0.79 | 75% → 88% |
| User‑attribute inference | 0.68 → 0.85 | 81% → 94% |
  • Scaling behavior: Both the auto‑interp score and downstream accuracy follow a log‑linear trend with respect to training data size, confirming that larger pre‑training corpora yield more faithful concepts (a small fit sketch follows this list).
  • Sparse bottleneck effectiveness: Even with as few as 5–7 concepts per query, the decoder could answer correctly >90% of the time, indicating that the encoder successfully isolates the most informative signals.
  • Generalization: PCDs trained on one model (e.g., GPT‑2) transferred reasonably well to a larger sibling (GPT‑Neo), suggesting the learned concepts capture model‑agnostic phenomena.
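
To spell out what a log‑linear trend implies, the snippet below fits score ≈ slope·log2(data) + intercept to the two jailbreak‑detection endpoints from the table above; the resulting slope is an illustration of the claimed trend, not a figure reported in the paper.

```python
# Illustrative log-linear fit using only the two jailbreak-detection
# endpoints from the table above; not a reproduction of the paper's fits.
import numpy as np

rel_data = np.array([1.0, 4.0])       # relative pre-training data size
auto_interp = np.array([0.71, 0.88])  # auto-interp score at each scale

# Log-linear model: score ~ slope * log2(data) + intercept
slope, intercept = np.polyfit(np.log2(rel_data), auto_interp, deg=1)
print(f"auto-interp gain per doubling of data: {slope:.3f}")  # ~0.085
```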

Practical Implications

  • Automated safety audits: Companies can deploy a PCD alongside their LLMs to continuously monitor for jailbreak attempts or hidden malicious prompts, reducing reliance on manual prompt‑engineering checks.
  • Privacy compliance: By surfacing latent user attributes encoded in a model, organizations can verify that personal data isn’t unintentionally memorized, aiding GDPR/CCPA audits.
  • Debugging and feature discovery: Developers can query “What concept caused the model to output X?” and receive concise, human‑readable explanations, accelerating iteration on model architecture or data curation.
  • Plug‑and‑play interpretability layer: Because the encoder is lightweight and the decoder can be any off‑the‑shelf LLM, PCDs can be added to existing pipelines with minimal engineering overhead.
  • Foundation for “explain‑as‑you‑go” APIs: Service providers could expose an endpoint that, given a user query and model response, returns a short list of concepts plus a natural‑language explanation, enhancing transparency for end‑users (a sketch of such an endpoint follows this list).
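
As a rough illustration of how a PCD could sit behind such an endpoint, the sketch below wires the three pipeline stages into a single call. The names (explain_response, ConceptReport) and the callback‑based interface are assumptions for illustration, not an API defined by the paper.

```python
# Hypothetical shape of an "explain-as-you-go" endpoint built around a PCD.
# The names and wiring are assumptions, not an interface from the paper.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ConceptReport:
    concepts: List[str]   # human-readable labels for the top-k concept tokens
    explanation: str      # decoder's natural-language answer to the query


def explain_response(
    capture_activations: Callable[[str], object],      # step 1: hook into the target model
    encode_concepts: Callable[[object], List[str]],    # step 2: sparse concept encoder
    answer_query: Callable[[List[str], str], str],     # step 3: predictive decoder
    user_prompt: str,
    question: str = "Why did the model respond this way?",
) -> ConceptReport:
    """Run the PCD pipeline alongside a normal model call and package the result."""
    activations = capture_activations(user_prompt)
    concepts = encode_concepts(activations)
    explanation = answer_query(concepts, question)
    return ConceptReport(concepts=concepts, explanation=explanation)
```

Because the encoder is lightweight and the decoder can be swapped for an off‑the‑shelf LLM, the monitoring call can run alongside normal inference rather than replacing it.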

Limitations & Future Work

  • Concept granularity vs. completeness: The sparsity constraint forces the encoder to drop information; some subtle behaviors may never surface in the top‑k concepts.
  • Dependency on pre‑training data quality: If the pre‑training corpus lacks certain failure modes (e.g., novel jailbreak patterns), the PCD may struggle to detect them without additional fine‑tuning.
  • Model‑specific biases: While transfer experiments were promising, the encoder still learns model‑specific activation patterns; a truly universal interpreter would need multi‑model pre‑training.
  • Scalability to multimodal models: Extending PCDs to vision‑language or audio models introduces challenges in defining a unified concept space.
  • User privacy concerns: Surfacing latent user attributes is powerful but raises ethical questions; future work should embed safeguards to prevent misuse.

Bottom line: Predictive Concept Decoders turn interpretability into a trainable, scalable service that can be woven into production AI systems, offering developers a practical tool to audit, debug, and explain complex neural networks.

Authors

  • Vincent Huang
  • Dami Choi
  • Daniel D. Johnson
  • Sarah Schwettmann
  • Jacob Steinhardt

Paper Information

  • arXiv ID: 2512.15712v1
  • Categories: cs.AI, cs.CL, cs.LG
  • Published: December 17, 2025