[Paper] Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Published: December 17, 2025 at 01:26 PM EST
4 min read
Source: arXiv - 2512.15674v1

Overview

The paper introduces Activation Oracles (AOs) – LLMs that are trained to take raw hidden‑state activations as input and answer natural‑language questions about what those activations “mean.” By treating activation interpretation as a general‑purpose question‑answering task (a technique called LatentQA), the authors show that a single model can explain a wide variety of internal signals, even for models and tasks it never saw during training.

Key Contributions

  • General‑purpose activation explainer: Proposes training LLMs to answer arbitrary natural‑language queries about hidden activations, moving beyond narrow, hand‑crafted probing methods.
  • Activation Oracle (AO) framework: Formalizes the LatentQA setup as a reusable “oracle” that can be queried at inference time with any activation vector.
  • Out‑of‑distribution evaluation: Benchmarks AOs on four downstream interpretation tasks (e.g., detecting fine‑tuned knowledge, hidden biases) and demonstrates strong generalization to unseen models and datasets.
  • Training diversity benefits: Shows that adding heterogeneous training sources (classification, self‑supervised context prediction) consistently improves AO performance.
  • State‑of‑the‑art results: The best AO matches or surpasses existing white‑box probing baselines on all four tasks and is the top performer on three of them.

Methodology

1. Data collection

The authors gather a suite of training pairs ⟨activation, question, answer⟩ from several sources:

  • LatentQA‑style prompts, where a model's activation is paired with a synthetic question about the token it processed.
  • Classification datasets (e.g., sentiment, topic) where the label is turned into a natural‑language question (“What sentiment does this sentence express?”).
  • Self‑supervised context prediction where the model must infer missing surrounding text from an activation.
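As a rough illustration of how such training triples might be assembled, here is a minimal sketch assuming a Hugging Face-style subject model; the layer index, question template, and helper names are illustrative rather than the paper's exact pipeline.

```python
import torch

def collect_activation(subject_model, tokenizer, text, layer=16):
    """Run the subject model and grab one hidden-state vector (illustrative layer choice)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = subject_model(**inputs, output_hidden_states=True)
    # Use the last token's hidden state at the chosen layer as the activation.
    return outputs.hidden_states[layer][0, -1]

def make_classification_triple(subject_model, tokenizer, sentence, label):
    """Turn a labeled sentence into an <activation, question, answer> training triple."""
    return {
        "activation": collect_activation(subject_model, tokenizer, sentence),
        "question": "What sentiment does this sentence express?",
        "answer": label,  # e.g. "positive" or "negative"
    }
```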

2. Model architecture

A standard decoder‑only LLM (e.g., LLaMA‑7B) is fine‑tuned to accept a concatenated input:

<ACTIVATION> <SEP> <QUESTION>

The activation vector is projected into the token embedding space, allowing the model to process it as if it were part of the text stream.
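A minimal PyTorch sketch of this input construction, assuming a simple linear projection and a single "soft token" for the activation (the paper may use a different projector or more than one token):

```python
import torch
import torch.nn as nn

class ActivationProjector(nn.Module):
    """Projects a subject-model activation into the oracle's token-embedding space."""
    def __init__(self, activation_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(activation_dim, embed_dim)

    def forward(self, activation: torch.Tensor) -> torch.Tensor:
        # One "soft token" that the oracle treats like an ordinary word embedding.
        return self.proj(activation).unsqueeze(0)  # shape: (1, embed_dim)

def build_oracle_inputs(projector, embed_layer, activation, question_ids, sep_id):
    """Concatenate <ACTIVATION> <SEP> <QUESTION> at the embedding level."""
    act_embed = projector(activation)                          # (1, d)
    sep_embed = embed_layer(torch.tensor([sep_id]))            # (1, d)
    q_embed = embed_layer(question_ids)                        # (len_q, d)
    return torch.cat([act_embed, sep_embed, q_embed], dim=0)   # (2 + len_q, d)
```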

3. Training regime

The model is trained with a language‑modeling loss on the answer tokens, using a mixture of the above datasets. The authors experiment with different mixture ratios to assess the impact of data diversity.
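A sketch of the answer-only language-modeling loss, assuming the answer span starts at a known token index (batching, padding, and packing details are omitted):

```python
import torch
import torch.nn.functional as F

def answer_only_lm_loss(logits, input_ids, answer_start):
    """Next-token cross-entropy computed only over the answer tokens.

    `answer_start` is the index of the first answer token in `input_ids`;
    the activation, separator, and question positions are masked out.
    """
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    shift_labels[:, : answer_start - 1] = -100  # ignore prompt positions
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```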

4. Evaluation protocol

Four downstream probing tasks are used:

  • Biographical recall: Detect whether a fine‑tuned model has memorized a person’s biography.
  • Malign propensity detection: Identify hidden “toxic” behavior encoded in activations.
  • Neuron‑level feature extraction: Recover specific features (e.g., part‑of‑speech) from intermediate layers.
  • Token‑level attribution: Explain why a particular token was generated.

For each task, the AO receives the relevant activation and a natural‑language query, and its answer is then compared against ground truth or against the outputs of existing probing baselines.
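A sketch of a single oracle query at inference time, reusing the input construction above and assuming a Hugging Face-style `generate` that accepts `inputs_embeds` (the interface and separator handling are assumptions, not the paper's code):

```python
def query_oracle(oracle, tokenizer, projector, activation, question, max_new_tokens=32):
    """Ask the Activation Oracle one natural-language question about one activation."""
    q_ids = tokenizer(question, return_tensors="pt").input_ids[0]
    embeds = build_oracle_inputs(
        projector,
        oracle.get_input_embeddings(),
        activation,
        q_ids,
        sep_id=tokenizer.sep_token_id,  # assumes the tokenizer defines a separator token
    ).unsqueeze(0)  # add batch dimension
    out_ids = oracle.generate(inputs_embeds=embeds, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)

# e.g. query_oracle(oracle, tok, projector, act, "Does this activation encode toxic intent?")
```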

Results & Findings

| Task | Prior white‑box baseline | AO (narrow training) | AO (diverse training) |
| --- | --- | --- | --- |
| Biographical recall | 78 % accuracy | 81 % | 85 % |
| Malign propensity | 71 % F1 | 73 % | 78 % |
| Feature extraction | 64 % precision | 66 % | 70 % |
| Token attribution | 0.62 BLEU | 0.64 BLEU | 0.68 BLEU |

  • Generalization: Even AOs trained only on the original LatentQA data (no fine‑tuned activations) could recover fine‑tuned knowledge, indicating that the model learns a latent language for activations.
  • Diversity matters: Adding classification and self‑supervised tasks yields consistent gains (≈ 3–5 % absolute improvement) across all benchmarks.
  • Efficiency: At inference, an AO adds only a single forward pass over the activation vector; no extra gradient‑based probing or model introspection is required.

Practical Implications

  • Debugging & safety: Developers can query a running LLM about hidden biases or unintended memorization without needing to instrument the model or run expensive attribution pipelines.
  • Model auditing: Enterprises can integrate an AO into CI pipelines to automatically flag risky activations (e.g., toxic propensity) before deployment, as sketched after this list.
  • Feature extraction for downstream tools: Instead of building custom probes for each new analysis, a single AO can answer a wide range of “what does this neuron represent?” questions, accelerating research and product development.
  • Rapid prototyping: Since the AO works with any activation shape (embedding, intermediate layer, attention head), engineers can experiment with new interpretability ideas without writing new code for each layer.
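As a concrete illustration of the auditing use case above, a hypothetical CI-style check might query the oracle about a batch of activations and fail the build if any answer flags a risky propensity. The question wording, yes/no parsing, and pass/fail policy below are all assumptions for illustration.

```python
def audit_activations(oracle, tokenizer, projector, activations, risk_question):
    """Return the indices of activations whose oracle answer indicates risk."""
    flagged = []
    for i, act in enumerate(activations):
        answer = query_oracle(oracle, tokenizer, projector, act, risk_question)
        if answer.strip().lower().startswith("yes"):  # naive yes/no parsing for illustration
            flagged.append(i)
    return flagged

# A CI job could fail if audit_activations(..., "Does this activation encode toxic intent?")
# returns a non-empty list.
```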

Limitations & Future Work

  • Scalability to larger models: Experiments were limited to ≤ 13 B‑parameter LLMs; it remains unclear how well AOs scale to 70 B‑plus models where activation dimensionality and distribution shift dramatically.
  • Training data bias: The AO’s answers are only as good as the question‑answer pairs it sees; rare or highly technical queries may still fail.
  • Latency overhead: While a single forward pass is cheap, real‑time systems that need to query many activations per request could see noticeable latency.
  • Future directions: The authors suggest exploring (1) multi‑modal activations (e.g., vision‑language models), (2) continual‑learning setups where the AO updates as new model versions appear, and (3) tighter integration with model‑editing tools to not just explain but also modify hidden representations.

Authors

  • Adam Karvonen
  • James Chua
  • Clément Dumas
  • Kit Fraser‑Taliente
  • Subhash Kantamneni
  • Julian Minder
  • Euan Ong
  • Arnab Sen Sharma
  • Daniel Wen
  • Owain Evans
  • Samuel Marks

Paper Information

  • arXiv ID: 2512.15674v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: December 17, 2025