[Paper] Step-resolved data attribution for looped transformers
Source: arXiv - 2602.10097v1
Overview
The paper “Step‑resolved data attribution for looped transformers” tackles a blind spot in modern interpretability tools: they tell you which training examples mattered, but not when during a model’s recurrent reasoning they mattered. By unrolling the computation of looped (recurrent) transformers—e.g., GPT‑style models that apply the same block τ times—the authors introduce a fine‑grained influence estimator that reveals the exact iteration at which a training example exerts its effect.
Key Contributions
- Step‑Decomposed Influence (SDI): A novel extension of the TracIn influence estimator that produces a length‑τ trajectory, assigning a separate influence score to each loop iteration (formalised in the notation sketch after this list).
- TensorSketch‑based implementation: Enables SDI to run at transformer scale without ever materialising per‑example gradients, dramatically reducing memory and compute overhead.
- Empirical validation on looped GPT‑style models: Demonstrates that SDI matches full‑gradient baselines (≤ 5 % error) while scaling to billions of parameters.
- Broad applicability: Shows how SDI can be used for data debugging, curriculum design, and probing the latent reasoning steps of algorithmic tasks (e.g., sorting, parity).
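To fix notation before the methodology (the symbols are ours, inferred from this summary rather than quoted from the paper), one plausible formalisation of SDI is the following; summing the trajectory over t recovers the classic TracIn score.

```latex
% TracIn: gradient dot products accumulated over checkpoints c, with
% learning rate \eta_c; z' is a training example and z a test example.
\mathrm{TracIn}(z', z) = \sum_{c} \eta_c \,
  \nabla_w \ell(w_c, z') \cdot \nabla_w \ell(w_c, z)

% In a looped transformer the shared-weight gradient splits across the
% \tau unrolled steps, \nabla_w \ell(w, z') = \sum_{t=1}^{\tau} g_t(w, z').
% SDI keeps the per-step terms of the training gradient separate, giving
% one influence score per loop iteration:
I_t(z', z) = \sum_{c} \eta_c \, g_t(w_c, z') \cdot \nabla_w \ell(w_c, z),
  \qquad t = 1, \dots, \tau
```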
Methodology
- Unroll the recurrent graph: The shared transformer block is applied τ times, creating τ distinct “steps.”
- Decompose TracIn: Classic TracIn scores a training example by accumulating, over saved checkpoints, the dot product between its gradient and the test example's gradient. For a looped block the training gradient is itself a sum of τ per‑iteration contributions; SDI keeps those terms separate instead of summing them, yielding an influence trajectory [I_1, I_2, …, I_τ] (see the first code sketch after this list).
- TensorSketch compression: Instead of storing each per‑example gradient (which would be prohibitive at scale), the authors hash‑project gradients into a low‑dimensional sketch using the TensorSketch algorithm. Because the sketches are additive, step‑wise influence can be recovered by simple inner products in sketch space (see the second code sketch after this list).
- Evaluation pipeline:
  - Train looped transformer models on synthetic algorithmic datasets (e.g., copy, addition, sorting).
  - Compute SDI for a set of test queries and a pool of training examples.
  - Compare against a full‑gradient baseline (exact per‑example gradients) and classic TracIn.
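To make the first two bullets concrete, here is a minimal PyTorch sketch (our construction, not the authors' released code; `per_step_grads` and the other names are ours, and PyTorch ≥ 2.0 is assumed for `torch.func.functional_call`) of splitting a looped block's gradient into per‑iteration contributions:

```python
import torch
from torch.func import functional_call


def per_step_grads(block: torch.nn.Module, x: torch.Tensor,
                   target: torch.Tensor, tau: int, loss_fn) -> list[torch.Tensor]:
    """Return tau flat gradient vectors, one per unrolled loop iteration.

    Trick: give each of the tau applications of the shared block its own
    leaf copy of the tied weights. The forward pass is numerically
    unchanged, but autograd now attributes a separate gradient to each
    step; summing the tau vectors recovers the ordinary shared-weight
    gradient.
    """
    copies = [
        {name: p.detach().clone().requires_grad_(True)
         for name, p in block.named_parameters()}
        for _ in range(tau)
    ]
    h = x
    for t in range(tau):
        h = functional_call(block, copies[t], (h,))  # step t uses copy t
    loss = loss_fn(h, target)
    grads = []
    for t in range(tau):
        gs = torch.autograd.grad(loss, list(copies[t].values()),
                                 retain_graph=True)
        grads.append(torch.cat([g.reshape(-1) for g in gs]))
    return grads
```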
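And a companion sketch of the compression side. The paper uses TensorSketch; as a simpler stand‑in with the two properties this summary relies on (linearity, so sketches are additive, and unbiased inner‑product estimates in sketch space), we substitute a plain CountSketch projection. All names here are hypothetical:

```python
import torch


class CountSketch:
    """Signed-hash random projection (CountSketch). Linear, and
    <sketch(u), sketch(v)> is an unbiased estimate of <u, v>."""

    def __init__(self, dim: int, sketch_dim: int, seed: int = 0):
        g = torch.Generator().manual_seed(seed)
        self.buckets = torch.randint(sketch_dim, (dim,), generator=g)
        self.signs = (torch.randint(0, 2, (dim,), generator=g) * 2 - 1).float()
        self.sketch_dim = sketch_dim

    def __call__(self, v: torch.Tensor) -> torch.Tensor:
        # CPU float32 tensors for simplicity.
        out = torch.zeros(self.sketch_dim)
        out.index_add_(0, self.buckets, self.signs * v)
        return out


def sdi_trajectory(train_step_grads, test_grad, sketch, lr=1.0):
    """Influence trajectory [I_1, ..., I_tau] at one checkpoint: each
    per-step training-gradient sketch dotted with the test-gradient
    sketch, scaled by the learning rate. TracIn-style use would sum
    this over saved checkpoints."""
    s_test = sketch(test_grad)
    return [lr * torch.dot(sketch(g), s_test) for g in train_step_grads]
```

Only the low‑dimensional sketches need to be stored per example, which is where the reported memory savings come from.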
Results & Findings
| Metric | SDI (sketch) | Full‑gradient baseline | Classic TracIn |
|---|---|---|---|
| Mean absolute error (influence) | 0.04 | – | 0.31 |
| Memory usage (per example) | ≈ 0.2 % of full gradients | 100 % | 100 % |
| Runtime overhead (training + attribution) | 1.3× training time | – | 1.9× |
| Correlation with ground‑truth “critical” examples (algorithmic tasks) | 0.87 | 0.89 | 0.62 |
- Step‑wise insight: For a sorting task, the highest influence spikes appear exactly at the iteration where the model performs the “compare‑swap” operation, confirming that SDI pinpoints the reasoning phase.
- Scalability: Experiments on a 1.3 B‑parameter looped GPT‑style model (τ = 12) run on a single 8‑GPU node, whereas the full‑gradient baseline would require > 200 GB of GPU memory.
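To put that memory claim in perspective (back‑of‑envelope arithmetic, ours): a single fp32 gradient for a 1.3 B‑parameter model occupies about 1.3 × 10⁹ × 4 B ≈ 5.2 GB, so exact per‑example gradients for even ~40 training examples already exceed 200 GB, while a ≈ 0.2 % sketch cuts each to roughly 10 MB.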
Practical Implications
- Debugging training data: Developers can now ask “Which training examples caused the model to fail on this specific query, and at which reasoning step?” This is invaluable for spotting mislabeled or adversarial examples that only affect later reasoning stages (a usage sketch follows this list).
- Curriculum learning: By observing the step‑wise influence profile, one can schedule training examples that teach early reasoning steps first, then progressively introduce examples that matter later, potentially accelerating convergence.
- Model auditing & compliance: Regulatory frameworks increasingly demand traceability of model decisions. SDI provides a concrete audit trail linking a decision back to specific data points and the exact internal computation step.
- Improved probing tools: Researchers building probing classifiers for latent reasoning can now condition probes on the step where influence peaks, yielding cleaner, more interpretable signals.
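As a hedged illustration of that debugging workflow, reusing the hypothetical `per_step_grads`, `CountSketch`, and `sdi_trajectory` helpers sketched under Methodology (`block`, `train_pool`, and the test pair are placeholders):

```python
# Rank the training pool by influence magnitude at one reasoning step t
# on a failing test query; top-ranked examples are debugging candidates.
t = 7                     # loop iteration of interest (hypothetical)
dim = sum(p.numel() for p in block.parameters())
sketch = CountSketch(dim, sketch_dim=4096)
test_grad = torch.stack(
    per_step_grads(block, x_test, y_test, tau, loss_fn)).sum(0)

scores = []
for i, (x_tr, y_tr) in enumerate(train_pool):
    traj = sdi_trajectory(per_step_grads(block, x_tr, y_tr, tau, loss_fn),
                          test_grad, sketch)
    scores.append((i, traj[t].item()))

suspects = sorted(scores, key=lambda s: abs(s[1]), reverse=True)[:20]
```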
Limitations & Future Work
- Assumption of a fixed loop count (τ): SDI’s trajectory length equals the number of unrolled steps; models that adaptively decide when to stop (e.g., early‑exit transformers) would need dynamic handling of the trajectory length.
- Sketch approximation error: While negligible in the reported experiments, the TensorSketch introduces bias that could become significant for extremely deep loops (τ ≫ 20) or when gradients are highly sparse.
- Focus on synthetic algorithmic tasks: Real‑world NLP benchmarks (e.g., code generation, dialogue) were not evaluated; extending SDI to those domains is an open direction.
- Integration with existing tooling: The current implementation is a research prototype; packaging SDI as a plug‑in for popular libraries (PyTorch Lightning, Hugging Face) would lower adoption barriers.
Bottom line: Step‑Decomposed Influence opens a new window onto the inner life of looped transformers, giving developers the ability to trace when a training example matters. With its scalable sketch‑based engine, it bridges the gap between academic interpretability research and practical, production‑grade model debugging.
Authors
- Georgios Kaissis
- David Mildenberger
- Juan Felipe Gomez
- Martin J. Menten
- Eleni Triantafillou
Paper Information
- arXiv ID: 2602.10097v1
- Categories: cs.LG, cs.AI
- Published: February 10, 2026