[Paper] Step-resolved data attribution for looped transformers
Source: arXiv - 2602.10097v1
Overview
The paper “Step‑resolved data attribution for looped transformers” tackles a blind spot in modern interpretability tools: they tell you which training examples mattered, but not when during a model’s recurrent reasoning they mattered. By unrolling the computation of looped (recurrent) transformers—e.g., GPT‑style models that apply the same block τ times—the authors introduce a fine‑grained influence estimator that reveals the exact iteration at which a training example exerts its effect.
Key Contributions
- Step‑Decomposed Influence (SDI): A novel extension of the TracIn influence estimator that produces a length‑τ trajectory, assigning a separate influence score to each loop iteration (formalised in the notation sketch after this list).
- TensorSketch‑based implementation: Enables SDI to run at transformer scale without ever materialising per‑example gradients, dramatically reducing memory and compute overhead.
- Empirical validation on looped GPT‑style models: Demonstrates that SDI matches full‑gradient baselines (≤ 5 % error) while scaling to billions of parameters.
- Broad applicability: Shows how SDI can be used for data debugging, curriculum design, and probing the latent reasoning steps of algorithmic tasks (e.g., sorting, parity).
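To fix notation before the methodology (the symbols are ours, inferred from this summary rather than quoted from the paper), one plausible formalisation of SDI is the following; summing the trajectory over t recovers the classic TracIn score.

```latex
% TracIn: gradient dot products accumulated over checkpoints c, with
% learning rate \eta_c; z' is a training example and z a test example.
\mathrm{TracIn}(z', z) = \sum_{c} \eta_c \,
  \nabla_w \ell(w_c, z') \cdot \nabla_w \ell(w_c, z)

% In a looped transformer the shared-weight gradient splits across the
% \tau unrolled steps, \nabla_w \ell(w, z') = \sum_{t=1}^{\tau} g_t(w, z').
% SDI keeps the per-step terms of the training gradient separate, giving
% one influence score per loop iteration:
I_t(z', z) = \sum_{c} \eta_c \, g_t(w_c, z') \cdot \nabla_w \ell(w_c, z),
  \qquad t = 1, \dots, \tau
```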
Methodology
- Unroll the recurrent graph: The shared transformer block is applied τ times, creating τ distinct “steps.”
- Decompose TracIn: Classic TracIn scores a training example by accumulating, over saved checkpoints, the dot product between its gradient and the test example's gradient. For a looped block the training gradient is itself a sum of τ per‑iteration contributions; SDI keeps those terms separate instead of summing them, yielding an influence trajectory [I_1, I_2, …, I_τ] (see the first code sketch after this list).
- TensorSketch compression: Instead of storing each per‑example gradient (which would be prohibitive at scale), the authors hash‑project gradients into a low‑dimensional sketch using the TensorSketch algorithm. Because the sketches are additive, step‑wise influence can be recovered by simple inner products in sketch space (see the second code sketch after this list).
- Evaluation pipeline:
  - Train looped transformer models on synthetic algorithmic datasets (e.g., copy, addition, sorting).
  - Compute SDI for a set of test queries and a pool of training examples.
  - Compare against a full‑gradient baseline (exact per‑example gradients) and classic TracIn.
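To make the first two bullets concrete, here is a minimal PyTorch sketch (our construction, not the authors' released code; `per_step_grads` and the other names are ours, and PyTorch ≥ 2.0 is assumed for `torch.func.functional_call`) of splitting a looped block's gradient into per‑iteration contributions:

```python
import torch
from torch.func import functional_call


def per_step_grads(block: torch.nn.Module, x: torch.Tensor,
                   target: torch.Tensor, tau: int, loss_fn) -> list[torch.Tensor]:
    """Return tau flat gradient vectors, one per unrolled loop iteration.

    Trick: give each of the tau applications of the shared block its own
    leaf copy of the tied weights. The forward pass is numerically
    unchanged, but autograd now attributes a separate gradient to each
    step; summing the tau vectors recovers the ordinary shared-weight
    gradient.
    """
    copies = [
        {name: p.detach().clone().requires_grad_(True)
         for name, p in block.named_parameters()}
        for _ in range(tau)
    ]
    h = x
    for t in range(tau):
        h = functional_call(block, copies[t], (h,))  # step t uses copy t
    loss = loss_fn(h, target)
    grads = []
    for t in range(tau):
        gs = torch.autograd.grad(loss, list(copies[t].values()),
                                 retain_graph=True)
        grads.append(torch.cat([g.reshape(-1) for g in gs]))
    return grads
```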
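And a companion sketch of the compression side. The paper uses TensorSketch; as a simpler stand‑in with the two properties this summary relies on (linearity, so sketches are additive, and unbiased inner‑product estimates in sketch space), we substitute a plain CountSketch projection. All names here are hypothetical:

```python
import torch


class CountSketch:
    """Signed-hash random projection (CountSketch). Linear, and
    <sketch(u), sketch(v)> is an unbiased estimate of <u, v>."""

    def __init__(self, dim: int, sketch_dim: int, seed: int = 0):
        g = torch.Generator().manual_seed(seed)
        self.buckets = torch.randint(sketch_dim, (dim,), generator=g)
        self.signs = (torch.randint(0, 2, (dim,), generator=g) * 2 - 1).float()
        self.sketch_dim = sketch_dim

    def __call__(self, v: torch.Tensor) -> torch.Tensor:
        # CPU float32 tensors for simplicity.
        out = torch.zeros(self.sketch_dim)
        out.index_add_(0, self.buckets, self.signs * v)
        return out


def sdi_trajectory(train_step_grads, test_grad, sketch, lr=1.0):
    """Influence trajectory [I_1, ..., I_tau] at one checkpoint: each
    per-step training-gradient sketch dotted with the test-gradient
    sketch, scaled by the learning rate. TracIn-style use would sum
    this over saved checkpoints."""
    s_test = sketch(test_grad)
    return [lr * torch.dot(sketch(g), s_test) for g in train_step_grads]
```

Only the low‑dimensional sketches need to be stored per example, which is where the reported memory savings come from.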
Results & Findings
| Metric | SDI (sketch) | Full‑gradient baseline | Classic TracIn |
|---|---|---|---|
| Mean absolute error (influence) | 0.04 | – | 0.31 |
| Memory usage (per example) | ≈ 0.2 % of full gradients | 100 % | 100 % |
| Runtime overhead (training + attribution) | 1.3× training time | – | 1.9× |
| Correlation with ground‑truth “critical” examples (algorithmic tasks) | 0.87 | 0.89 | 0.62 |
- Step‑wise insight: For a sorting task, the highest influence spikes appear exactly at the iteration where the model performs the “compare‑swap” operation, confirming that SDI pinpoints the reasoning phase.
- Scalability: Experiments on a 1.3 B‑parameter looped GPT‑style model (τ = 12) run on a single 8‑GPU node, whereas the full‑gradient baseline would require > 200 GB of GPU memory.
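To put that memory claim in perspective (back‑of‑envelope arithmetic, ours): a single fp32 gradient for a 1.3 B‑parameter model occupies about 1.3 × 10⁹ × 4 B ≈ 5.2 GB, so exact per‑example gradients for even ~40 training examples already exceed 200 GB, while a ≈ 0.2 % sketch cuts each to roughly 10 MB.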
Practical Implications
- Debugging training data: Developers can now ask “Which training examples caused the model to fail on this specific query, and at which reasoning step?” This is invaluable for spotting mislabeled or adversarial examples that only affect later reasoning stages (a usage sketch follows this list).
- Curriculum learning: By observing the step‑wise influence profile, one can schedule training examples that teach early reasoning steps first, then progressively introduce examples that matter later, potentially accelerating convergence.
- Model auditing & compliance: Regulatory frameworks increasingly demand traceability of model decisions. SDI provides a concrete audit trail linking a decision back to specific data points and the exact internal computation step.
- Improved probing tools: Researchers building probing classifiers for latent reasoning can now condition probes on the step where influence peaks, yielding cleaner, more interpretable signals.
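As a hedged illustration of that debugging workflow, reusing the hypothetical `per_step_grads`, `CountSketch`, and `sdi_trajectory` helpers sketched under Methodology (`block`, `train_pool`, and the test pair are placeholders):

```python
# Rank the training pool by influence magnitude at one reasoning step t
# on a failing test query; top-ranked examples are debugging candidates.
t = 7                     # loop iteration of interest (hypothetical)
dim = sum(p.numel() for p in block.parameters())
sketch = CountSketch(dim, sketch_dim=4096)
test_grad = torch.stack(
    per_step_grads(block, x_test, y_test, tau, loss_fn)).sum(0)

scores = []
for i, (x_tr, y_tr) in enumerate(train_pool):
    traj = sdi_trajectory(per_step_grads(block, x_tr, y_tr, tau, loss_fn),
                          test_grad, sketch)
    scores.append((i, traj[t].item()))

suspects = sorted(scores, key=lambda s: abs(s[1]), reverse=True)[:20]
```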
Limitations & Future Work
- Assumption of a fixed loop count (τ): SDI’s trajectory length equals the number of unrolled steps; models that adaptively decide when to stop (e.g., early‑exit transformers) would need dynamic handling of the trajectory length.
- Sketch approximation error: While negligible in the reported experiments, the TensorSketch introduces bias that could become significant for extremely deep loops (τ ≫ 20) or when gradients are highly sparse.
- Focus on synthetic algorithmic tasks: Real‑world NLP benchmarks (e.g., code generation, dialogue) were not evaluated; extending SDI to those domains is an open direction.
- Integration with existing tooling: The current implementation is a research prototype; packaging SDI as a plug‑in for popular libraries (PyTorch Lightning, Hugging Face) would lower adoption barriers.
Bottom line: Step‑Decomposed Influence opens a new window onto the inner life of looped transformers, giving developers the ability to trace when a training example matters. With its scalable sketch‑based engine, it bridges the gap between academic interpretability research and practical, production‑grade model debugging.
Authors
- Georgios Kaissis
- David Mildenberger
- Juan Felipe Gomez
- Martin J. Menten
- Eleni Triantafillou
Paper Information
- arXiv ID: 2602.10097v1
- Categories: cs.LG, cs.AI
- Published: February 10, 2026