[Paper] Mechanistic Interpretability for Transformer-based Time Series Classification
Source: arXiv - 2511.21514v1
Overview
Transformers have taken the lead in time‑series classification, but their black‑box nature makes it hard for engineers to trust or debug them. This paper adapts a suite of mechanistic interpretability tools—originally built for NLP—to peel back the layers of transformer models that operate on sequential sensor data, revealing how and where the model makes its decisions.
Key Contributions
- Cross‑domain adaptation: Ported activation‑patching, attention‑saliency, and sparse autoencoder techniques from language models to time‑series transformers.
- Causal head‑level analysis: Systematically probed individual attention heads and specific timesteps to map their causal impact on the final classification.
- Internal causal graphs: Built visual graphs that trace information flow through the network, pinpointing the most influential heads and temporal positions.
- Interpretable latent features: Demonstrated that sparse autoencoders can extract compact, human‑readable representations of the model’s internal state.
- Benchmark validation: Applied the methodology to a widely‑used time‑series classification benchmark, showing that the interpretability pipeline scales to realistic datasets.
Methodology
- Model & Dataset – The authors trained a standard Vision‑Transformer‑style architecture on the UCR/UEA time‑series classification benchmark (e.g., the “ElectricDevices” dataset).
- Activation Patching – They intervened on hidden activations: for a given test sample, they swapped the activation of a specific head/timestep with that from a reference (correctly classified) sample and measured the change in output probability. This quantifies the causal contribution of that component (a minimal patching sketch follows this list).
- Attention Saliency – By computing gradients of the loss w.r.t. attention scores, they produced heatmaps that highlight which head/timestep pairs the model is most sensitive to (a toy gradient-times-attention example also appears after this list).
- Sparse Autoencoders – A lightweight autoencoder was trained on the transformer’s intermediate activations, with a strong sparsity penalty. The resulting latent dimensions correspond to distinct, reusable patterns (e.g., “spike‑detector” or “trend‑matcher”); a minimal training loop is sketched below.
- Causal Graph Construction – Combining the patching results and saliency maps, they assembled directed graphs where nodes are heads/timesteps and edges encode measured causal influence, offering a high‑level view of information propagation.
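To make the patching step concrete, here is a minimal block-level activation-patching sketch in PyTorch. This is not the authors’ code: the `model.blocks[layer]` module path, the `(batch, time, features)` input layout, and patching at the block output (rather than inside individual heads, which would require the attention module to expose per-head outputs) are assumptions made for illustration.

```python
# Minimal activation-patching sketch (PyTorch). `model.blocks[layer]` and the
# (batch, time, features) layout are assumed, not taken from the paper.
import torch

@torch.no_grad()
def patch_block_output(model, x_test, x_ref, layer, timestep, target_class):
    """Swap the chosen block's output at `timestep` during the test-sample
    forward pass with the activation recorded on a reference (correctly
    classified) sample, and return the change in target-class probability."""
    cache = {}

    def record_hook(module, inputs, output):
        cache["ref"] = output.detach()

    def patch_hook(module, inputs, output):
        patched = output.clone()
        patched[:, timestep, :] = cache["ref"][:, timestep, :]
        return patched  # returning a tensor replaces the block's output

    block = model.blocks[layer]  # assumed module layout

    # 1) Record the clean activation on the reference sample.
    handle = block.register_forward_hook(record_hook)
    model(x_ref)
    handle.remove()

    # 2) Baseline probability on the unpatched test sample.
    p_base = model(x_test).softmax(-1)[0, target_class]

    # 3) Re-run the test sample with the reference activation patched in.
    handle = block.register_forward_hook(patch_hook)
    p_patched = model(x_test).softmax(-1)[0, target_class]
    handle.remove()

    return (p_patched - p_base).item()
```

A large shift in the target-class probability when the reference activation is swapped in indicates that the patched position carries causally relevant information for the prediction.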
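The attention-saliency idea can be illustrated with a self-contained toy example: keep the attention probabilities in the autograd graph, backpropagate the classification loss, and form a gradient-times-attention map. In a real model the attention tensors of each head would be captured (e.g., via hooks or a model that returns them); the single head, random weights, and dimensions below are placeholders.

```python
# Toy gradient-times-attention saliency for a single explicit attention layer.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d, n_classes = 50, 16, 4                    # sequence length, model dim, classes
x = torch.randn(1, T, d, requires_grad=True)   # one (batch, time, features) sample
y = torch.tensor([2])                          # its ground-truth class

Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
W_out = torch.randn(d, n_classes)

# Attention computed explicitly so the probabilities stay in the graph.
q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = ((q @ k.transpose(-1, -2)) / d ** 0.5).softmax(-1)  # (1, T, T)
attn.retain_grad()                                         # keep its gradient

logits = (attn @ v).mean(dim=1) @ W_out        # mean-pooled classification head
loss = F.cross_entropy(logits, y)
loss.backward()

# Saliency: attention weight times its gradient, highlighting which
# (query, key) timestep pairs the loss is most sensitive to.
saliency = (attn * attn.grad).abs().squeeze(0)             # (T, T)
peak = saliency.flatten().argmax().item()
print("most influential (query t, key t):", divmod(peak, T))
```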
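A sparse autoencoder of the kind described above can be sketched as follows. The latent width, L1 coefficient, and the `acts` tensor standing in for cached per-timestep activations are illustrative choices, not values from the paper.

```python
# Minimal sparse autoencoder trained on cached transformer activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, h):
        z = torch.relu(self.encoder(h))   # non-negative code, encouraged to be sparse
        return self.decoder(z), z

# Stand-in for cached activations: one row per (sample, timestep) hidden vector
# taken from a chosen layer. Shapes and coefficients are assumptions.
acts = torch.randn(10_000, 128)

sae = SparseAutoencoder(d_model=128, d_latent=512)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                           # sparsity penalty strength (assumed)

for step in range(1_000):
    batch = acts[torch.randint(0, len(acts), (256,))]
    recon, z = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each latent unit can then be characterised by the input windows that
# maximally activate it (e.g., sharp peaks vs. gradual ramps).
```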
Results & Findings
- Head importance hierarchy: A small subset (≈ 10 % of heads) accounted for > 70 % of the model’s predictive power; these heads consistently attended to early timesteps that contain discriminative motifs.
- Temporal hotspots: Certain timesteps (often the onset of a pattern) were repeatedly identified as causal pivots across multiple classes.
- Sparse latent semantics: The autoencoder’s top latent units aligned with intuitive signal characteristics—e.g., one unit activated on sharp peaks, another on gradual ramps—providing a human‑readable dictionary of features the transformer uses.
- Performance parity: Adding the interpretability pipeline did not degrade classification accuracy (within 0.2 % of the baseline), confirming that the analysis is non‑intrusive.
- Domain alignment: Causal graphs matched known domain knowledge for several datasets (e.g., the QRS complex in the “ECG200” dataset), suggesting the method surfaces genuine signal reasoning rather than spurious correlations.
Practical Implications
- Debugging & model auditing: Engineers can now locate the exact head or timestep responsible for a misclassification, enabling targeted retraining or architecture tweaks.
- Feature engineering shortcuts: The sparse latent features can be exported as lightweight, explainable embeddings for downstream tasks (e.g., anomaly detection) without running the full transformer.
- Regulatory compliance: For industries like healthcare or finance where model transparency is mandated, the causal graphs provide concrete evidence of decision pathways.
- Model compression: Knowing which heads are dispensable opens the door to pruning strategies that shrink model size while preserving accuracy—useful for edge‑device deployments (see the head‑ablation sketch after this list).
- Cross‑domain transfer: The same interpretability toolbox can be applied to any transformer handling sequential data (audio, IoT streams, log files), accelerating trust‑building across domains.
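As a rough illustration of the pruning idea mentioned above, an attention head can be zero-ablated by clearing the columns of the output projection that consume its output. The `model.blocks[layer].attn` path and the concatenated-heads layout are assumptions; the paper only points to pruning as an implication, so this is a sketch under those assumptions, not its method.

```python
# Hedged sketch: disable one attention head in an nn.MultiheadAttention-style
# layer by zeroing the out_proj columns fed by that head.
import torch

@torch.no_grad()
def zero_ablate_head(model, layer: int, head: int, n_heads: int):
    attn = model.blocks[layer].attn          # assumed module layout
    d_model = attn.out_proj.weight.shape[1]  # out_proj input dim = concatenated heads
    head_dim = d_model // n_heads
    cols = slice(head * head_dim, (head + 1) * head_dim)
    attn.out_proj.weight[:, cols] = 0.0      # this head no longer contributes

# Typical loop: ablate the heads the patching analysis rated least causal,
# re-evaluate on a validation split, and keep the pruned model only if the
# accuracy drop stays within budget.
```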
Limitations & Future Work
- Dataset scope: Experiments were limited to a single benchmark suite; broader validation on multivariate, irregularly sampled, or streaming time‑series is needed.
- Scalability of patching: The number of patching interventions scales with the product of layers, heads, and timesteps, making exhaustive patching expensive for very deep or long‑sequence models. Approximate or hierarchical patching strategies are a promising direction.
- Autoencoder interpretability: While latent units showed semantic patterns, a systematic mapping to domain‑specific concepts remains manual; integrating supervised probing could automate this.
- Real‑time applicability: The current pipeline is offline; future work should explore lightweight, on‑the‑fly interpretability for live monitoring systems.
Bottom line: By bringing mechanistic interpretability to transformer‑based time‑series classifiers, the authors give developers a practical lens to see inside these powerful models, paving the way for more trustworthy, efficient, and domain‑aware AI systems.
Authors
- Matīss Kalnāre
- Sofoklis Kitharidis
- Thomas Bäck
- Niki van Stein
Paper Information
- arXiv ID: 2511.21514v1
- Categories: cs.LG, cs.AI
- Published: November 26, 2025