[Paper] Superposition Is Not Necessary: A Mechanistic Interpretability Analysis of Transformer Representations for Time Series Forecasting
Source: arXiv - 2605.05151v1
Overview
This paper asks what transformer models actually contribute in time‑series forecasting, a domain where much simpler linear models (e.g., DLinear) are already strong competitors. By probing the internal activations of a state‑of‑the‑art transformer (PatchTST) with sparse autoencoders, the author shows that the network does not rely on the dense, superposed representations that are thought to power transformers in natural‑language processing. In other words, the “magic” of transformers for forecasting may be far less mysterious, and far less necessary, than previously believed.
Key Contributions
- Empirical baseline: Demonstrates that a single‑layer, low‑dimensional transformer matches the forecasting accuracy of deeper, wider variants on standard benchmarks.
- Mechanistic probing: Applies sparse autoencoders (SAEs) to the post‑GELU feed‑forward network (FFN) activations of PatchTST, exploring dictionary sizes from 0.5× to 4× the native hidden dimension.
- Superposition analysis: Finds that expanding the SAE dictionary yields virtually no change in downstream performance (average +0.214 %) and that many over‑complete latent units stay inactive.
- Causal intervention study: Performs targeted manipulations of the dominant latent features; the resulting forecasts barely shift, indicating that the model’s predictions are not tightly coupled to any single latent direction.
- Interpretability insight: Concludes that the transformer’s internal representations for time‑series data are sparse and stable, contradicting the hypothesis that strong superposition (dense compositional encoding) is required for high performance.
Methodology
- Model selection: The author uses PatchTST, a transformer‑based architecture that splits each series into patches and treats them as tokens, much as vision transformers treat image patches. A stripped‑down version (one transformer layer, modest hidden size) is trained on several public forecasting datasets.
- Activation collection: After training, the intermediate activations right after the GELU non‑linearity in each FFN block are extracted. These vectors are the FFN's hidden activations, captured before the block's second linear layer projects them back to the model dimension.
- Sparse autoencoder training: For each activation set, a sparse autoencoder is trained with a dictionary (latent space) of varying size relative to the original hidden dimension (e.g., 0.5×, 1×, 2×, 4×). The SAE learns a sparse code from which the original activation can be reconstructed with minimal error.
- Dictionary analysis: The author measures how many latent units become active, how reconstruction error changes with dictionary size, and whether larger dictionaries improve downstream forecasting when the SAE‑encoded features replace the original activations.
- Causal interventions: By zeroing out or perturbing the most active latent dimensions in the SAE code, the study observes the effect on the final forecast, quantifying the causal influence of each latent factor.
All steps are implemented with standard PyTorch tooling, making the pipeline reproducible for developers familiar with deep‑learning workflows.
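The sparse autoencoder step can be illustrated with a minimal NumPy sketch of the forward pass and training objective. This is not the author's code: the ReLU encoder, the L1 penalty, the coefficient `l1_coeff`, the helper names `init_sae` and `sae_forward`, and the random stand-in activations are all assumptions made for illustration, and the gradient-based training loop is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_sae(d_hidden, ratio, rng):
    """Randomly initialise an SAE whose dictionary has ratio * d_hidden units."""
    d_dict = int(ratio * d_hidden)
    return {
        "W_enc": rng.normal(0.0, 0.1, size=(d_hidden, d_dict)),
        "b_enc": np.zeros(d_dict),
        "W_dec": rng.normal(0.0, 0.1, size=(d_dict, d_hidden)),
        "b_dec": np.zeros(d_hidden),
    }

def sae_forward(params, acts, l1_coeff=1e-3):
    """Return (codes, reconstruction, loss) for a batch of activations."""
    # Encoder: affine map followed by ReLU, so codes are non-negative.
    pre = (acts - params["b_dec"]) @ params["W_enc"] + params["b_enc"]
    codes = np.maximum(0.0, pre)
    # Decoder: linear reconstruction of the original activation.
    recon = codes @ params["W_dec"] + params["b_dec"]
    # Objective: reconstruction error plus an L1 sparsity penalty (assumed form).
    mse = np.mean((recon - acts) ** 2)
    l1 = np.mean(np.sum(np.abs(codes), axis=1))
    return codes, recon, mse + l1_coeff * l1

# Stand-in for post-GELU FFN activations with shape (n_tokens, d_hidden).
d_hidden = 16
acts = np.maximum(0.0, rng.normal(size=(256, d_hidden)))

# Sweep the dictionary sizes mentioned in the text.
for ratio in (0.5, 1, 2, 4):
    params = init_sae(d_hidden, ratio, rng)
    codes, recon, loss = sae_forward(params, acts)
    print(ratio, codes.shape, round(float(loss), 4))
```

In a real pipeline, `acts` would be replaced by activations captured from the trained model, and the parameters would be optimised against the returned loss rather than left at their random initial values.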
Results & Findings
| Experiment | Observation |
|---|---|
| Single‑layer vs. deep transformer | Forecasting error differences < 0.3 % across all datasets – the shallow model is essentially as good as the deep one. |
| Dictionary scaling (0.5× → 4×) | Average downstream performance change = +0.214 % (statistically insignificant). Over‑complete dictionaries contain many dead units (> 30 % inactive). |
| Latent sparsity | Even with a 4× dictionary, only about 10 % of latent units are active on average (i.e., only a few latent neurons fire per time step). |
| Causal intervention | Zeroing the top‑5 latent dimensions changes MAE/RMSE by < 0.05 % on average – forecasts are remarkably robust to such manipulations. |
| Superposition test | No evidence that the model’s predictions depend on a dense superposition of many latent features; instead, a handful of stable, sparse codes dominate. |
These findings collectively argue that the transformer’s success on typical forecasting benchmarks does not stem from the rich, compositional representations that are central to language modeling.
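The quantities reported in the table (dead units, fraction of active units, and the relative change in error after zeroing top latents) could be computed roughly as follows. This is a sketch under stated assumptions: the function names, the "largest mean activation" ranking rule, and the toy linear read-out `w` are hypothetical, and the toy numbers are not the paper's results.

```python
import numpy as np

def dead_fraction(codes):
    """Share of dictionary units that never activate on any input."""
    ever_active = (codes > 0).any(axis=0)
    return 1.0 - ever_active.mean()

def active_fraction(codes):
    """Average share of units that are active per input."""
    return (codes > 0).mean()

def ablate_top_k(codes, k):
    """Return a copy with the k units of largest mean activation zeroed."""
    top = np.argsort(codes.mean(axis=0))[-k:]
    out = codes.copy()
    out[:, top] = 0.0
    return out

def mae(pred, target):
    return np.mean(np.abs(pred - target))

def relative_change_pct(base, new):
    """Percentage change of an error metric relative to its baseline."""
    return 100.0 * (new - base) / base

# Toy SAE codes: 3 inputs, 4 dictionary units (units 1 and 2 are dead).
codes = np.array([[1.0, 0.0, 0.0, 2.0],
                  [3.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 4.0]])
w = np.ones(4)                      # hypothetical linear read-out to a forecast
target = np.array([2.0, 4.0, 4.0])

base_mae = mae(codes @ w, target)
ablated_mae = mae(ablate_top_k(codes, 1) @ w, target)

print(dead_fraction(codes))                         # 0.5
print(relative_change_pct(base_mae, ablated_mae))   # about 200 on this toy data
```

The toy read-out depends heavily on one unit, so the ablation effect here is large; the paper's finding is that the corresponding change on PatchTST is very small.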
Practical Implications
- Model simplification: Developers can confidently deploy much smaller transformer variants (even a single layer) for many forecasting tasks, reducing memory footprint and inference latency.
- Hardware efficiency: Sparse representations suggest that quantization or pruning techniques could be applied aggressively with little loss of accuracy, enabling deployment on edge devices or low‑power servers.
- Hybrid pipelines: Since the representations are not heavily superposed, coupling a lightweight transformer front‑end with a classic linear head (e.g., DLinear) may capture the best of both worlds: fast training, interpretability, and competitive accuracy.
- Tooling for debugging: Sparse autoencoders can become a diagnostic tool in production pipelines, allowing engineers to monitor which latent features are active and flag anomalies when unexpected patterns emerge.
- Benchmark design: The results suggest that current public forecasting datasets may be “too easy” for testing the full expressive power of transformers. Practitioners seeking to push the envelope should consider more challenging, multi‑scale, or irregularly sampled time‑series data.
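As one way to picture the diagnostic use of sparse autoencoders mentioned above, the sketch below flags inputs that activate latent units rarely seen on reference data. The paper does not describe this procedure; the function names, the firing-rate baseline, and the `rare` threshold are hypothetical choices for illustration.

```python
import numpy as np

def baseline_rates(ref_codes):
    """Per-unit firing rate measured on reference (healthy) data."""
    return (ref_codes > 0).mean(axis=0)

def flag_unusual(codes, rates, rare=0.01):
    """Mark inputs that activate at least one unit rarely active in reference data."""
    rare_units = rates < rare
    return ((codes > 0) & rare_units).any(axis=1)

# Toy reference codes: unit 2 never fires during normal operation.
ref_codes = np.array([[1.0, 0.0, 0.0],
                      [2.0, 1.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [3.0, 1.0, 0.0]])
rates = baseline_rates(ref_codes)

# New inputs: the second one activates the normally silent unit.
new_codes = np.array([[1.0, 0.0, 0.0],
                      [0.0, 0.0, 5.0]])
flags = flag_unusual(new_codes, rates)
print(flags.tolist())  # [False, True]
```

A production monitor would likely need a more robust statistic than a single threshold, but the same idea applies: compare which latents fire now against which latents fired on trusted data.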
Limitations & Future Work
- Dataset scope: The study focuses on standard, well‑curated benchmarks (e.g., ETTh, ETTm, Weather). Results may differ on highly noisy, irregular, or multivariate streams common in industry (e.g., IoT sensor networks).
- Model family: Only PatchTST’s FFN activations were probed; other transformer variants (e.g., attention‑only, Performer) might exhibit different internal dynamics.
- Intervention granularity: The causal tests perturb latent dimensions in isolation; more complex, coordinated interventions could reveal hidden dependencies.
- Scalability of SAEs: Training sparse autoencoders on massive, high‑frequency streams could become computationally expensive; future work could explore online or streaming SAE variants.
- Beyond forecasting: Extending the mechanistic analysis to related tasks (anomaly detection, imputation, reinforcement‑learning‑based control) would test whether the lack of superposition holds more broadly.
Bottom line for developers: On the standard benchmarks studied, you don’t need a deep, heavily parameterized transformer to get state‑of‑the‑art forecasts. A lean, sparsely activated model can deliver comparable performance, opening the door to faster, cheaper, and more interpretable time‑series solutions.
Authors
- Alper Yıldırım
Paper Information
- arXiv ID: 2605.05151v1
- Categories: cs.LG, cs.AI
- Published: May 6, 2026