[Paper] Superposition Is Not Necessary: A Mechanistic Interpretability Analysis of Transformer Representations for Time Series Forecasting

Published: May 6, 2026 at 01:23 PM EDT

Source: arXiv - 2605.05151v1

Overview

This paper investigates why transformer models work so well for time‑series forecasting, a domain where much simpler linear models (e.g., DLinear) are already strong competitors. By probing the internal activations of a state‑of‑the‑art transformer (PatchTST) with sparse autoencoders, the author shows that the network does not rely on the dense, superposed representations that are thought to power transformers in natural‑language processing. In other words, what transformers contribute to forecasting may be far less mysterious, and far less necessary, than previously believed.

Key Contributions

Empirical baseline: Demonstrates that a single‑layer, low‑dimensional transformer matches the forecasting accuracy of deeper, wider variants on standard benchmarks.
Mechanistic probing: Applies sparse autoencoders (SAEs) to the post‑GELU feed‑forward network (FFN) activations of PatchTST, exploring dictionary sizes from 0.5× to 4× the native hidden dimension.
Superposition analysis: Finds that expanding the SAE dictionary yields virtually no change in downstream performance (average +0.214 %) and that many over‑complete latent units stay inactive.
Causal intervention study: Performs targeted manipulations of the dominant latent features; the resulting forecasts barely shift, indicating that the model’s predictions are not tightly coupled to any single latent direction.
Interpretability insight: Concludes that the transformer’s internal representations for time‑series data are sparse and stable, contradicting the hypothesis that strong superposition (dense compositional encoding) is required for high performance.

Methodology

Model selection: The author uses PatchTST, a transformer‑based architecture that processes time‑series patches similarly to image patches. A stripped‑down version (one transformer layer, modest hidden size) is trained on several public forecasting datasets.
Activation collection: After training, the intermediate activations right after the GELU non‑linearity in each FFN block are extracted. These are the hidden FFN activations, captured before the block's second linear layer projects them back to the model dimension.
Sparse autoencoder training: For each activation set, a sparse autoencoder is trained with a dictionary (latent space) of varying size relative to the original hidden dimension (e.g., 0.5×, 1×, 2×, 4×). The SAE learns a sparse code that can reconstruct the original activation with minimal error.
Dictionary analysis: The author measures how many latent units become active, how reconstruction error changes with dictionary size, and whether larger dictionaries improve downstream forecasting when the SAE‑encoded features replace the original activations.
Causal interventions: By zeroing out or perturbing the most active latent dimensions in the SAE code, the study observes the effect on the final forecast, quantifying the causal influence of each latent factor.
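
The sparse autoencoder step can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes a common SAE form (ReLU encoder, linear decoder, mean‑squared reconstruction error plus an L1 sparsity penalty), uses NumPy instead of PyTorch to stay dependency‑light, and shows only the forward pass and objective, not a training loop. The names init_sae, sae_encode, sae_decode, sae_loss and the l1_coeff value are hypothetical, and the activations are random stand‑ins.

```python
import numpy as np

def init_sae(d_hidden, ratio, seed=0):
    """Create SAE parameters for a dictionary of size ratio * d_hidden."""
    rng = np.random.default_rng(seed)
    d_dict = int(ratio * d_hidden)
    scale = 1.0 / np.sqrt(d_hidden)
    return {
        "W_enc": rng.normal(0.0, scale, size=(d_hidden, d_dict)),
        "b_enc": np.zeros(d_dict),
        "W_dec": rng.normal(0.0, scale, size=(d_dict, d_hidden)),
        "b_dec": np.zeros(d_hidden),
    }

def sae_encode(params, acts):
    """Non-negative code: ReLU of an affine map of the centered activations."""
    pre = (acts - params["b_dec"]) @ params["W_enc"] + params["b_enc"]
    return np.maximum(0.0, pre)

def sae_decode(params, code):
    """Linear reconstruction of the activations from the code."""
    return code @ params["W_dec"] + params["b_dec"]

def sae_loss(params, acts, l1_coeff=1e-3):
    """Reconstruction MSE plus an L1 penalty on the code (assumed objective)."""
    code = sae_encode(params, acts)
    recon = sae_decode(params, code)
    mse = np.mean((recon - acts) ** 2)
    l1 = np.mean(np.sum(np.abs(code), axis=1))
    return mse + l1_coeff * l1

# Random stand-in for post-GELU FFN activations: 256 samples, hidden size 128.
acts = np.random.default_rng(1).normal(size=(256, 128))

# Dictionary sizes relative to the hidden dimension, as in the summary above.
for ratio in (0.5, 1, 2, 4):
    params = init_sae(128, ratio)
    code = sae_encode(params, acts)
    print(ratio, code.shape, float(sae_loss(params, acts)))
```

With a hidden size of 128, the four ratios give dictionaries of 64, 128, 256, and 512 latent units. In a real pipeline the parameters would be optimized by gradient descent on sae_loss before any analysis.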

All steps are implemented with standard PyTorch tooling, making the pipeline reproducible for developers familiar with deep‑learning workflows.
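
The causal intervention described above (zeroing the most active latent dimensions and measuring the change in forecast error) might look roughly like this. It is a sketch under stated assumptions: the SAE codes, decoder, forecast head, and targets are random stand‑ins, the forecast function is a hypothetical linear shortcut for the rest of the network, and NumPy is used in place of PyTorch. In the actual model the decoded activations would pass through the remainder of PatchTST.

```python
import numpy as np

def top_k_latents(code, k):
    """Indices of the k latents with the largest mean activation."""
    return np.argsort(code.mean(axis=0))[::-1][:k]

def ablate(code, idx):
    """Return a copy of the code with the chosen latent columns set to zero."""
    out = code.copy()
    out[:, idx] = 0.0
    return out

def relative_mae_change(y_true, y_base, y_ablated):
    """Percent change in MAE caused by the intervention."""
    mae_base = np.mean(np.abs(y_true - y_base))
    mae_abl = np.mean(np.abs(y_true - y_ablated))
    return 100.0 * (mae_abl - mae_base) / mae_base

rng = np.random.default_rng(0)
code = np.maximum(0.0, rng.normal(size=(200, 64)))  # stand-in SAE codes
W_dec = rng.normal(size=(64, 32)) / 8.0              # stand-in SAE decoder
head = rng.normal(size=(32, 24)) / 6.0               # stand-in forecast head
y_true = rng.normal(size=(200, 24))                  # stand-in targets

def forecast(c):
    """Hypothetical downstream path: decode the code, then apply a linear head."""
    return (c @ W_dec) @ head

idx = top_k_latents(code, k=5)
delta = relative_mae_change(y_true, forecast(code), forecast(ablate(code, idx)))
print("ablated latents:", idx)
print(f"relative MAE change: {delta:+.3f} %")
```

Because the data here are random, the printed percentage says nothing about PatchTST; the point is only the shape of the procedure: pick latents, zero them, rerun the downstream path, and compare errors.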

Results & Findings

Experiment	Observation
Single‑layer vs. deep transformer	Forecasting error differences < 0.3 % across all datasets – the shallow model is essentially as good as the deep one.
Dictionary scaling (0.5× → 4×)	Average downstream performance change = +0.214 % (statistically insignificant). Over‑complete dictionaries contain many dead units (> 30 % inactive).
Latent sparsity	Even with a 4× dictionary, only about 10 % of latent units are active on average (i.e., only a few latent neurons fire per time step).
Causal intervention	Zeroing the top‑5 latent dimensions changes MAE/RMSE by < 0.05 % on average – forecasts are remarkably robust to such manipulations.
Superposition test	No evidence that the model’s predictions depend on a dense superposition of many latent features; instead, a handful of stable, sparse codes dominate.

These findings collectively argue that the transformer’s success on typical forecasting benchmarks does not stem from the rich, compositional representations that are central to language modeling.
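
Two of the quantities in the table, the share of dead units and the share of active latents, are simple to compute from a matrix of SAE codes. The snippet below is one plausible definition, not necessarily the one used in the paper: a unit counts as dead if it never exceeds a threshold eps on any sample, and the active fraction is the share of (sample, unit) entries above eps. The function names and the eps parameter are hypothetical.

```python
import numpy as np

def dead_fraction(code, eps=0.0):
    """Share of latent units that never exceed eps on any sample."""
    return float(np.mean(code.max(axis=0) <= eps))

def active_fraction(code, eps=0.0):
    """Average share of latent units above eps per sample."""
    return float(np.mean(code > eps))

# Toy codes: 3 samples (rows) by 4 latent units (columns).
code = np.array([
    [0.0, 1.2, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.0],
    [0.0, 0.3, 0.0, 0.0],
])

print(dead_fraction(code))    # 0.5  (units 0 and 3 never fire)
print(active_fraction(code))  # 0.25 (3 of 12 entries are non-zero)
```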

Practical Implications

Model simplification: Developers can confidently deploy much smaller transformer variants (even a single layer) for many forecasting tasks, reducing memory footprint and inference latency.
Hardware efficiency: The sparse representations suggest that quantization or pruning could be applied aggressively with little loss in accuracy, which would ease deployment on edge devices or low‑power servers.
Hybrid pipelines: Since the representations are not heavily superposed, coupling a lightweight transformer front‑end with a classic linear head (e.g., DLinear) may capture the best of both worlds—fast training, interpretability, and competitive accuracy.
Tooling for debugging: Sparse autoencoders can become a diagnostic tool in production pipelines, allowing engineers to monitor which latent features are active and flag anomalies when unexpected patterns emerge.
Benchmark design: The results suggest that current public forecasting datasets may be “too easy” for testing the full expressive power of transformers. Practitioners seeking to push the envelope should consider more challenging, multi‑scale, or irregularly sampled time‑series data.

Limitations & Future Work

Dataset scope: The study focuses on standard, well‑curated benchmarks (e.g., ETTh, ETTm, Weather). Results may differ on highly noisy, irregular, or multivariate streams common in industry (e.g., IoT sensor networks).
Model family: Only PatchTST’s FFN activations were probed; other transformer variants (e.g., attention‑only, Performer) might exhibit different internal dynamics.
Intervention granularity: The causal tests perturb latent dimensions in isolation; more complex, coordinated interventions could reveal hidden dependencies.
Scalability of SAEs: Training sparse autoencoders on massive, high‑frequency streams could become computationally expensive; future work could explore online or streaming SAE variants.
Beyond forecasting: Extending the mechanistic analysis to related tasks (anomaly detection, imputation, reinforcement‑learning‑based control) would test whether the lack of superposition holds more broadly.

Bottom line for developers: On the standard benchmarks studied, you don’t need a deep, heavily‑parameterized transformer to get state‑of‑the‑art forecasts. A lean, sparsely‑activated model can deliver the same performance, opening the door to faster, cheaper, and more interpretable time‑series solutions.

Authors

Alper Yıldırım

Paper Information

arXiv ID: 2605.05151v1
Categories: cs.LG, cs.AI
Published: May 6, 2026
