[Paper] Subjective Depth and Timescale Transformers: Learning Where and When to Compute
Source: arXiv - 2511.21408v1
Overview
Transformers have become the workhorse of modern AI, but their “one-size-fits-all” compute pattern, in which every token passes through every layer and attention is computed over the full context at each one, can be wasteful, especially for long sequences or massive models. This paper proposes two new Transformer variants that learn when and where to spend computation, cutting unnecessary work while preserving performance.
Key Contributions
- Subjective Depth Transformers (SDT) – Introduces alternating Decision and Dynamic layers that use Bayesian surprise to decide which tokens need a full‑fidelity transformer block and which can be processed with a cheap “prior” approximation.
- Subjective Timescale Transformers (STT) – Extends the idea to the temporal dimension, letting a router skip or execute whole transformer blocks for each token based on a learned “change hypothesis.”
- Bayesian surprise signals (Expected and Unexpected Change) serve as the gating criterion, providing a principled way to separate novel inputs from predictable ones (a minimal sketch of such a score follows this list).
- Static compute graph – Despite dynamic routing, the overall graph remains static, simplifying deployment on existing hardware and compiler stacks.
- Efficiency gains – Experiments show up to 75 % reduction in self‑attention FLOPs and ≈50 % cut in KV‑cache usage per compute‑skipping layer, with only modest accuracy loss.
- Empirical evidence of learning dynamics – The models gradually shift from novelty‑driven gating early in training to prediction‑driven gating later, mirroring theoretical expectations about surprise‑based processing.
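To make the Bayesian-surprise criterion above concrete, here is a minimal sketch of a per-token surprise score. It assumes, purely for illustration, that the prior and posterior paths each parameterise a diagonal Gaussian over the token state and that surprise is their KL divergence; the paper's actual Expected/Unexpected Change signals may be defined differently, and the function name is invented.

```python
import torch

def token_surprise(prior_mu, prior_logvar, post_mu, post_logvar):
    """Per-token surprise as KL(posterior || prior) between diagonal Gaussians.
    All inputs have shape (batch, seq, dim); the result has shape (batch, seq).
    Illustrative proxy only, not the paper's exact formulation."""
    var_p, var_q = prior_logvar.exp(), post_logvar.exp()
    kl = 0.5 * (
        prior_logvar - post_logvar                       # log(var_p / var_q)
        + (var_q + (post_mu - prior_mu) ** 2) / var_p    # variance ratio + scaled mean shift
        - 1.0
    )
    return kl.sum(dim=-1)  # aggregate over the feature dimension
```

Tokens whose score exceeds a threshold, or ranks in the Top-K, would then be routed to the expensive path, as the Methodology section below describes.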
Methodology
- Decision Layer (SDT) – Computes two parallel representations for each token, along with a Bayesian surprise score that measures how much the posterior deviates from the prior:
  - A posterior (full transformer block) that captures rich context.
  - A prior (lightweight linear projection) that serves as a cheap fallback.
- Dynamic Layer (SDT) – Uses a fixed-capacity Top-K router that selects the K tokens with the highest surprise scores to receive the expensive posterior computation; the rest use the prior. Because K is fixed, every layer does a constant amount of work and the overall compute graph stays static (a minimal sketch of this Decision/Dynamic pairing follows this list).
- Transition Network (STT) – Predicts a residual update for each token, forming a hypothesis about how the token’s representation will change over time.
- Temporal Router (STT) – Compares the predicted change to the actual change (again via a surprise metric). If a token is deemed “stable,” the router bypasses the transformer block for that timestep and reuses the cached key/value entries; otherwise, it executes the block (see the STT sketch after this list).
- Training – Both architectures are trained end-to-end with standard language modeling objectives. The surprise-based gates are made differentiable via straight-through estimators, allowing the model to learn its routing policy from data.
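The following is a minimal sketch of the Decision/Dynamic pairing described above, written in PyTorch under several assumptions: the prior is a single linear projection, surprise is approximated by the squared distance between posterior and prior states, and the class names, argument names, and capacity value are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

class DecisionLayer(nn.Module):
    """Computes a full posterior, a cheap prior, and a per-token surprise score."""
    def __init__(self, block: nn.Module, d_model: int):
        super().__init__()
        self.block = block                        # full transformer block (posterior path)
        self.prior = nn.Linear(d_model, d_model)  # lightweight fallback (prior path)

    def forward(self, x):
        posterior = self.block(x)                 # (batch, seq, d_model)
        prior = self.prior(x)
        surprise = (posterior - prior).pow(2).sum(-1)  # (batch, seq), squared-distance proxy
        return posterior, prior, surprise


class DynamicLayer(nn.Module):
    """Routes only the K most surprising tokens through the expensive block; the rest
    keep a cheap prior update. K is fixed, so the compute-graph shape never changes."""
    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.25):
        super().__init__()
        self.block = block
        self.prior = nn.Linear(d_model, d_model)
        self.capacity = capacity                  # fraction of tokens given full compute

    def forward(self, x, surprise):
        batch, seq, d_model = x.shape
        k = max(1, int(self.capacity * seq))
        top_idx = surprise.topk(k, dim=-1).indices                  # (batch, k)
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, d_model)  # (batch, k, d_model)
        selected = torch.gather(x, 1, gather_idx)                   # surprising tokens only
        updated = self.block(selected)                              # expensive path on K tokens
        out = self.prior(x)                                         # cheap path everywhere
        out = out.scatter(1, gather_idx, updated)                   # overwrite routed tokens
        return out
```

In this sketch the K selected tokens attend only among themselves inside the block, which is where attention-FLOP and KV-cache savings of the kind reported below would come from; the actual architecture may treat the unselected positions differently.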
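A corresponding sketch of the STT side: a transition network proposes a residual change hypothesis, the temporal router compares it with the realised change, and tokens whose surprise falls below a threshold take the cheap transition update instead of the full block. The hard skip decision is trained with a straight-through estimator, matching the Training item above; which pair of states is compared (here called x_prev and x_curr), the threshold value, and the network shapes are all assumptions.

```python
import torch
import torch.nn as nn

class TemporalRouter(nn.Module):
    """Skips the transformer block for tokens whose change was well predicted by a
    transition network; the hard skip mask is trained with a straight-through estimator."""
    def __init__(self, block: nn.Module, d_model: int, threshold: float = 0.1):
        super().__init__()
        self.block = block                      # full transformer block
        self.transition = nn.Sequential(        # residual "change hypothesis" network
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        self.threshold = threshold              # runtime speed/quality knob

    def forward(self, x_prev, x_curr):
        predicted_change = self.transition(x_prev)                      # hypothesised update
        actual_change = x_curr - x_prev
        surprise = (actual_change - predicted_change).pow(2).mean(-1)   # (batch, seq)

        hard_gate = (surprise > self.threshold).float()                 # 1 = execute block
        soft_gate = torch.sigmoid(surprise - self.threshold)            # differentiable proxy
        gate = hard_gate + soft_gate - soft_gate.detach()               # straight-through estimator

        # At inference, only gated tokens would run the block and cached keys/values
        # would be reused for the rest; both paths are computed here for clarity.
        executed = self.block(x_curr)
        skipped = x_curr + predicted_change                             # cheap transition update
        gate = gate.unsqueeze(-1)                                       # broadcast over features
        return gate * executed + (1.0 - gate) * skipped
```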
Results & Findings
| Model | Compute Reduction | KV-Cache Reduction | Perplexity change (vs. baseline) |
|---|---|---|---|
| Baseline Transformer | – | – | 0.0 % |
| SDT (Depth gating) | ~75 % fewer attention FLOPs | ~50 % fewer KV entries | +2–3 % |
| STT (Timescale gating) | ~70 % fewer attention FLOPs | ~45 % fewer KV entries | +2–4 % |
- Surprise dynamics: Early in training, gating is driven largely by novel tokens; in later epochs the router settles into a pattern where only genuinely surprising inputs trigger the expensive computation.
- Accuracy trade‑off: The modest increase in perplexity demonstrates that a large chunk of the computation can be pruned without dramatically harming language modeling quality.
- Hardware friendliness: Because the compute graph never changes shape, the models run efficiently on GPUs/TPUs without needing custom kernels.
Practical Implications
- Cost‑effective inference: Deployments that serve long documents (e.g., legal contracts, codebases) can cut latency and GPU memory by skipping attention on predictable spans.
- Scalable training: Training massive decoder‑only models on commodity hardware becomes more feasible when each batch spends less time on self‑attention.
- Edge and mobile AI: The static‑graph design means the models can be compiled with existing toolchains (TensorRT, ONNX Runtime) and run on resource‑constrained devices while still handling variable‑length inputs.
- Fine-grained control: Developers can expose the surprise threshold as a runtime knob, trading speed against quality on the fly (e.g., aggressive pruning for batch processing, conservative gating for interactive chat); a hypothetical configuration sketch follows this list.
- Foundation for adaptive APIs: Cloud providers could bill compute based on the actual work performed per request, aligning cost with model effort.
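As an illustration of that runtime knob, here is a hypothetical serving-time configuration that reuses the threshold and capacity attributes from the sketches in the Methodology section; none of these names correspond to a released API.

```python
# Hypothetical serving-time profiles: trade speed for quality by adjusting how
# aggressively the surprise-based routers prune computation.
FAST_BATCH = {"surprise_threshold": 0.30, "router_capacity": 0.10}   # aggressive pruning
INTERACTIVE = {"surprise_threshold": 0.05, "router_capacity": 0.50}  # conservative gating

def configure_routing(model, profile):
    """Apply a routing profile to every gated layer of a (hypothetical) SDT/STT model."""
    for layer in model.modules():
        if hasattr(layer, "threshold"):
            layer.threshold = profile["surprise_threshold"]
        if hasattr(layer, "capacity"):
            layer.capacity = profile["router_capacity"]
```

For example, one might call configure_routing(model, FAST_BATCH) before an offline batch job and configure_routing(model, INTERACTIVE) before serving chat traffic.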
Limitations & Future Work
- Modest accuracy gap: While the compute savings are impressive, the current implementations still incur a few percent perplexity degradation, which may be unacceptable for high‑stakes applications.
- Surprise estimator overhead: Computing Bayesian surprise adds a small constant cost; optimizing this step (e.g., via quantization) is an open challenge.
- Generalization to encoder‑decoder or multimodal models: The paper focuses on decoder‑only language models; extending the gating mechanisms to vision‑language or speech models remains unexplored.
- Dynamic hardware support: Although the graph is static, real‑world gains depend on efficient Top‑K selection and cache management; tighter integration with hardware schedulers could unlock further speedups.
- Long‑term training dynamics: The observed shift from novelty‑driven to prediction‑driven gating warrants deeper theoretical analysis and could inspire curriculum‑learning strategies.
Bottom line: By teaching Transformers to ask “Do I really need to compute this?” and answering with a Bayesian surprise signal, SDT and STT open a promising path toward smarter, cheaper, and more adaptable deep learning models—an exciting development for anyone building large‑scale AI systems.
Authors
- Frederico Wieser
- Martin Benfeghoul
- Haitham Bou Ammar
- Jun Wang
- Zafeirios Fountas
Paper Information
- arXiv ID: 2511.21408v1
- Categories: cs.LG, cs.AI, cs.CL, cs.IT
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.21408v1