[Paper] Transformers with Selective Access to Early Representations
Source: arXiv - 2605.03953v1
Overview
A new family of Transformer models called Selective Access Transformers (SATFormer) shows that letting deeper layers “peek” at the very first‑layer representations—only when it actually helps—can boost performance without the usual memory and speed penalties. By treating early‑representation reuse as a context‑dependent retrieval problem, SATFormer outperforms both vanilla Transformers and earlier static‑residual tricks across a range of model sizes.
Key Contributions
- Selective gating mechanism that dynamically decides, for each token, head, and layer, how much of the first‑layer value matrix (V_1) should be injected.
- Memory‑efficient design: the gate is a lightweight scalar per head/layer, keeping the overall footprint comparable to a standard Transformer.
- Broad empirical gains: consistent improvements in validation loss and zero‑shot accuracy from 130 M to 1.3 B parameters, with the largest gains on retrieval‑heavy benchmarks (about +1.5 % on average).
- Interpretability insights: analysis of the learned gates reveals sparse, depth‑dependent, head‑specific, and task‑category patterns, confirming that the model learns when and where early information is useful.
- Open‑source implementation (GitHub link) that can be dropped into existing Transformer codebases with minimal changes.
Methodology
- Baseline architecture – Start from a standard Transformer (pre‑norm, multi‑head self‑attention, residual connections).
- Preserve the first‑layer value pathway – Keep the value projection from the very first layer, $V_1$, available to all later layers.
- Context‑dependent gate – For each downstream layer $l$ and head $h$, compute a scalar gate $g_{l,h}\in[0,1]$ using a tiny feed‑forward network that takes the current hidden state as input.
- Selective injection – The attention output for layer $l$ becomes
  $$\text{output}_{l} = \text{Attention}(Q_l,K_l,V_l) \;+\; g_{l,h}\,\cdot\, V_1$$
  where the gate can completely shut off the early‑value contribution (0) or let it pass through (1), or anything in between.
- Training – The whole system is trained end‑to‑end with the usual language‑model or classification loss; the gates are learned jointly with the rest of the parameters.
- Efficiency tricks – Gates are broadcast across token positions, so the extra computation is just a few element‑wise multiplications, preserving throughput.
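The steps above can be sketched for a single attention head in dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not specify the gate network, so `head_gate` mean-pools the hidden states and applies one linear unit plus a sigmoid. All function names are hypothetical, and masking and the multi-head split are omitted for brevity.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head (lists of T rows, d columns).

    No causal mask is applied; this is a simplification for the sketch.
    """
    d = len(q[0])
    out = []
    for q_t in q:
        scores = [sum(a * b for a, b in zip(q_t, k_s)) / math.sqrt(d) for k_s in k]
        weights = softmax(scores)
        out.append([sum(w * v_s[j] for w, v_s in zip(weights, v))
                    for j in range(len(v[0]))])
    return out

def head_gate(hidden, w_gate, b_gate):
    """Scalar gate in (0, 1) for one head.

    ASSUMPTION: hidden states are mean-pooled over tokens and fed through a
    single linear unit and a sigmoid. The paper's gate network may differ.
    """
    n_tokens = len(hidden)
    pooled = [sum(h[j] for h in hidden) / n_tokens for j in range(len(hidden[0]))]
    logit = sum(w * p for w, p in zip(w_gate, pooled)) + b_gate
    return 1.0 / (1.0 + math.exp(-logit))

def gated_head_output(q, k, v, v1, gate):
    """Attention(Q, K, V) + gate * V_1, with the scalar gate broadcast over tokens."""
    attn = attention(q, k, v)
    return [[a + gate * e for a, e in zip(a_row, e_row)]
            for a_row, e_row in zip(attn, v1)]

# Tiny example: 2 tokens, head dimension 2 (all values are made up).
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
v1 = [[10.0, 20.0], [30.0, 40.0]]    # cached first-layer values
hidden = [[0.2, -0.1], [0.4, 0.3]]   # current-layer hidden states
g = head_gate(hidden, w_gate=[0.5, 0.5], b_gate=0.0)
print(gated_head_output(q, k, v, v1, g))
```

With `gate = 0.0` the function reduces to ordinary attention, and with `gate = 1.0` the first-layer values are added in full, matching the two extremes described in the selective-injection step.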
Results & Findings
| Model Size | Baseline (val loss) | Static‑Residual (val loss) | SATFormer (val loss) |
|---|---|---|---|
| 130 M | 2.31 | 2.28 | 2.22 |
| 350 M | 2.12 | 2.09 | 2.03 |
| 1.3 B | 1.94 | 1.91 | 1.84 |
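The size of the gains in the table can be checked with a few lines of arithmetic. The sketch below uses only the losses listed above; the helper name `reduction` is illustrative.

```python
# Validation losses copied from the table above.
val_loss = {
    "130M": {"baseline": 2.31, "static": 2.28, "satformer": 2.22},
    "350M": {"baseline": 2.12, "static": 2.09, "satformer": 2.03},
    "1.3B": {"baseline": 1.94, "static": 1.91, "satformer": 1.84},
}

def reduction(reference, candidate):
    """Return (absolute, relative-percent) loss reduction of candidate vs. reference."""
    absolute = reference - candidate
    return absolute, 100.0 * absolute / reference

for size, row in val_loss.items():
    abs_base, rel_base = reduction(row["baseline"], row["satformer"])
    abs_static, rel_static = reduction(row["static"], row["satformer"])
    print(f"{size}: vs baseline -{abs_base:.2f} ({rel_base:.1f} %), "
          f"vs static residual -{abs_static:.2f} ({rel_static:.1f} %)")
```

Relative to the baseline, the reduction works out to roughly 3.9 %, 4.2 %, and 5.2 % for the three sizes, so in these numbers the relative gain does not shrink as the model grows.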
- Zero‑shot accuracy on retrieval‑centric tasks (e.g., MS‑MARCO, Natural Questions) improves by ~1.5 percentage points over static residuals.
- Throughput stays within 2‑3 % of the vanilla Transformer, and GPU memory overhead is negligible (< 5 %).
- Gate analysis shows that early‑layer values are heavily used in the first few attention heads of middle layers for lexical‑heavy tokens, but fade for deeper layers handling higher‑level semantics—exactly the selective behavior the authors hypothesized.
Practical Implications
- Better retrieval‑augmented models – If you’re building a search or QA system that relies on pulling out exact token‑level cues, SATFormer can give you a noticeable accuracy bump without needing extra indexing structures.
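- Plug‑and‑play upgrade – The gating module is a few lines of code, so you can retrofit existing Transformer stacks (BERT, GPT, T5, etc.) with little engineering overhead. Because the gates are learned jointly with the other parameters, a retrofitted model still has to be trained or fine‑tuned to see the gains.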
- Cost‑effective scaling – For large‑scale language models where memory is a premium, SATFormer offers a middle ground between the cheap static residual trick and heavyweight dense retrieval layers.
- Interpretability for debugging – The learned gate patterns can be visualized to understand which parts of the model still depend on low‑level lexical signals, helping with model introspection and bias analysis.
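For the gate-visualization use case, one simple option is a text heat map of the learned gates. The sketch below assumes the gates have already been exported as a layers-by-heads matrix of values in [0, 1]. The function name, the shading characters, and the example values are illustrative, not taken from the paper or its code release.

```python
def gate_heatmap(gates, first_layer=2, shades=" .:-=+*#"):
    """Render a layers x heads gate matrix as text, one row per layer.

    ASSUMPTION: gated layers are numbered from `first_layer`, since layer 1
    is the one that supplies V_1.
    """
    lines = []
    for layer, row in enumerate(gates, start=first_layer):
        # Map each gate in [0, 1] to one of the shading characters.
        cells = "".join(
            shades[min(int(g * len(shades)), len(shades) - 1)] for g in row
        )
        mean = sum(row) / len(row)
        lines.append(f"layer {layer:2d} |{cells}| mean={mean:.2f}")
    return "\n".join(lines)

# Made-up gate values for three layers with four heads each.
example_gates = [
    [0.05, 0.10, 0.00, 0.02],
    [0.90, 0.75, 0.10, 0.00],
    [0.20, 0.05, 0.00, 0.00],
]
print(gate_heatmap(example_gates))
```

Darker characters mark heads that lean more heavily on the first-layer values, which makes sparse or depth-dependent patterns easy to spot at a glance.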
Limitations & Future Work
- Gate granularity – Current gates are shared across all token positions within a layer/head; finer‑grained (per‑token) gating could capture even more nuanced reuse but would increase memory.
- Task scope – The paper focuses on language modeling and retrieval‑heavy benchmarks; it remains to be seen how SATFormer fares on generation‑centric tasks (e.g., summarization, code synthesis).
- Training stability – Some very deep configurations exhibited occasional gate saturation (all‑zeros or all‑ones), requiring careful learning‑rate tuning.
- Future directions suggested include exploring multi‑layer early‑representation pools (not just the first layer), hierarchical gating, and applying the selective‑access idea to vision Transformers where early‑level texture cues may be similarly valuable.
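The gate-saturation issue noted above can be monitored during training with a small check like the following. This is a sketch under assumptions: the gates are available as a layers-by-heads matrix, and the tolerance `eps` and the function name are arbitrary choices rather than anything prescribed by the paper.

```python
def saturation_report(gates, eps=0.01):
    """Fraction of gates stuck near 0 or near 1 in a layers x heads matrix."""
    flat = [g for row in gates for g in row]
    near_zero = sum(g <= eps for g in flat) / len(flat)
    near_one = sum(g >= 1.0 - eps for g in flat) / len(flat)
    return {
        "near_zero": near_zero,
        "near_one": near_one,
        "saturated": near_zero + near_one,
    }

# Made-up gate values: one gate is fully closed and two are (almost) fully open.
report = saturation_report([[0.0, 0.5], [1.0, 0.995]])
print(report)
```

A `saturated` fraction that stays close to 1.0 over many steps would suggest revisiting the learning rate, in line with the tuning the limitation describes.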
Authors
- Skye Gunasekaran
- Téa Wright
- Rui‑Jie Zhu
- Jason Eshraghian
Paper Information
- arXiv ID: 2605.03953v1
- Categories: cs.LG, cs.CL
- Published: May 5, 2026