[Paper] On the 'Induction Bias' in Sequence Models
Source: arXiv - 2602.18333v1
Overview
Transformers have taken the spotlight in natural‑language processing, but recent studies suggest they struggle with state tracking—the ability to keep a consistent internal representation of a sequence as it evolves. This paper dives into that problem inside the training distribution (i.e., not just on out‑of‑distribution tests) and compares transformers to classic recurrent neural networks (RNNs) on how efficiently they learn to track state across varying sequence lengths and state‑space sizes.
Key Contributions
- Large‑scale empirical comparison of transformers vs. RNNs on data‑efficiency for state‑tracking tasks across many supervision regimes.
- Quantitative evidence that transformers need dramatically more training data as the state space or sequence length grows, while RNNs scale much more gracefully.
- Analysis of weight sharing across sequence lengths, showing transformers learn almost length‑specific solutions, whereas RNNs naturally amortize learning across lengths.
- Reusable diagnostic tools (e.g., cross‑length generalization curves, parameter‑reuse metrics) that practitioners can apply to probe their own models.
Methodology
- Synthetic state‑tracking benchmarks – the authors generate controlled tasks where a hidden “state” evolves deterministically (e.g., a counter, a finite‑state machine) and must be inferred from the observed sequence.
- Varying difficulty axes – they systematically increase (a) the size of the hidden state space and (b) the maximum sequence length, creating a grid of N × L conditions.
- Model families – two canonical architectures are evaluated:
  - Transformer encoder (standard multi‑head self‑attention, positional encodings).
  - RNN (GRU/LSTM variants).
- Supervision regimes – from full supervision (state label at every timestep) to sparse supervision (label only at the final step).
- Data‑efficiency measurement – for each condition they train models on progressively larger subsets of the training data and record the smallest dataset size that reaches a predefined accuracy threshold.
- Cross‑length weight‑sharing analysis – after training on a set of lengths, they test the same weights on unseen lengths and compute performance degradation, quantifying how much knowledge transfers across lengths.
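The finite‑state‑machine benchmark described above can be sketched in a few lines. This is a minimal illustration, not the authors' released code: the transition‑table layout, the start state of 0, and the function names are all assumptions.

```python
import random

def make_fsm(num_states, num_symbols, seed=0):
    """Build a random deterministic transition table:
    transition[state][symbol] -> next state."""
    rng = random.Random(seed)
    return [[rng.randrange(num_states) for _ in range(num_symbols)]
            for _ in range(num_states)]

def sample_example(transition, length, num_symbols, seed=None):
    """Sample an input symbol sequence and the per‑step hidden states
    reached from state 0 (the full‑supervision labels)."""
    rng = random.Random(seed)
    state = 0
    symbols, states = [], []
    for _ in range(length):
        sym = rng.randrange(num_symbols)
        state = transition[state][sym]
        symbols.append(sym)
        states.append(state)
    return symbols, states

# One condition in the N x L grid: 4 states, 3 symbols, length 10.
transition = make_fsm(num_states=4, num_symbols=3, seed=1)
xs, ys = sample_example(transition, length=10, num_symbols=3, seed=2)
```

Sweeping `num_states` and `length` reproduces the grid of conditions; for sparse supervision, only the final entry of `ys` would be kept as the label.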
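The data‑efficiency measurement reduces to a search over training‑set sizes. A hedged sketch, with a stand‑in for the real train/evaluate pipeline (the `train_and_eval` callable and the 0.95 threshold are illustrative, not the paper's exact protocol):

```python
def min_data_to_threshold(train_and_eval, sizes, threshold=0.95):
    """Return the smallest training-set size whose trained model reaches
    `threshold` held-out accuracy, or None if no size does.
    train_and_eval: callable mapping a dataset size to held-out accuracy."""
    for n in sorted(sizes):
        if train_and_eval(n) >= threshold:
            return n
    return None

# Hypothetical stand-in: accuracy grows linearly with dataset size.
def fake_accuracy(n):
    return min(1.0, n / 5000)

needed = min_data_to_threshold(fake_accuracy, [1000, 2000, 4000, 8000])
```

In the paper's setup this scalar, recorded per (state‑space size, length) cell, is what grows super‑linearly for transformers and sub‑linearly for RNNs.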
Results & Findings
| Aspect | Transformers | RNNs |
|---|---|---|
| Training data needed | Grows super‑linearly with state‑space size and length; e.g., 10× more data when state space doubles. | Grows sub‑linearly; often only a modest increase (≈1.2×) for the same change. |
| Cross‑length generalization | Near‑zero transfer; models trained on length = 10 perform poorly on length = 20 unless retrained. | Strong transfer; a model trained on short sequences improves performance on longer ones without additional training. |
| Weight sharing | Negligible; attention heads learn length‑specific patterns, sometimes even harming performance on other lengths. | Intrinsic sharing via recurrent weights; the same transition matrix is reused across all timesteps. |
| Effect of supervision sparsity | Data‑efficiency gap widens under sparse supervision. | RNNs remain relatively robust. |
In short, even when the test distribution matches the training distribution, transformers exhibit a fundamental inefficiency in learning to track state, relying on memorizing length‑specific tricks rather than building a unified, amortized representation.
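The cross‑length transfer gap in the table can be summarized with a single diagnostic number. A minimal sketch of such a parameter‑reuse metric, assuming you can evaluate one trained model at arbitrary lengths (the function name and the example accuracies are hypothetical):

```python
def cross_length_degradation(accuracy_by_length, trained_lengths):
    """Mean accuracy on trained lengths minus mean accuracy on unseen
    lengths; near 0 indicates strong transfer, large values indicate
    length-specific solutions. accuracy_by_length: {length: accuracy}."""
    trained = [accuracy_by_length[L] for L in trained_lengths]
    unseen = [a for L, a in accuracy_by_length.items()
              if L not in trained_lengths]
    if not trained or not unseen:
        raise ValueError("need both trained and unseen lengths")
    return sum(trained) / len(trained) - sum(unseen) / len(unseen)

# Hypothetical accuracies: trained on lengths 5 and 10, probed at 20 and 40.
acc = {5: 0.99, 10: 0.97, 20: 0.55, 40: 0.31}
gap = cross_length_degradation(acc, trained_lengths=[5, 10])  # 0.55
```

Under the paper's findings, a transformer would show a large gap here while an RNN's would stay close to zero.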
Practical Implications
- Model selection for sequential reasoning – For tasks that require explicit state tracking (e.g., parsing, program execution, dialogue state management), RNN‑style recurrence may still be the more data‑efficient choice, especially when training data is limited.
- Designing better transformers – The findings motivate architectural tweaks that encourage length‑agnostic representations, such as incorporating recurrence, relative positional encodings, or explicit memory modules.
- Curriculum learning – Since transformers struggle to share knowledge across lengths, a curriculum that gradually increases sequence length could mitigate data‑efficiency issues.
- Benchmarking – Developers should augment standard NLP benchmarks with controlled state‑tracking probes to catch hidden weaknesses that OOD tests might miss.
- Resource budgeting – When planning large‑scale pre‑training, expect transformers to need substantially more examples if the downstream task involves long‑range state dependencies (e.g., code generation, long document summarization).
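The curriculum idea above amounts to a simple length schedule. A sketch of one possible schedule (the doubling policy and parameter names are illustrative; the paper does not prescribe a specific curriculum):

```python
def length_curriculum(start_len, max_len, steps_per_stage, growth=2):
    """Yield (training_step, sequence_length) pairs, multiplying the
    sequence length by `growth` every `steps_per_stage` steps."""
    step, length = 0, start_len
    while length <= max_len:
        for _ in range(steps_per_stage):
            yield step, length
            step += 1
        length *= growth

# Train 1000 steps each at lengths 4, 8, 16, 32.
schedule = length_curriculum(start_len=4, max_len=32, steps_per_stage=1000)
```

Each stage would draw training sequences of the scheduled length, letting the model consolidate short‑range state tracking before longer dependencies are introduced.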
Limitations & Future Work
- Synthetic tasks – While they provide clean insight, real‑world data may contain additional structure that helps transformers generalize.
- Model variants – Only vanilla transformers and standard GRU/LSTM cells were examined; newer architectures (e.g., Performer, Recurrent Transformers) could behave differently.
- Scale – Experiments were run on moderate model sizes; it remains open whether scaling up (more layers, larger hidden dimensions) alleviates the data‑efficiency gap.
- Theoretical analysis – The paper offers empirical evidence but stops short of a formal characterization of the “induction bias” that causes length‑specific learning.
Future research could explore hybrid models that blend attention with recurrence, investigate alternative positional encodings that promote length invariance, and extend the analysis to real‑world sequential tasks such as code synthesis or multi‑turn dialogue.
Authors
- M. Reza Ebrahimi
- Michaël Defferrard
- Sunny Panchal
- Roland Memisevic
Paper Information
- arXiv ID: 2602.18333v1
- Categories: cs.LG, cs.CL
- Published: February 20, 2026