[Paper] On the 'Induction Bias' in Sequence Models
Source: arXiv - 2602.18333v1
Overview
Transformers have taken the spotlight in natural‑language processing, but recent studies suggest they struggle with state tracking—the ability to keep a consistent internal representation of a sequence as it evolves. This paper dives into that problem inside the training distribution (i.e., not just on out‑of‑distribution tests) and compares transformers to classic recurrent neural networks (RNNs) on how efficiently they learn to track state across varying sequence lengths and state‑space sizes.
Key Contributions
- Large‑scale empirical comparison of transformers vs. RNNs on data‑efficiency for state‑tracking tasks across many supervision regimes.
- Quantitative evidence that transformers need dramatically more training data as the state space or sequence length grows, while RNNs scale much more gracefully.
- Analysis of weight sharing across sequence lengths, showing transformers learn almost length‑specific solutions, whereas RNNs naturally amortize learning across lengths.
- Reusable diagnostic tools (e.g., cross‑length generalization curves, parameter‑reuse metrics) that practitioners can apply to probe their own models.
Methodology
- Synthetic state‑tracking benchmarks – the authors generate controlled tasks where a hidden “state” evolves deterministically (e.g., a counter, a finite‑state machine) and must be inferred from the observed sequence.
- Varying difficulty axes – they systematically increase (a) the size of the hidden state space and (b) the maximum sequence length, creating a grid of N × L conditions.
- Model families – two canonical architectures are evaluated:
  - Transformer encoder (standard multi‑head self‑attention, positional encodings).
  - RNN (GRU/LSTM variants).
- Supervision regimes – from full supervision (state label at every timestep) to sparse supervision (label only at the final step).
- Data‑efficiency measurement – for each condition they train models on progressively larger subsets of the training data and record the smallest dataset size that reaches a predefined accuracy threshold.
- Cross‑length weight‑sharing analysis – after training on a set of lengths, they test the same weights on unseen lengths and compute performance degradation, quantifying how much knowledge transfers across lengths.
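The finite‑state‑machine benchmark described above can be sketched in a few lines. This is a minimal illustration, not the authors' released code: the transition‑table layout, the start state of 0, and the function names are all assumptions.

```python
import random

def make_fsm(num_states, num_symbols, seed=0):
    """Build a random deterministic transition table:
    transition[state][symbol] -> next state."""
    rng = random.Random(seed)
    return [[rng.randrange(num_states) for _ in range(num_symbols)]
            for _ in range(num_states)]

def sample_example(transition, length, num_symbols, seed=None):
    """Sample an input symbol sequence and the per‑step hidden states
    reached from state 0 (the full‑supervision labels)."""
    rng = random.Random(seed)
    state = 0
    symbols, states = [], []
    for _ in range(length):
        sym = rng.randrange(num_symbols)
        state = transition[state][sym]
        symbols.append(sym)
        states.append(state)
    return symbols, states

# One condition in the N x L grid: 4 states, 3 symbols, length 10.
transition = make_fsm(num_states=4, num_symbols=3, seed=1)
xs, ys = sample_example(transition, length=10, num_symbols=3, seed=2)
```

Sweeping `num_states` and `length` reproduces the grid of conditions; for sparse supervision, only the final entry of `ys` would be kept as the label.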
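The data‑efficiency measurement reduces to a search over training‑set sizes. A hedged sketch, with a stand‑in for the real train/evaluate pipeline (the `train_and_eval` callable and the 0.95 threshold are illustrative, not the paper's exact protocol):

```python
def min_data_to_threshold(train_and_eval, sizes, threshold=0.95):
    """Return the smallest training-set size whose trained model reaches
    `threshold` held-out accuracy, or None if no size does.
    train_and_eval: callable mapping a dataset size to held-out accuracy."""
    for n in sorted(sizes):
        if train_and_eval(n) >= threshold:
            return n
    return None

# Hypothetical stand-in: accuracy grows linearly with dataset size.
def fake_accuracy(n):
    return min(1.0, n / 5000)

needed = min_data_to_threshold(fake_accuracy, [1000, 2000, 4000, 8000])
```

In the paper's setup this scalar, recorded per (state‑space size, length) cell, is what grows super‑linearly for transformers and sub‑linearly for RNNs.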
Results & Findings
| Aspect | Transformers | RNNs |
|---|---|---|
| Training data needed | Grows super‑linearly with state‑space size and length; e.g., 10× more data when state space doubles. | Grows sub‑linearly; often only a modest increase (≈1.2×) for the same change. |
| Cross‑length generalization | Near‑zero transfer; models trained on length = 10 perform poorly on length = 20 unless retrained. | Strong transfer; a model trained on short sequences improves performance on longer ones without additional training. |
| Weight sharing | Negligible; attention heads learn length‑specific patterns, sometimes even harming performance on other lengths. | Intrinsic sharing via recurrent weights; the same transition matrix is reused across all timesteps. |
| Effect of supervision sparsity | Data‑efficiency gap widens under sparse supervision. | RNNs remain relatively robust. |
In short, even when the test distribution matches the training distribution, transformers exhibit a fundamental inefficiency in learning to track state, relying on memorizing length‑specific tricks rather than building a unified, amortized representation.
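The cross‑length transfer gap in the table can be summarized with a single diagnostic number. A minimal sketch of such a parameter‑reuse metric, assuming you can evaluate one trained model at arbitrary lengths (the function name and the example accuracies are hypothetical):

```python
def cross_length_degradation(accuracy_by_length, trained_lengths):
    """Mean accuracy on trained lengths minus mean accuracy on unseen
    lengths; near 0 indicates strong transfer, large values indicate
    length-specific solutions. accuracy_by_length: {length: accuracy}."""
    trained = [accuracy_by_length[L] for L in trained_lengths]
    unseen = [a for L, a in accuracy_by_length.items()
              if L not in trained_lengths]
    if not trained or not unseen:
        raise ValueError("need both trained and unseen lengths")
    return sum(trained) / len(trained) - sum(unseen) / len(unseen)

# Hypothetical accuracies: trained on lengths 5 and 10, probed at 20 and 40.
acc = {5: 0.99, 10: 0.97, 20: 0.55, 40: 0.31}
gap = cross_length_degradation(acc, trained_lengths=[5, 10])  # 0.55
```

Under the paper's findings, a transformer would show a large gap here while an RNN's would stay close to zero.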
Practical Implications
- Model selection for sequential reasoning – For tasks that require explicit state tracking (e.g., parsing, program execution, dialogue state management), RNN‑style recurrence may still be the more data‑efficient choice, especially when training data is limited.
- Designing better transformers – The findings motivate architectural tweaks that encourage length‑agnostic representations, such as incorporating recurrence, relative positional encodings, or explicit memory modules.
- Curriculum learning – Since transformers struggle to share knowledge across lengths, a curriculum that gradually increases sequence length could mitigate data‑efficiency issues.
- Benchmarking – Developers should augment standard NLP benchmarks with controlled state‑tracking probes to catch hidden weaknesses that OOD tests might miss.
- Resource budgeting – When planning large‑scale pre‑training, expect transformers to need substantially more examples if the downstream task involves long‑range state dependencies (e.g., code generation, long document summarization).
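The curriculum idea above amounts to a simple length schedule. A sketch of one possible schedule (the doubling policy and parameter names are illustrative; the paper does not prescribe a specific curriculum):

```python
def length_curriculum(start_len, max_len, steps_per_stage, growth=2):
    """Yield (training_step, sequence_length) pairs, multiplying the
    sequence length by `growth` every `steps_per_stage` steps."""
    step, length = 0, start_len
    while length <= max_len:
        for _ in range(steps_per_stage):
            yield step, length
            step += 1
        length *= growth

# Train 1000 steps each at lengths 4, 8, 16, 32.
schedule = length_curriculum(start_len=4, max_len=32, steps_per_stage=1000)
```

Each stage would draw training sequences of the scheduled length, letting the model consolidate short‑range state tracking before longer dependencies are introduced.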
Limitations & Future Work
- Synthetic tasks – While they provide clean insight, real‑world data may contain additional structure that helps transformers generalize.
- Model variants – Only vanilla transformers and standard GRU/LSTM cells were examined; newer architectures (e.g., Performer, Recurrent Transformers) could behave differently.
- Scale – Experiments were run on moderate model sizes; it remains open whether scaling up (more layers, larger hidden dimensions) alleviates the data‑efficiency gap.
- Theoretical analysis – The paper offers empirical evidence but stops short of a formal characterization of the “induction bias” that causes length‑specific learning.
Future research could explore hybrid models that blend attention with recurrence, investigate alternative positional encodings that promote length invariance, and extend the analysis to real‑world sequential tasks such as code synthesis or multi‑turn dialogue.
Authors
- M. Reza Ebrahimi
- Michaël Defferrard
- Sunny Panchal
- Roland Memisevic
Paper Information
- arXiv ID: 2602.18333v1
- Categories: cs.LG, cs.CL
- Published: February 20, 2026