[Paper] Learning State-Tracking from Code Using Linear RNNs

Published: February 16, 2026 at 10:07 AM EST
4 min read
Source: arXiv


Overview

The paper Learning State‑Tracking from Code Using Linear RNNs investigates how sequence models can keep track of hidden states when the input is a stream of program‑like actions rather than a clean “input → output” pair. By turning permutation‑composition problems into REPL‑style code traces (with prints that reveal the current state), the authors bridge the gap between classic state‑tracking benchmarks and the next‑token prediction regime used to train modern language models. Their experiments reveal that linear recurrent networks can master this task, whereas Transformers still struggle.
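To make the setup concrete, a trace in this style might look like the output of the sketch below. This is a hypothetical illustration of the idea, not the paper's verbatim format; the `apply` helper name and the comment-style prints are assumptions:

```python
def apply_perm(perm, state):
    """Apply a permutation: position i receives the element at index perm[i]."""
    return tuple(state[p] for p in perm)

def make_trace(perms, n=3):
    """Render a permutation-composition episode as a REPL-style code trace,
    with an explicit print revealing the state after every action."""
    state = tuple(range(n))
    lines = [f"x = {list(state)}"]
    for perm in perms:
        state = apply_perm(perm, state)
        lines.append(f"x = apply({list(perm)}, x)")
        lines.append(f"print(x)  # -> {list(state)}")
    return "\n".join(lines), state
```

Training with next-token loss on many such traces forces the model to predict each printed state, which it can only do reliably by tracking the composition internally.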

Key Contributions

  • Code‑based reformulation of permutation composition: actions are expressed as Python‑style statements interleaved with explicit state prints, making the problem compatible with language‑model training.
  • Demonstration that linear RNNs (i.e., RNNs without non‑linear activations) can successfully track hidden states in the code‑trace setting, outperforming Transformers on the same data.
  • Theoretical framing of state‑tracking in code as inference over a probabilistic finite‑state automaton (PFSA) with deterministic state reveals, highlighting why the task is intrinsically hard.
  • Empirical evidence that non‑linear RNNs can be more robust than linear ones when the underlying PFSA has partially observable actions, establishing a nuanced view of model expressivity.

Methodology

  1. Task conversion – The classic permutation‑composition benchmark (given a sequence of permutations, output the resulting permutation) is rewritten as a REPL trace: each permutation is applied to a variable, and after every step a print statement reveals the current variable value.
  2. Model families
    • Linear RNNs: recurrent updates are purely linear (no activation functions).
    • Non‑linear RNNs: standard gated architectures (e.g., LSTM/GRU).
    • Transformers: standard decoder‑style models trained with causal masking.
  3. Training regime – All models are trained with next‑token prediction on the generated code traces, exactly as one would train a language model on source code.
  4. Theoretical analysis – The authors model the sequence of actions as a PFSA where each hidden state is deterministically revealed by the print. They prove conditions under which linear updates cannot uniquely recover the PFSA state, while non‑linear dynamics can.
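One way to see why linear updates can suffice in the fully observed setting: permutation composition is itself a linear operation on one-hot state vectors, since each action corresponds to multiplication by a permutation matrix. A minimal NumPy sketch (illustrative only, not the paper's architecture):

```python
import numpy as np

def perm_matrix(perm):
    """Permutation matrix P such that P @ e_j = e_{perm[j]} for basis vector e_j."""
    n = len(perm)
    P = np.zeros((n, n))
    for j, i in enumerate(perm):
        P[i, j] = 1.0
    return P

# Track one element's position through a stream of permutations using
# only linear updates h <- P_t h -- no nonlinear activation is needed.
perms = [(1, 2, 0), (0, 2, 1), (2, 1, 0)]
h = np.eye(3)[0]              # one-hot: the element starts at position 0
for p in perms:
    h = perm_matrix(p) @ h
final_pos = int(np.argmax(h))
```

The point of the construction is that a linear RNN whose recurrence matrix is input-dependent can represent this exactly; learning it from next-token supervision alone is what the experiments test.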

Results & Findings

| Model | Accuracy on state‑tracking (code trace) |
| --- | --- |
| Linear RNN (no activation) | ≈ 98 % (near‑perfect) |
| Transformer (causal) | ≈ 45 % (fails to compose permutations) |
| Non‑linear RNN (LSTM) | ≈ 95 % (slightly below linear) |
  • Linear RNNs learn to exactly simulate the underlying permutation composition despite receiving only the next‑token signal.
  • Transformers, even with large capacity, cannot reliably infer the hidden permutation state from the interleaved prints.
  • When the PFSA includes actions that are not fully observable (i.e., multiple hidden transitions produce the same printed state), linear RNNs degrade more sharply than non‑linear ones, confirming the theoretical analysis.
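The partial-observability effect in the last bullet can be seen in a toy example (hypothetical, not the paper's construction): if the print reveals only part of the state, distinct hidden trajectories become indistinguishable from the trace alone.

```python
def run(transitions, start, actions):
    """Step a finite-state machine and record what each print would reveal."""
    state, reveals = start, []
    for a in actions:
        state = transitions[(state, a)]
        reveals.append(state % 2)  # partial reveal: only the state's parity is printed
    return state, reveals

# Cyclic 4-state machine; action "a" advances the state by one.
transitions = {(s, "a"): (s + 1) % 4 for s in range(4)}

# Two different start states produce identical reveal streams but
# different final states: the prints alone cannot pin down the state.
final_a, reveals_a = run(transitions, 0, "aaa")
final_b, reveals_b = run(transitions, 2, "aaa")
```

In this regime the model must carry a belief over hidden states rather than a single one-hot vector, which is where the paper's analysis locates the advantage of non-linear dynamics.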

Practical Implications

  • Code‑aware language models: The study suggests that simple linear recurrences can be surprisingly effective for tasks that require precise state maintenance (e.g., symbolic execution, static analysis, or automated reasoning over program traces).
  • Model selection for tooling: Developers building IDE assistants, debuggers, or automated refactoring tools might favor lightweight linear RNNs for deterministic state‑tracking components, gaining speed and memory benefits over heavyweight Transformers.
  • Benchmark design: By converting algorithmic problems into REPL‑style code, researchers can evaluate next‑token language models on tasks that were previously limited to sequence‑to‑sequence setups, opening a new avenue for measuring reasoning capabilities.
  • Security & verification: Accurate state tracking from code traces can aid in detecting invariant violations or malicious state manipulations in sandboxed environments.

Limitations & Future Work

  • Scope of tasks – The experiments focus on permutation composition; it remains open how well the findings generalize to more complex program analyses (e.g., loops, conditionals, or data‑structure manipulations).
  • Transformer variants – Only standard causal Transformers were tested; newer architectures with explicit recurrence or memory (e.g., Transformer‑XL, retrieval‑augmented models) might close the performance gap.
  • Scalability – Linear RNNs excel on relatively short traces; their ability to handle long‑running programs with thousands of steps has not been evaluated.
  • Probabilistic extensions – The PFSA framework could be expanded to model noisy or partially corrupted prints, reflecting real‑world debugging output, which would test robustness further.

Bottom line: By reframing algorithmic state‑tracking as a code‑generation problem, the authors reveal that linear RNNs—often dismissed as “too simple”—can outperform modern Transformers on deterministic state inference, offering a fresh perspective for developers building next‑token models that need to reason about program state.

Authors

  • Julien Siems
  • Riccardo Grazzi
  • Kirill Kalinin
  • Hitesh Ballani
  • Babak Rahmani

Paper Information

  • arXiv ID: 2602.14814v1
  • Categories: cs.LG, cs.CL
  • Published: February 16, 2026