[Paper] Pretraining Recurrent Networks without Recurrence

Published: June 4, 2026 at 01:57 PM EDT

Source: arXiv - 2606.06479v1

Overview

Training recurrent neural networks (RNNs) requires assigning credit across long sequences of computations. Standard backpropagation through time (BPTT) addresses this problem poorly: it is sequential in time, limiting parallelism, and suffers from vanishing or exploding gradients, making long‑range associations difficult to learn.
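To make the gradient problem concrete, here is the standard BPTT chain rule, written as a sketch in notation that is an assumption here rather than taken from the paper: the memory update is \(m_k = f(m_{k-1}, x_k)\), \(\mathcal{L}_T\) is a loss at step \(T\), and \(\gamma\) is a bound on the Jacobian norms.

```latex
\frac{\partial \mathcal{L}_T}{\partial m_t}
  = \frac{\partial \mathcal{L}_T}{\partial m_T}
    \prod_{k=t+1}^{T} \frac{\partial m_k}{\partial m_{k-1}},
\qquad
\left\lVert \prod_{k=t+1}^{T} \frac{\partial m_k}{\partial m_{k-1}} \right\rVert
  \le \prod_{k=t+1}^{T} \left\lVert \frac{\partial m_k}{\partial m_{k-1}} \right\rVert
  \le \gamma^{\,T-t}
\quad \text{if } \left\lVert \frac{\partial m_k}{\partial m_{k-1}} \right\rVert \le \gamma \text{ for all } k .
```

The product contains \(T - t\) Jacobians (later steps on the left), so the gradient path grows with the distance between the two tokens. With \(\gamma < 1\) the bound shrinks geometrically (vanishing gradients); when the Jacobian norms exceed 1 the product can instead grow geometrically (exploding gradients). Each factor also depends on the previous memory, which is why the computation is sequential in time.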

We propose Supervised Memory Training (SMT), a method for training nonlinear RNNs that sidesteps recurrent credit propagation entirely by reducing RNN training to supervised learning on one‑step memory transition labels \((m_t, x_{t+1}) \rightarrow m_{t+1}\).

SMT acquires these memory labels by training a Transformer‑based encoder on a predictive state objective, retaining only the information from the past that is necessary to predict the future. By decoupling what to remember from how to update memory, SMT enables time‑parallel RNN training with a stable \(O(1)\)‑length gradient path between any two tokens, without ever unrolling the RNN. Experiments show that SMT outperforms BPTT when pretraining various RNN architectures on tasks like language modeling and pixel sequence modeling. SMT enables nonlinear RNNs to better capture long‑range dependencies and train in parallel, potentially unlocking the scaling of models that build temporal abstractions of past experience.
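The reduction to one-step supervised learning can be illustrated with a small NumPy sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the tanh cell, the mean-squared-error loss, and the function names (`rnn_cell`, `transition_pairs`, `one_step_loss`, `rollout`) are hypothetical. In particular, the memory labels here come from rolling out a fixed "teacher" cell, which merely stands in for the Transformer-based encoder described in the abstract.

```python
import numpy as np

def rnn_cell(params, m, x):
    """One nonlinear memory update (assumed form): tanh(m @ W_m + x @ W_x + b)."""
    W_m, W_x, b = params
    return np.tanh(m @ W_m + x @ W_x + b)

def transition_pairs(memory_labels, inputs):
    """Split labels into ((m_t, x_{t+1}), m_{t+1}) pairs.

    memory_labels: (T + 1, d_m) array holding m_0 .. m_T
    inputs:        (T, d_x) array holding x_1 .. x_T
    """
    m_prev = memory_labels[:-1]  # m_0 .. m_{T-1}
    m_next = memory_labels[1:]   # m_1 .. m_T
    return m_prev, inputs, m_next

def one_step_loss(params, memory_labels, inputs):
    """MSE over all T transitions, computed in one batched call (no unrolling)."""
    m_prev, x_next, m_next = transition_pairs(memory_labels, inputs)
    pred = rnn_cell(params, m_prev, x_next)  # every time step evaluated at once
    return float(np.mean((pred - m_next) ** 2))

def rollout(params, m0, inputs):
    """Sequential rollout; in this sketch it is only used to fabricate labels."""
    memories = [m0]
    for x in inputs:
        memories.append(rnn_cell(params, memories[-1], x))
    return np.stack(memories)

# Toy data: a random teacher cell produces stand-in memory labels.
rng = np.random.default_rng(0)
T, d_m, d_x = 16, 4, 3
teacher = (0.5 * rng.normal(size=(d_m, d_m)),
           rng.normal(size=(d_x, d_m)),
           np.zeros(d_m))
xs = rng.normal(size=(T, d_x))
labels = rollout(teacher, np.zeros(d_m), xs)

# An untrained student cell (all-zero weights) is scored on all transitions at once.
student = (np.zeros((d_m, d_m)), np.zeros((d_x, d_m)), np.zeros(d_m))
teacher_loss = one_step_loss(teacher, labels, xs)
student_loss = one_step_loss(student, labels, xs)
```

Because each prediction depends only on a label \(m_t\) and an input \(x_{t+1}\), not on the student's own earlier outputs, the loss for all \(T\) steps is a single batched computation and the gradient with respect to the cell parameters passes through exactly one cell application per transition. The teacher cell reproduces its own labels, so `teacher_loss` is zero up to floating-point error, while `student_loss` is positive.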

Key Contributions

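Supervised Memory Training (SMT), which reduces nonlinear RNN training to supervised learning on one‑step memory transition labels.
Memory labels obtained from a Transformer‑based encoder trained on a predictive state objective.
Time‑parallel RNN training with an \(O(1)\)‑length gradient path between any two tokens, without unrolling the RNN.
Reported gains over BPTT when pretraining several RNN architectures on language modeling and pixel sequence modeling.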

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

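According to the abstract, SMT lets nonlinear RNNs be pretrained in parallel across time and capture long‑range dependencies better than BPTT‑trained counterparts, which the authors suggest could unlock the scaling of models that build temporal abstractions of past experience.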

Authors

Akarsh Kumar
Phillip Isola

Paper Information

arXiv ID: 2606.06479v1
Categories: cs.LG, cs.AI
Published: June 4, 2026
