[Paper] Reinforced Fast Weights with Next-Sequence Prediction
Source: arXiv - 2602.16704v1
Overview
The paper introduces REFINE, a reinforcement‑learning (RL) framework that teaches fast‑weight language models to predict sequences of tokens rather than a single next token. By shifting from the conventional next‑token prediction (NTP) to a next‑sequence prediction (NSP) objective, REFINE helps fast‑weight architectures capture long‑range dependencies more reliably, closing the performance gap with attention‑based Transformers on tasks that require very long context windows.
Key Contributions
- NSP‑driven training for fast weights – Proposes a reinforcement‑learning pipeline that optimizes fast‑weight models on multi‑token rollouts, encouraging coherent long‑range representations.
- Entropy‑based token selection – Uses prediction entropy to pick “informative” positions in a context, focusing the RL signal where the model is most uncertain.
- Group Relative Policy Optimization (GRPO) – Adapts GRPO, a stable policy‑gradient algorithm, to the grouped rollout structure of fast‑weight networks.
- Universal applicability – Demonstrates that REFINE can be injected at any stage of a model’s lifecycle: mid‑training, post‑training fine‑tuning, or even test‑time adaptation.
- Empirical gains across benchmarks – Shows consistent improvements on needle‑in‑a‑haystack retrieval, long‑context QA, and the comprehensive LongBench suite using two large fast‑weight backbones (LaCT‑760M and DeltaNet‑1.3B).
Methodology
- Fast‑weight backbone – The base model updates a set of “fast weights” on‑the‑fly as it reads tokens, allowing it to store contextual information with constant memory overhead.
- Entropy‑guided sampling – For a given input prefix, the model computes the entropy of its token‑level predictions. Positions with high entropy are flagged as informative because the model is unsure about them.
- Multi‑token rollouts – Starting from each selected position, the model generates a short rollout (e.g., 5–10 tokens) using its current fast‑weight dynamics.
- Self‑supervised rewards – After a rollout, a reward is computed by comparing the generated sequence to the ground‑truth continuation (e.g., using BLEU‑like n‑gram overlap or a learned similarity scorer). This reward reflects how well the model preserved semantic coherence across the whole rollout.
- GRPO optimization – The policy (the fast‑weight update rule) is updated with Group Relative Policy Optimization, a PPO variant that samples a group of rollouts per position and normalizes each rollout's advantage against the group's mean reward, removing the need for a learned value function and stabilizing training.
- Training regimes – REFINE can be applied at three stages:
  - Mid‑training – as an auxiliary objective alongside standard NTP.
  - Post‑training – fine‑tuning a pre‑trained fast‑weight model.
  - Test‑time – performing a few RL updates on the specific input batch before inference.
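The entropy‑guided sampling and reward steps above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the top‑k selection heuristic, and the bigram‑precision reward are all assumptions chosen to mirror the description.

```python
import math
from collections import Counter

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_informative_positions(per_position_probs, top_k=4):
    """Pick the top-k highest-entropy positions in a prefix -- the
    positions where the model is most uncertain, per the paper's heuristic."""
    entropies = [token_entropy(p) for p in per_position_probs]
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:top_k])

def ngram_overlap_reward(rollout, reference, n=2):
    """BLEU-like clipped n-gram precision between a rollout and the
    ground-truth continuation; one plausible self-supervised reward."""
    if len(rollout) < n:
        return 0.0
    roll = Counter(tuple(rollout[i:i + n]) for i in range(len(rollout) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(c, ref[g]) for g, c in roll.items())
    return overlap / max(1, sum(roll.values()))
```

In this sketch a uniform distribution (maximum entropy) would always be selected first, and a rollout that matches the ground‑truth continuation n‑gram for n‑gram earns a reward of 1.0.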
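The group‑relative normalization at the heart of GRPO can likewise be sketched. Assuming the standard formulation (group mean as baseline, group standard deviation as scale, PPO‑style clipped ratios), a toy version looks like this; the `clip` value and helper names are illustrative, not taken from the paper.

```python
import math
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its group's mean and std.
    The group baseline replaces the learned value function of vanilla PPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def grpo_surrogate_terms(logprobs_new, logprobs_old, advantages, clip=0.2):
    """Per-rollout clipped surrogate terms (PPO-style). The training loss
    is the negative mean of these terms over the group."""
    terms = []
    for lp_new, lp_old, adv in zip(logprobs_new, logprobs_old, advantages):
        ratio = math.exp(lp_new - lp_old)            # importance ratio
        clipped = max(min(ratio, 1.0 + clip), 1.0 - clip)
        terms.append(min(ratio * adv, clipped * adv))
    return terms
```

Because advantages are centered within each group, rollouts are only rewarded for beating their sibling rollouts from the same starting position, which is what keeps the gradient signal low‑variance.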
Results & Findings
| Model / Task | Baseline (NTP) | REFINE (NSP) | Δ (pts) |
|---|---|---|---|
| LaCT‑760M (LongBench avg.) | 45.2 % | 52.8 % | +7.6 |
| DeltaNet‑1.3B (LongBench avg.) | 48.7 % | 55.9 % | +7.2 |
| Needle‑in‑a‑haystack (retrieval) | 31.4 % | 38.9 % | +7.5 |
| Long‑context QA (TriviaQA‑long) | 62.1 % | 70.4 % | +8.3 |
- Consistent gains across all evaluated tasks, with the largest improvements on tasks that demand maintaining coherence over hundreds to thousands of tokens.
- Test‑time adaptation yields a modest but measurable boost (≈1–2 points absolute) without any extra labeled data, highlighting REFINE's flexibility.
- Training stability: GRPO prevents the high variance typical of RL in language modeling, achieving convergence within a few hundred thousand steps—comparable to standard supervised fine‑tuning budgets.
Practical Implications
- Memory‑efficient long‑context models – Developers building on edge devices or serving massive numbers of requests can now consider fast‑weight architectures as a viable alternative to full‑attention Transformers, gaining constant‑memory scaling while retaining strong performance.
- Plug‑and‑play improvement – REFINE can be added to existing fast‑weight pipelines without redesigning the model architecture, making it attractive for teams that already have LaCT or DeltaNet‑style models in production.
- Few‑shot adaptation – The test‑time mode enables on‑the‑fly fine‑tuning for domain‑specific documents (e.g., legal contracts, scientific papers) without needing a separate fine‑tuning dataset.
- Better retrieval systems – Needle‑in‑a‑haystack gains translate directly to more accurate semantic search over large corpora, useful for knowledge‑base assistants and code‑search tools.
Limitations & Future Work
- RL overhead – Although GRPO is efficient, the entropy‑based rollout step adds extra compute compared to pure NTP training; scaling to multi‑billion‑parameter fast‑weight models may require further optimization.
- Reward design – The current self‑supervised reward relies on surface‑level n‑gram overlap; richer semantic rewards (e.g., using a learned evaluator) could improve alignment with downstream tasks.
- Generalization beyond fast weights – Applying REFINE to standard Transformer models is non‑trivial because the “fast‑weight” update mechanism is central to the rollout semantics. Extending the idea to hybrid architectures is an open research direction.
- Ablation depth – While the paper presents several ablations, deeper analysis of how rollout length and entropy thresholds affect different language domains would help practitioners fine‑tune the method for specific use cases.
Authors
- Hee Seung Hwang
- Xindi Wu
- Sanghyuk Chun
- Olga Russakovsky
Paper Information
- arXiv ID: 2602.16704v1
- Categories: cs.CL
- Published: February 18, 2026