[Paper] Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity
Source: arXiv - 2604.20789v1
Overview
The paper explores how adding human‑like working‑memory limits to Transformer language models can make them learn more efficiently when data is scarce. By tweaking the attention mechanism to mimic fixed‑size windows or temporal decay—behaviors observed in human reading—the authors show that even modestly sized GPT‑2‑style models can achieve better grammatical performance and align more closely with human reading‑time patterns.
Key Contributions
- Cognitively‑inspired attention variants: Introduces fixed‑width window attention and temporal‑decay attention as drop‑in replacements for the standard softmax attention in Transformers.
- Data‑efficient training regime: Trains GPT‑2‑style models from scratch on developmentally plausible corpora (10 M and 100 M tokens) rather than the billions of tokens typical in industry‑scale pre‑training.
- Comprehensive evaluation: Benchmarks on the BLiMP suite (grammatical judgment) and correlates model predictions with human reading‑time data to assess cognitive plausibility.
- Empirical evidence of inductive bias: Demonstrates that fixed‑width attention yields significant gains in grammatical accuracy under low‑resource conditions and improves alignment with human processing metrics.
- Open‑source implementation: Provides code and pretrained checkpoints, enabling reproducibility and easy experimentation for the community.
Methodology
- Model Architecture – Starts from the vanilla GPT‑2 decoder stack (12 layers, 768 hidden units). The only change is the attention scoring function:
- Fixed‑width window: Each token attends only to the k most recent tokens (e.g., k = 64), emulating a limited working‑memory buffer.
- Temporal decay: Attention weights are multiplied by an exponential decay factor based on token distance, gradually reducing influence of distant context.
- Training Data – Two corpora are constructed to reflect realistic language exposure for a child or a low‑resource language:
- 10 M‑token dataset (≈ 10 × the size of a typical child’s early reading material).
- 100 M‑token dataset (still an order of magnitude smaller than standard LLM pre‑training).
- Training Procedure – Models are trained from scratch using the Adam optimizer, a cosine learning‑rate schedule, and standard next‑token prediction loss. No additional supervision or data augmentation is applied.
- Evaluation –
- BLiMP (Benchmark of Linguistic Minimal Pairs) tests the model’s ability to distinguish grammatical from ungrammatical sentences across 67 minimal‑pair paradigms covering a broad range of linguistic phenomena.
- Human reading‑time alignment: Model surprisal scores are correlated with eye‑tracking reading‑time datasets (e.g., Dundee Corpus) to gauge cognitive similarity.
The pipeline is deliberately simple so that developers can replicate the experiments with modest GPU resources.
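The two attention variants can be sketched in a few lines of PyTorch. This is a minimal single‑head illustration of the ideas described above, not the authors' implementation: the function name and arguments are assumptions, and the decay is applied as a pre‑softmax score penalty, which is equivalent to multiplying the unnormalized attention weights by an exponential decay factor.

```python
import math
import torch

def constrained_attention(q, k, v, window=None, decay=None):
    """Causal scaled dot-product attention with optional memory constraints.

    q, k, v: (seq_len, d) tensors for a single head.
    window:  if set, each token attends only to the `window` most recent
             tokens (itself included), emulating a fixed-size memory buffer.
    decay:   if set, scores are reduced by decay * distance, i.e. the
             unnormalized weights shrink by a factor of e^(-decay * distance).
    """
    seq_len, d = q.shape
    scores = q @ k.T / math.sqrt(d)              # (seq_len, seq_len)
    pos = torch.arange(seq_len)
    dist = pos[:, None] - pos[None, :]           # distance from query i to key j
    mask = dist < 0                              # causal: no attending to the future
    if window is not None:
        mask |= dist >= window                   # drop keys outside the window
    scores = scores.masked_fill(mask, float("-inf"))
    if decay is not None:
        scores = scores - decay * dist.clamp(min=0)  # exponential temporal decay
    attn = torch.softmax(scores, dim=-1)
    return attn @ v
```

With `window=1` each token can only attend to itself, so the output reduces to `v`; widening the window interpolates back toward full causal attention.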
Results & Findings
| Model (Data) | BLiMP Avg. Accuracy | Reading‑time Correlation (ρ) |
|---|---|---|
| Standard GPT‑2 (10 M) | 71.2 % | 0.31 |
| Fixed‑width (10 M) | 78.5 % (+7.3 pp) | 0.38 (+0.07) |
| Temporal‑decay (10 M) | 75.1 % | 0.35 |
| Standard GPT‑2 (100 M) | 80.4 % | 0.42 |
| Fixed‑width (100 M) | 85.2 % (+4.8 pp) | 0.47 (+0.05) |
| Temporal‑decay (100 M) | 82.9 % | 0.44 |
Key Takeaways
- Fixed‑width attention consistently outperforms the vanilla model, especially when training data is limited (10 M tokens).
- The gains are not just in raw accuracy; the constrained models produce surprisal patterns that track human reading times more closely, suggesting a more human‑like processing strategy.
- Temporal‑decay offers modest improvements, suggesting that even a soft memory limitation acts as a useful inductive bias, though the hard window is more effective.
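The reading‑time alignment metric is simple to reproduce: convert per‑word model probabilities to surprisal and rank‑correlate with human reading times. The sketch below uses made‑up numbers and a hand‑rolled Spearman ρ (no tie correction) purely for illustration; it is not the paper's evaluation code.

```python
import numpy as np

def surprisal(probs):
    """Surprisal in bits: -log2 p(word | context)."""
    return -np.log2(np.asarray(probs, dtype=float))

def spearman_rho(x, y):
    """Spearman rank correlation, assuming no ties (illustrative only)."""
    rx = np.argsort(np.argsort(x)).astype(float)  # ranks of x
    ry = np.argsort(np.argsort(y)).astype(float)  # ranks of y
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx**2).sum() * (ry**2).sum()))

# Hypothetical per-word model probabilities and eye-tracking times (ms)
word_probs    = [0.20, 0.05, 0.50, 0.01, 0.30]
reading_times = [210,  320,  180,  400,  230]

rho = spearman_rho(surprisal(word_probs), reading_times)  # higher = better aligned
```

A model whose surprisal rises exactly where humans slow down would score ρ = 1; the paper's reported correlations in the 0.3–0.5 range reflect the noisiness of eye‑tracking data.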
Practical Implications
- Low‑resource language modeling – Developers building NLP tools for under‑represented languages can adopt windowed attention to squeeze more linguistic competence out of small corpora, reducing the need for massive data collection.
- Edge‑device LLMs – Fixed‑width attention naturally limits the number of keys/values each token must attend to, lowering memory bandwidth and compute. This aligns well with on‑device inference constraints (e.g., smartphones, IoT).
- Curriculum‑aware training – The approach mirrors how humans learn (starting with short contexts and gradually expanding). Training pipelines could start with narrow windows and progressively widen them, potentially improving convergence speed.
- Interpretability & debugging – A bounded attention window makes it easier to trace why a model made a particular prediction, aiding error analysis and compliance audits.
- Human‑compatible AI – Better alignment with human reading‑time data may translate to more predictable behavior in human‑in‑the‑loop applications such as assistive writing tools or educational software.
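The curriculum idea above can be reduced to a schedule that maps training step to window width. The linear shape and the specific numbers here are assumptions for illustration, not from the paper:

```python
def window_for_step(step, start=8, final=64, warmup_steps=10_000):
    """Linearly widen the attention window from `start` to `final` tokens
    over `warmup_steps` training steps, then hold it constant."""
    if step >= warmup_steps:
        return final
    return start + (final - start) * step // warmup_steps
```

The returned width would be passed to the windowed attention at each step, so early training sees only short contexts and later training gradually recovers longer‑range dependencies.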
Limitations & Future Work
- Scope of tasks – The study focuses on grammatical judgment and reading‑time correlation; it does not evaluate downstream tasks like translation, summarization, or question answering.
- Fixed window size – A single window width may be suboptimal across different linguistic phenomena; adaptive or hierarchical windows could yield further gains.
- Scalability – Experiments are limited to GPT‑2‑scale models; it remains open how these constraints interact with much larger architectures (e.g., GPT‑3, PaLM).
- Human data alignment – Correlation with reading times is modest; richer cognitive signals (e.g., EEG, fMRI) could provide deeper validation.
Future research directions include dynamic memory budgets, multi‑scale attention, and cross‑lingual experiments to verify whether the observed benefits generalize beyond English.
If you’re curious to try these ideas yourself, the authors have released a lightweight PyTorch implementation and pretrained checkpoints on GitHub. Plug the windowed_attention module into any Hugging Face GPT2Model and start experimenting with your own low‑resource datasets.
Authors
- Pranava Madhyastha
- Dagmar Adamcova
Paper Information
- arXiv ID: 2604.20789v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: April 22, 2026
- PDF: Download PDF