[Paper] N-gram-like Language Models Predict Reading Time Best

Published: March 10, 2026 at 12:35 PM EDT
4 min read
Source: arXiv


Overview

The paper by Michaelov and Levy investigates why the most powerful modern language models—especially transformer‑based ones—sometimes underperform when it comes to predicting how long people spend reading each word. Their key insight is that reading time is driven more by simple n‑gram statistics (the probability of a word given its immediate context) than by the richer, long‑range dependencies captured by state‑of‑the‑art transformers. By linking model predictions to eye‑tracking data, they show that the models whose outputs most closely resemble classic n‑gram probabilities also best predict human reading behavior.

Key Contributions

  • Empirical evidence that n‑gram‑like probability estimates correlate more strongly with eye‑tracking reading‑time measures than the raw probabilities from large transformer LMs.
  • Correlation analysis relating the degree to which a neural LM’s predictions align with n‑gram probabilities to its performance on reading‑time prediction.
  • A unified explanation for the paradox that “bigger is not always better” in cognitive modeling: over‑parameterized models capture linguistic regularities that are irrelevant—or even detrimental—to moment‑by‑moment processing speed.
  • Open‑source code and datasets (eye‑tracking corpora and model outputs) to facilitate replication and further research.

Methodology

  1. Models Tested – A suite of neural language models ranging from small feed‑forward and recurrent nets to large pretrained transformers (e.g., GPT‑2, BERT‑based masked LM).
  2. Baseline n‑gram – Standard 5‑gram models with Kneser‑Ney smoothing were trained on the same corpora as the neural LMs.
  3. Eye‑tracking Data – Naturalistic reading corpora (e.g., Dundee, Provo) providing word‑level fixation durations for hundreds of participants.
  4. Probability Extraction – For each word in the test texts, the models’ next‑word probability (or masked‑word probability) was recorded.
  5. Correlation Analyses
    • Model ↔ n‑gram: Pearson/Spearman correlation between each neural LM’s probability distribution and the corresponding n‑gram probabilities.
    • Model ↔ reading time: Correlation of model‑derived surprisal (−log p) with observed fixation durations.
    • Mediation: Tested whether the model ↔ n‑gram correlation mediates the model ↔ reading‑time relationship.

All steps were implemented in Python with PyTorch for neural LMs and the kenlm library for n‑grams.
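The surprisal-extraction and correlation steps above can be sketched in pure Python. The probabilities and fixation durations below are toy placeholders standing in for real LM/kenlm outputs and eye‑tracking data; only the computations (surprisal as −log₂ p, Pearson r) mirror the pipeline described here.

```python
import math

def surprisal(p):
    """Surprisal in bits: -log2 of the word's probability."""
    return -math.log2(p)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-word next-word probabilities from two models, plus
# observed fixation durations (ms); real values would come from a
# neural LM / kenlm query and an eye-tracking corpus such as Dundee.
ngram_p  = [0.20, 0.05, 0.40, 0.10, 0.02]
neural_p = [0.25, 0.04, 0.35, 0.12, 0.03]
fix_ms   = [210, 280, 190, 240, 310]

ngram_s  = [surprisal(p) for p in ngram_p]
neural_s = [surprisal(p) for p in neural_p]

print(pearson(ngram_s, neural_s))  # model <-> n-gram alignment
print(pearson(ngram_s, fix_ms))    # n-gram surprisal <-> reading time
```

In the full study the same two correlations are computed per model, over every word in the test texts rather than a five-word toy sequence.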

Results & Findings

| Model Type | Correlation with n‑gram (r) | Correlation with reading time (r) |
| --- | --- | --- |
| Small RNN | 0.68 | 0.45 (highest) |
| GPT‑2 (small) | 0.82 | 0.31 |
| GPT‑2 (large) | 0.88 | 0.27 |
| BERT‑masked | 0.79 | 0.33 |
  • Higher n‑gram alignment → better reading‑time prediction. Models whose surprisal values were most similar to the 5‑gram baseline showed the strongest correlation with fixation durations.
  • Diminishing returns for larger transformers. As model size and training data increase, the correlation with n‑gram statistics rises, but the reading‑time correlation plateaus or even drops.
  • Mediation analysis confirmed that the n‑gram similarity accounts for a significant portion (≈ 60 %) of the variance in reading‑time prediction across models.
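A common way to probe mediation of this kind is a partial correlation: if n‑gram similarity mediates the model ↔ reading‑time link, the model ↔ reading‑time correlation should shrink toward zero once n‑gram surprisal is controlled for. The sketch below uses the standard partial‑correlation formula with hypothetical r values, not numbers from the paper.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """Partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical correlations: x = neural-LM surprisal, y = reading time,
# z = n-gram surprisal (the proposed mediator).
r_xy, r_xz, r_yz = 0.30, 0.80, 0.40

# A raw r_xy of 0.30 that collapses near zero after controlling for z
# is the signature of (near-)full mediation.
print(partial_r(r_xy, r_xz, r_yz))
```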

Practical Implications

  • Cognitive‑aware NLP tools: When building applications that need to model human reading (e.g., readability scoring, adaptive text simplification, or eye‑tracking‑based UI feedback), simple n‑gram surprisal may be a more reliable feature than raw transformer probabilities.
  • Model selection for psycholinguistic tasks: Researchers and developers should consider lightweight n‑gram or hybrid models rather than defaulting to the largest transformer available, saving compute and storage while improving predictive validity.
  • Explainability & debugging: The finding that “over‑fitting” to long‑range patterns harms reading‑time prediction suggests a diagnostic: compare a model’s output to an n‑gram baseline to gauge whether it’s capturing irrelevant high‑order statistics.
  • Real‑time applications: Since n‑gram models are orders of magnitude faster to query, they enable low‑latency, on‑device estimation of reading difficulty for e‑readers, educational software, or assistive technologies.
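The low‑latency point above comes down to the fact that an n‑gram query is essentially a hash lookup. A minimal illustration, with a toy bigram table standing in for a real kenlm model (the table entries and the `<unk>` fallback are made up for the example):

```python
import math

# Toy bigram table of log10 probabilities, standing in for a kenlm model;
# the point is that each query is one dictionary access, cheap enough for
# on-device, per-word readability estimates.
BIGRAM_LOG10P = {
    ("the", "cat"): -1.2,
    ("cat", "sat"): -0.9,
}
UNK_LOG10P = -4.0  # fallback for unseen bigrams

def word_surprisal(prev, word):
    """Surprisal in bits from a precomputed bigram log10-probability table."""
    log10p = BIGRAM_LOG10P.get((prev, word), UNK_LOG10P)
    return -log10p * math.log2(10)  # convert log10 to bits

def difficulty(tokens):
    """Mean per-word surprisal as a crude reading-difficulty score."""
    scores = [word_surprisal(p, w) for p, w in zip(tokens, tokens[1:])]
    return sum(scores) / len(scores)

print(difficulty(["the", "cat", "sat"]))
```

A real deployment would load a Kneser‑Ney‑smoothed model via kenlm instead of a hand-written dictionary, but the per-word cost stays a constant-time lookup.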

Limitations & Future Work

  • Domain restriction: Experiments were limited to English news and narrative texts; performance on technical or highly colloquial domains remains unknown.
  • Eye‑tracking granularity: Only fixation duration was examined; other metrics such as regression rates or pupil dilation could reveal additional nuances.
  • Model diversity: The study focused on next‑word prediction models; future work could explore encoder‑decoder architectures and multimodal LMs.
  • Hybrid approaches: The authors suggest investigating interpolated models that combine n‑gram and transformer probabilities, potentially capturing the best of both worlds.
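The interpolation idea in the last bullet can be sketched as a per-word linear mixture of the two probability estimates. The probabilities and the mixing weight `lam` below are illustrative; in practice `lam` would be tuned, e.g. against held-out reading-time data.

```python
import math

def interpolate(p_ngram, p_neural, lam):
    """Linearly interpolate two next-word probability estimates."""
    return lam * p_ngram + (1.0 - lam) * p_neural

# Hypothetical per-word probabilities for the same test sentence.
p_ngram  = [0.20, 0.05, 0.40]
p_neural = [0.25, 0.04, 0.35]

mixed = [interpolate(a, b, lam=0.6) for a, b in zip(p_ngram, p_neural)]
mixed_surprisal = [-math.log2(p) for p in mixed]
print(mixed_surprisal)
```

The resulting mixed surprisal could then be evaluated against fixation durations exactly as the pure n‑gram and neural estimates are in the paper.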

Overall, the paper challenges the assumption that “bigger is always better” for cognitive modeling and offers a concrete, developer‑friendly takeaway: sometimes the simplest statistical model is the most human‑like.

Authors

  • James A. Michaelov
  • Roger P. Levy

Paper Information

  • arXiv ID: 2603.09872v1
  • Categories: cs.CL
  • Published: March 10, 2026