[Paper] Less is more: Probabilistic reduction is best explained by small-scale predictability measures

Published: December 29, 2025 at 01:12 PM EST
3 min read

Source: arXiv - 2512.23659v1

Overview

This paper asks a surprisingly practical question: How much linguistic context do we really need to link language‑model probabilities to human cognitive behavior? By systematically comparing full‑sentence probabilities with short‑range n‑gram estimates, the authors show that small‑scale predictability measures are sufficient to capture the “probabilistic reduction” effect observed in psycholinguistic experiments. The finding challenges the assumption that large, context‑rich models are always required for cognitive modeling.

Key Contributions

  • Empirical evidence that n‑gram (2‑ to 5‑gram) predictability scores predict human processing difficulty as well as full‑sentence language‑model probabilities.
  • Formal definition of “probabilistic reduction” and a clear experimental protocol for measuring it across different context windows.
  • Cross‑modal validation using eye‑tracking and self‑paced reading datasets, demonstrating robustness across tasks.
  • Open‑source toolkit for extracting n‑gram surprisal and comparing it with transformer‑based surprisal, facilitating reproducibility.
  • Theoretical insight that cognitive planning units may be much smaller than whole utterances, aligning computational models with psycholinguistic theories of incremental processing.

Methodology

  1. Data – The authors used three standard psycholinguistic corpora (the Dundee eye‑tracking corpus, the Natural Stories self‑paced reading dataset, and a spoken‑language comprehension dataset).
  2. Predictability Measures
    • Full‑sentence surprisal was computed with a pretrained transformer LM (GPT‑2).
    • n‑gram surprisal was derived from a smoothed 5‑gram model trained on the same data.
    • Both measures were expressed as surprisal, i.e., the negative log‑probability of each target word (a minimal computation sketch follows this list).
  3. Probabilistic Reduction Test – For each word, they examined whether adding more context (going from 2‑gram → 3‑gram → … → full sentence) significantly improved the correlation with human reading times.
  4. Statistical Analysis – Mixed‑effects regression models with random intercepts for participants and items were used to compare the predictive power of each context size (a second sketch after this list illustrates the comparison).
  5. Tooling – The authors released a Python package (probred) that automates n‑gram extraction, surprisal calculation, and regression fitting.
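
To make step 2 concrete, the sketch below shows how the two surprisal measures can be computed with off‑the‑shelf libraries. It is a minimal illustration, not the authors' probred code: it assumes Hugging Face transformers and torch for the GPT‑2 baseline and NLTK for the n‑gram model, and it uses add‑one (Laplace) smoothing as a simple stand‑in for the paper's smoothed 5‑gram.

```python
# Minimal sketch of the two surprisal measures, not the authors' probred code.
# Assumes: transformers + torch for the GPT-2 baseline, nltk for the n-gram model.
# Laplace (add-one) smoothing stands in for the paper's smoothed 5-gram.
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def gpt2_surprisal(sentence):
    """Surprisal in bits (-log2 p) of every GPT-2 subword token after the first."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logprobs = torch.log_softmax(gpt2(**enc).logits, dim=-1)
    ids = enc.input_ids[0]
    return [
        (tokenizer.decode(int(ids[i])),
         -logprobs[0, i - 1, ids[i]].item() / math.log(2))  # nats -> bits
        for i in range(1, len(ids))
    ]  # word-level surprisal sums the surprisals of a word's subword pieces

def train_ngram(tokenized_sentences, order=5):
    """Fit a smoothed n-gram LM on pre-tokenized training sentences."""
    ngrams, vocab = padded_everygram_pipeline(order, tokenized_sentences)
    model = Laplace(order)
    model.fit(ngrams, vocab)
    return model

def ngram_surprisal(model, words, order=5):
    """Surprisal in bits of each word given at most order-1 preceding tokens."""
    padded = list(pad_both_ends(words, n=order))
    return [
        (w, -model.logscore(w, padded[max(0, i - order + 1):i]))
        for i, w in enumerate(padded)
        if w not in ("<s>", "</s>")
    ]
```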
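
Steps 3 and 4 amount to fitting the same reading‑time regression once per context size and checking where the fit stops improving. The sketch below illustrates that comparison under stated assumptions: the data frame and its column names (rt, participant, surprisal_2 … surprisal_full) are hypothetical, and only by‑participant random intercepts are shown; the paper's models also include by‑item intercepts, which in statsmodels would require variance components (or lme4 in R).

```python
# Sketch of the context-size comparison (steps 3-4), not the authors' analysis code.
# Hypothetical columns: rt, participant, surprisal_2 ... surprisal_5, surprisal_full.
# Only by-participant random intercepts here; the paper also uses by-item intercepts.
import pandas as pd
import statsmodels.formula.api as smf

def compare_context_sizes(df, sizes=(2, 3, 4, 5, "full")):
    """Fit rt ~ surprisal_<n> with by-participant random intercepts for each n."""
    rows = []
    for n in sizes:
        predictor = f"surprisal_{n}"
        fit = smf.mixedlm(f"rt ~ {predictor}", df, groups=df["participant"]).fit(
            reml=False  # maximum likelihood, so log-likelihoods are comparable
        )
        rows.append({
            "context": str(n),
            "coef": fit.params[predictor],  # effect of surprisal on reading time
            "loglik": fit.llf,              # a plateau here is the paper's key pattern
        })
    return pd.DataFrame(rows)
```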

Results & Findings

  • Plateau Effect – Correlations between surprisal and reading times plateaued at the 4‑gram level; longer contexts offered no statistically significant gain.
  • Comparable Performance – The 4‑gram model explained ~92 % of the variance captured by the full‑sentence GPT‑2 model across all three corpora.
  • Robustness – The plateau held across modalities (visual vs. auditory) and across participant groups (native vs. non‑native speakers).
  • Efficiency Gains – Computing n‑gram surprisal was >100× faster than transformer‑based surprisal, with negligible loss in explanatory power.

Practical Implications

  • Fast Cognitive Metrics – Developers building real‑time readability or comprehension tools can use lightweight n‑gram surprisal instead of heavyweight transformer models, dramatically reducing latency and compute cost (see the usage sketch after this list).
  • Simplified Feature Engineering – For NLP pipelines that incorporate human‑like difficulty predictors (e.g., adaptive tutoring systems, voice assistants that anticipate user difficulty), a short‑range n‑gram model is sufficient.
  • Resource‑Constrained Environments – Edge devices, mobile apps, or low‑power IoT speech interfaces can now embed predictive difficulty measures without needing GPU‑accelerated LMs.
  • Interpretability – n‑gram surprisal is transparent (it directly reflects observable word co‑occurrences), making it easier to audit and explain to stakeholders compared to opaque transformer attention patterns.
  • Benchmarking – The released probred toolkit provides a ready‑made benchmark for evaluating new language models against human processing data, encouraging more cognitively grounded NLP research.
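
To illustrate the first two bullets, the tiny helper below flags locally unpredictable words using the ngram_surprisal sketch from the Methodology section. It is hypothetical (the function name and threshold are illustrative, and it is not part of the probred toolkit), but it shows why the approach is cheap: once the n‑gram model is trained offline, scoring each word takes only a handful of table lookups, with no GPU involved.

```python
# Hypothetical real-time usage sketch; the name and threshold are illustrative
# and not taken from the paper or the probred toolkit. Reuses train_ngram /
# ngram_surprisal from the Methodology sketch above.
def flag_difficult_words(model, words, threshold_bits=12.0, order=5):
    """Return words whose local (n-gram) surprisal exceeds a tunable threshold."""
    return [w for w, s in ngram_surprisal(model, words, order) if s > threshold_bits]

# e.g. an adaptive tutoring system could train the model offline and then call
# this per sentence with negligible latency:
#   fivegram = train_ngram(training_sentences)               # offline, once
#   hard = flag_difficult_words(fivegram, sentence.split())  # at run time
```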

Limitations & Future Work

  • Domain Specificity – Experiments were limited to English narrative and spoken corpora; performance on technical or highly domain‑specific texts remains untested.
  • Higher‑Level Phenomena – While n‑grams capture local predictability, they may miss long‑range discourse effects (e.g., anaphora resolution) that could matter for more complex tasks.
  • Model Variants – Only a single transformer (GPT‑2) and a smoothed 5‑gram model were evaluated; future work could explore other architectures (e.g., recurrent LMs) and adaptive context windows.
  • Neurocognitive Validation – Extending the analysis to EEG or fMRI data could verify whether the same small‑scale predictability holds at the neural level.

Authors

  • Cassandra L. Jacobs
  • Andrés Buxó-Lugo
  • Anna K. Taylor
  • Marie Leopold-Hooke

Paper Information

  • arXiv ID: 2512.23659v1
  • Categories: cs.CL
  • Published: December 29, 2025
