[Paper] Large language models and the entropy of English
Source: arXiv - 2512.24969v1
Overview
The authors harness state‑of‑the‑art large language models (LLMs) to probe how predictable a character of English text becomes as the preceding context grows. Their analysis shows that the conditional entropy keeps dropping even when the context stretches to ~10 000 characters, revealing surprisingly long‑range dependencies in natural language. This finding bears directly on language modeling, compression, and the statistical physics of text.
Key Contributions
- Empirical evidence of ultra‑long‑range structure: Demonstrated that conditional entropy (or code length) continues to decrease with context lengths up to ~10 k characters across diverse English corpora.
- Model‑independent correlation detection: Verified, directly from raw data, small but statistically significant character‑level correlations at these large separations.
- Entropy distribution analysis: Showed that as context grows, the distribution of per‑character code lengths sharpens, indicating an emergent certainty about a growing fraction of characters.
- Training dynamics insight: Identified distinct learning phases for short vs. long contexts during LLM training, suggesting that long‑range structure is acquired gradually.
- Constraints for physics‑based language models: Provided quantitative benchmarks that any statistical‑physics‑inspired model of language must satisfy.
Methodology
- Data Collection – The study draws from multiple English text sources (books, news, web text) to ensure broad coverage.
- LLM Probing – Pre‑trained transformer‑based LLMs (e.g., GPT‑style architectures) are used to compute the conditional probability of each character given a preceding context of length N.
- Entropy Estimation – For each N, the cross‑entropy (average code length) is calculated as
$$
H(N) = -\frac{1}{L}\sum_{i=1}^{L}\log_2 P\left(c_i \mid c_{i-N}^{\,i-1}\right),
$$
where $c_i$ is the i‑th character, $c_{i-N}^{\,i-1}$ denotes the N preceding characters, and $L$ is the number of characters scored. A minimal estimator sketch follows this list.
- Correlation Checks – Independently of the model, the authors compute pairwise mutual information between characters separated by up to 10 k positions to confirm that the observed entropy drop isn’t an artifact of the model.
- Training‑time Analysis – By checkpointing the LLM at various training steps, they track how $H(N)$ evolves for short (N < 100) versus long (N > 1 000) contexts.
All steps are implemented with standard deep‑learning libraries (PyTorch/TensorFlow) and open‑source statistical tools, making the pipeline reproducible for developers.
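As a concrete illustration of the entropy‑estimation step, the sketch below computes $H(N)$ from per‑character predictions. The callable `char_log2_prob` is a hypothetical stand‑in for whatever character‑level language model supplies $\log_2 P(c_i \mid c_{i-N}^{\,i-1})$; this is a minimal sketch, not the authors' pipeline.

```python
def conditional_entropy(text, N, char_log2_prob):
    """Estimate H(N) in bits/character: the average code length of each
    character given its N preceding characters.

    `char_log2_prob(context, char)` is a hypothetical callable returning
    log2 P(char | context) from any character-level predictor.
    """
    # Code length of character i is -log2 of its predicted probability.
    code_lengths = [
        -char_log2_prob(text[i - N:i], text[i])
        for i in range(N, len(text))
    ]
    return sum(code_lengths) / len(code_lengths), code_lengths

# Sweeping N traces the entropy-vs-context curve reported under Results:
# for N in (10, 100, 1_000, 10_000):
#     h, lengths = conditional_entropy(corpus_text, N, char_log2_prob)
```

Keeping the individual code lengths (rather than only their mean) also yields the per‑character code‑length distribution whose sharpening with N is discussed under Results.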
Results & Findings
| Context Length (N) | Conditional Entropy H(N) (bits/char) | Observation |
|---|---|---|
| 10 – 100 | ~4.5 → 4.2 | Rapid drop, reflecting familiar short‑range syntax. |
| 100 – 1 000 | ~4.2 → 3.9 | Continued improvement; captures paragraph‑level coherence. |
| 1 000 – 10 000 | ~3.9 → 3.7 | Entropy still decreasing, indicating dependencies across whole sections or chapters. |
| >10 000 | Plateau (≈3.6) | Suggests a practical limit for current models/corpora. |
- Correlation detection: Mutual information between characters separated by 5 k–10 k positions is small (~10⁻³ bits) but statistically robust (p < 0.001); a minimal plug‑in estimator is sketched after this list.
- Training dynamics: Early training epochs quickly reduce entropy for short contexts, while reductions for long contexts become noticeable only after millions of gradient steps.
- Entropy distribution: The variance of per‑character code lengths shrinks with larger N, meaning the model becomes more confident about a larger subset of characters (e.g., predictable function words, recurring phrases).
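The model‑independent correlation check can be illustrated with a plug‑in estimate of the mutual information between characters at a fixed separation. The sketch below is a minimal, uncorrected version (no bias correction or shuffle‑based significance test), assuming only `numpy` and the standard library, and is not the authors' estimator.

```python
from collections import Counter
import numpy as np

def pairwise_mutual_information(text, d):
    """Plug-in estimate (bits) of the mutual information between characters
    separated by d positions, computed from empirical pair counts alone
    (no language model involved)."""
    pairs = Counter(zip(text[:-d], text[d:]))       # joint counts of (c_i, c_{i+d})
    px, py = Counter(text[:-d]), Counter(text[d:])  # marginal counts
    total = sum(pairs.values())                     # number of pairs = len(text) - d
    mi = 0.0
    for (a, b), n in pairs.items():
        # p(a,b) * log2[ p(a,b) / (p(a) p(b)) ], written in terms of raw counts
        mi += (n / total) * np.log2(n * total / (px[a] * py[b]))
    return mi
```

Plug‑in estimates of this kind are biased upward on finite samples, so signals as small as the ~10⁻³ bits quoted above must be checked against a shuffled‑text baseline, consistent with the significance testing reported here.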
Practical Implications
- Better Compression Algorithms – Knowing that meaningful predictability extends to thousands of characters can inspire text compressors that maintain larger sliding windows, achieving higher compression ratios for long documents (a back‑of‑the‑envelope estimate follows this list).
- Prompt Engineering & Retrieval‑Augmented Generation – For LLM‑driven applications (code assistants, chatbots), feeding longer context windows (or using retrieval mechanisms that emulate them) can unlock more coherent, globally consistent outputs.
- Model Architecture Design – The gradual acquisition of long‑range structure suggests benefits for memory‑augmented or hierarchical transformers that allocate dedicated capacity for distant dependencies.
- Evaluation Benchmarks – The entropy‑vs‑context curve provides a quantitative benchmark for future LLMs: a model that flattens earlier is likely missing long‑range linguistic cues.
- Statistical‑Physics Modeling – Researchers attempting to map language to spin‑glass or polymer models now have concrete entropy scaling data to calibrate their theories.
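To make the compression implication concrete, the entropies in the Results table translate directly into ideal (arithmetic‑coding) file sizes. The figures below simply re‑use the table's values for a hypothetical 1 MB plain‑text document; they are an illustration, not a result from the paper.

```python
doc_chars = 1_000_000          # hypothetical 1 MB (one byte per character) document
short_ctx_bits = 4.2           # ~100-character context, from the table above
long_ctx_bits = 3.7            # ~10 000-character context, from the table above

# Ideal compressed size = bits/char x number of characters, converted to KiB.
size_short = doc_chars * short_ctx_bits / 8 / 1024
size_long = doc_chars * long_ctx_bits / 8 / 1024
print(f"{size_short:.0f} KiB -> {size_long:.0f} KiB "
      f"({100 * (1 - long_ctx_bits / short_ctx_bits):.0f}% smaller)")
# ~513 KiB -> ~452 KiB, roughly a 12% reduction from the longer window alone.
```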
Limitations & Future Work
- Character‑level focus – While character granularity uncovers fine‑scale correlations, word‑ or subword‑level analyses could reveal additional structure relevant to modern tokenizers.
- Corpus diversity – The study primarily uses standard English prose; extending to code, scientific writing, or multilingual corpora may exhibit different scaling behaviors.
- Model family – Experiments are limited to transformer‑based LLMs; other architectures (e.g., recurrent, convolutional) might learn long‑range patterns differently.
- Computational cost – Estimating entropy for N ≈ 10⁴ requires substantial GPU memory and inference time, which may limit reproducibility for smaller teams.
Future research directions include scaling the analysis to megabyte‑scale contexts, exploring adaptive context windows during inference, and integrating physics‑inspired regularizers that explicitly encourage long‑range consistency.
Authors
- Colin Scheibner
- Lindsay M. Smith
- William Bialek
Paper Information
- arXiv ID: 2512.24969v1
- Categories: cond-mat.stat-mech, cs.CL, physics.bio-ph, q-bio.NC
- Published: December 31, 2025