[Paper] Convergent Evolution: How Different Language Models Learn Similar Number Representations
Source: arXiv - 2604.20817v1
Overview
This paper investigates why a surprisingly wide variety of language models, from classic static word embeddings to modern Transformers, end up encoding numbers in almost the same way. By probing the Fourier spectrum of the learned representations, the authors show that most models develop periodic features with dominant periods of 2, 5, and 10. They then examine when these periodic signals are actually useful for tasks like "what is n mod 5?" and identify the training conditions under which geometric separability emerges.
Key Contributions
- Discovery of a universal periodic pattern (periods 2, 5, 10) in number representations across heterogeneous model families.
- Two‑tiered hierarchy: (1) Fourier sparsity—all models exhibit spikes at the key periods; (2) Geometric separability—only some models can linearly separate numbers modulo T.
- Theoretical insight: a proof that Fourier sparsity is a necessary but not a sufficient condition for mod‑T linear separability.
- Empirical taxonomy of the factors (data, architecture, optimizer, tokenizer) that enable the second tier of separability.
- Identification of two distinct learning routes: (a) co‑occurrence signals in natural text (e.g., “three apples”, “twenty‑four hours”), and (b) multi‑token arithmetic problems that force the model to combine token embeddings.
- Evidence of “convergent evolution”—different models converge on the same representational tricks despite disparate training objectives and structures.
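A one‑dimensional toy construction (ours, for illustration; it is not taken from the paper) shows why Fourier sparsity alone cannot guarantee separability. Consider the scalar embedding

$$e(n) = \cos\!\left(\frac{2\pi n}{5}\right).$$

Its spectrum has a single spike at period 5, yet residues that are reflections of each other collapse: $\cos(2\pi \cdot 1/5) = \cos(2\pi \cdot 4/5)$, so the classes $n \equiv 1$ and $n \equiv 4 \pmod 5$ receive identical embeddings and no classifier, linear or otherwise, can tell them apart. Adding the quadrature component $\sin(2\pi n/5)$ restores separability: the five residues then map to five distinct, linearly separable points on a circle.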
Methodology
- Model Suite – Trained or fine‑tuned a broad set of models: static word2vec/GloVe embeddings, linear RNNs, LSTMs, and Transformer‑based language models (GPT‑style).
- Probing Task – Constructed a simple classification probe: given a token embedding, predict the remainder of the underlying integer modulo T (T ∈ {2, 5, 10}) using a linear classifier.
- Fourier Analysis – Applied a discrete Fourier transform (DFT) to the embedding vectors of numbers 0‑99, looking for spikes at the target periods.
- Geometric Test – Measured linear separability by the probe’s accuracy; high accuracy indicates that the periodic feature is geometrically aligned with a linear decision boundary.
- Controlled Experiments – Varied one factor at a time (e.g., tokenizer granularity, optimizer type, presence/absence of arithmetic examples) to isolate its impact on separability.
- Theoretical Proof – Formalized the relationship between sparsity in the Fourier domain and the existence of a linear separator, showing the former is necessary but not sufficient.
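The DFT‑plus‑probe pipeline can be sketched on synthetic embeddings. This is a minimal illustration with an injected period‑5 feature, not the authors' code or data; all names here are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 32                      # numbers 0..99, embedding dim

# Synthetic embeddings: noise plus a periodic feature with period 5,
# mimicking the Fourier structure the paper reports.
n = np.arange(N)
E = 0.1 * rng.standard_normal((N, D))
E[:, 0] += np.cos(2 * np.pi * n / 5)
E[:, 1] += np.sin(2 * np.pi * n / 5)

# --- Fourier analysis: DFT along the number axis, per dimension ---
spectrum = np.abs(np.fft.rfft(E, axis=0))    # shape (N//2 + 1, D)
power = spectrum.sum(axis=1)                 # aggregate over dimensions
freqs = np.fft.rfftfreq(N)                   # cycles per integer step
k = np.argmax(power[1:]) + 1                 # skip the DC component
dominant_period = 1 / freqs[k]
print(f"dominant period ≈ {dominant_period:.1f}")   # expect ≈ 5

# --- Linear probe: predict n mod T with a least-squares linear map ---
T = 5
Y = np.eye(T)[n % T]                         # one-hot residue targets
X = np.hstack([E, np.ones((N, 1))])          # add a bias column
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
acc = (np.argmax(X @ W, axis=1) == n % T).mean()
print(f"probe accuracy: {acc:.2f}")
```

With the quadrature pair of periodic features present, the five residue classes fall on distinct points of a circle in two embedding dimensions, so the linear probe achieves high accuracy; dropping the `sin` component collapses reflected residues and accuracy falls, mirroring the paper's "sparse but not separable" regime.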
Results & Findings
| Model / Setting | Fourier spikes at T=2,5,10? | Linear mod‑T separability (probe accuracy) |
|---|---|---|
| Static word embeddings (GloVe) | ✅ | Low (≈55 % for T=5) |
| Linear RNN (trained on raw text) | ✅ | Moderate (≈70 % for T=5) |
| LSTM (standard LM) | ✅ | High (≈90 % for T=5) |
| Transformer (GPT‑2 size) | ✅ | Very high (≈96 % for T=5) |
| Same Transformer without arithmetic examples | ✅ | Drops to ~78 % |
| Same Transformer with multi‑token addition data | ✅ | Rises to ~98 % |
- Fourier sparsity appeared universally—every model’s number embeddings showed clear peaks at the three periods.
- Geometric separability varied dramatically. Architectures with deeper non‑linearities (LSTM, Transformer) and training regimes that exposed the model to numeric co‑occurrence or explicit addition problems achieved near‑perfect linear classification.
- Optimizer effect: Adam‑based training tended to produce sharper Fourier spikes and higher separability than SGD.
- Tokenizer granularity mattered: sub‑word tokenizers that split numbers into multiple tokens (e.g., “12” → “1”, “2”) facilitated learning of addition‑style signals, boosting separability.
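The digit‑splitting effect can be mimicked with a toy pre‑tokenizer. This sketch is our own illustration of the idea, not any production tokenizer's actual rules:

```python
import re

def split_digits(text: str) -> list[str]:
    # Toy pre-tokenizer: emit each digit as its own token while keeping
    # words and punctuation whole, so "12" becomes ["1", "2"].
    return re.findall(r"\d|[A-Za-z]+|[^\w\s]", text)

print(split_digits("24 hours"))   # ['2', '4', 'hours']
```

Exposing numbers as digit sequences in this way forces the model to compose token embeddings to recover a number's value, which is exactly the pressure the paper credits with producing addition‑style signals.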
Practical Implications
- Prompt Engineering – Knowing that models already encode a clean mod‑T signal suggests that prompts asking for “odd/even” or “multiple‑of‑5” can be answered with minimal prompting, or even with a simple linear read‑out layer on top of the hidden states.
- Debugging Numeric Reasoning – If a model fails on a numeric task, checking its Fourier spectrum can quickly reveal whether the underlying representation is even capable of supporting modular reasoning.
- Model Compression & Distillation – Since the periodic feature is a low‑dimensional, interpretable signal, it could be preserved explicitly during distillation, yielding smaller models that retain numeric competence.
- Tokenizer Design – For applications that require strong arithmetic abilities (e.g., code generation, spreadsheet assistants), using tokenizers that expose multi‑token number structures can be a cheap way to boost performance.
- Data Augmentation – Adding synthetic co‑occurrence or addition examples to the pre‑training corpus is an effective, low‑cost method to induce geometric separability without redesigning the architecture.
- Safety & Auditing – Understanding that many models converge on the same numeric encoding helps auditors predict failure modes (e.g., systematic bias for numbers ending in certain digits) across different model families.
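The data‑augmentation idea above can be sketched as a small corpus generator. The function name and output format are our assumptions; the paper does not prescribe a specific recipe:

```python
import random

def addition_examples(count: int, max_n: int = 100, seed: int = 0) -> list[str]:
    # Generate multi-token addition lines to mix into a pretraining corpus,
    # following the finding that explicit arithmetic data boosts mod-T
    # separability. The "a + b = c" format is illustrative only.
    rng = random.Random(seed)
    lines = []
    for _ in range(count):
        a, b = rng.randrange(max_n), rng.randrange(max_n)
        lines.append(f"{a} + {b} = {a + b}")
    return lines

for line in addition_examples(3):
    print(line)
```

Mixing even a small fraction of such lines into the corpus is, per the paper's controlled experiments, enough to move a model from the "sparse only" tier into the linearly separable one.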
Limitations & Future Work
- The study focuses on English‑language corpora and Arabic numerals; it remains open how these findings translate to languages with different numeral systems or to non‑Arabic scripts.
- Probes were limited to mod‑T classification for T = 2, 5, 10; other numeric properties (e.g., magnitude ordering, prime detection) were not examined.
- The theoretical analysis assumes linear classifiers; non‑linear downstream heads could exploit the periodic features differently.
- Future research directions include: extending the analysis to multimodal models (e.g., vision‑language), exploring continual‑learning scenarios where numeric representations might drift, and designing explicit regularizers that enforce desirable periodic structures during training.
Authors
- Deqing Fu
- Tianyi Zhou
- Mikhail Belkin
- Vatsal Sharan
- Robin Jia
Paper Information
- arXiv ID: 2604.20817v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: April 22, 2026
- PDF: Download PDF