[Paper] Convergent Evolution: How Different Language Models Learn Similar Number Representations
Source: arXiv - 2604.20817v1
Overview
This paper investigates why a surprisingly wide variety of language models, from classic static word embeddings to modern Transformers, end up encoding numbers in almost the same way. By probing the Fourier spectrum of the learned representations, the authors show that most models develop periodic features with dominant periods of 2, 5, and 10. They then examine when these periodic signals are actually useful for tasks like "what is n mod 5?" and identify the training conditions under which geometric separability emerges.
Key Contributions
- Discovery of a universal periodic pattern (periods 2, 5, 10) in number representations across heterogeneous model families.
- Two‑tiered hierarchy: (1) Fourier sparsity—all models exhibit spikes at the key periods; (2) Geometric separability—only some models can linearly separate numbers modulo T.
- Theoretical insight: a proof that Fourier sparsity is a necessary but not a sufficient condition for mod‑T linear separability.
- Empirical taxonomy of the factors (data, architecture, optimizer, tokenizer) that enable the second tier of separability.
- Identification of two distinct learning routes: (a) co‑occurrence signals in natural text (e.g., “three apples”, “twenty‑four hours”), and (b) multi‑token arithmetic problems that force the model to combine token embeddings.
- Evidence of “convergent evolution”—different models converge on the same representational tricks despite disparate training objectives and structures.
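A one‑dimensional toy construction (ours, for illustration; it is not taken from the paper) shows why Fourier sparsity alone cannot guarantee separability. Consider the scalar embedding

$$e(n) = \cos\!\left(\frac{2\pi n}{5}\right).$$

Its spectrum has a single spike at period 5, yet residues that are reflections of each other collapse: $\cos(2\pi \cdot 1/5) = \cos(2\pi \cdot 4/5)$, so the classes $n \equiv 1$ and $n \equiv 4 \pmod 5$ receive identical embeddings and no classifier, linear or otherwise, can tell them apart. Adding the quadrature component $\sin(2\pi n/5)$ restores separability: the five residues then map to five distinct, linearly separable points on a circle.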
Methodology
- Model Suite – Trained or fine‑tuned a broad set of models: static word2vec/GloVe embeddings, linear RNNs, LSTMs, and Transformer‑based language models (GPT‑style).
- Probing Task – Constructed a simple classification probe: given a token embedding, predict the remainder of the underlying integer modulo T (T ∈ {2, 5, 10}) using a linear classifier.
- Fourier Analysis – Applied a discrete Fourier transform (DFT) to the embedding vectors of numbers 0‑99, looking for spikes at the target periods.
- Geometric Test – Measured linear separability by the probe’s accuracy; high accuracy indicates that the periodic feature is geometrically aligned with a linear decision boundary.
- Controlled Experiments – Varied one factor at a time (e.g., tokenizer granularity, optimizer type, presence/absence of arithmetic examples) to isolate its impact on separability.
- Theoretical Proof – Formalized the relationship between sparsity in the Fourier domain and the existence of a linear separator, showing the former is necessary but not sufficient.
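The DFT‑plus‑probe pipeline can be sketched on synthetic embeddings. This is a minimal illustration with an injected period‑5 feature, not the authors' code or data; all names here are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 32                      # numbers 0..99, embedding dim

# Synthetic embeddings: noise plus a periodic feature with period 5,
# mimicking the Fourier structure the paper reports.
n = np.arange(N)
E = 0.1 * rng.standard_normal((N, D))
E[:, 0] += np.cos(2 * np.pi * n / 5)
E[:, 1] += np.sin(2 * np.pi * n / 5)

# --- Fourier analysis: DFT along the number axis, per dimension ---
spectrum = np.abs(np.fft.rfft(E, axis=0))    # shape (N//2 + 1, D)
power = spectrum.sum(axis=1)                 # aggregate over dimensions
freqs = np.fft.rfftfreq(N)                   # cycles per integer step
k = np.argmax(power[1:]) + 1                 # skip the DC component
dominant_period = 1 / freqs[k]
print(f"dominant period ≈ {dominant_period:.1f}")   # expect ≈ 5

# --- Linear probe: predict n mod T with a least-squares linear map ---
T = 5
Y = np.eye(T)[n % T]                         # one-hot residue targets
X = np.hstack([E, np.ones((N, 1))])          # add a bias column
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
acc = (np.argmax(X @ W, axis=1) == n % T).mean()
print(f"probe accuracy: {acc:.2f}")
```

With the quadrature pair of periodic features present, the five residue classes fall on distinct points of a circle in two embedding dimensions, so the linear probe achieves high accuracy; dropping the `sin` component collapses reflected residues and accuracy falls, mirroring the paper's "sparse but not separable" regime.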
Results & Findings
| Model / Setting | Fourier spikes at T=2,5,10? | Linear mod‑T separability (probe accuracy) |
|---|---|---|
| Static word embeddings (GloVe) | ✅ | Low (≈55 % for T=5) |
| Linear RNN (trained on raw text) | ✅ | Moderate (≈70 % for T=5) |
| LSTM (standard LM) | ✅ | High (≈90 % for T=5) |
| Transformer (GPT‑2 size) | ✅ | Very high (≈96 % for T=5) |
| Same Transformer without arithmetic examples | ✅ | Drops to ~78 % |
| Same Transformer with multi‑token addition data | ✅ | Rises to ~98 % |
- Fourier sparsity appeared universally—every model’s number embeddings showed clear peaks at the three periods.
- Geometric separability varied dramatically. Architectures with deeper non‑linearities (LSTM, Transformer) and training regimes that exposed the model to numeric co‑occurrence or explicit addition problems achieved near‑perfect linear classification.
- Optimizer effect: Adam‑based training tended to produce sharper Fourier spikes and higher separability than SGD.
- Tokenizer granularity mattered: sub‑word tokenizers that split numbers into multiple tokens (e.g., “12” → “1”, “2”) facilitated learning of addition‑style signals, boosting separability.
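The digit‑splitting effect can be mimicked with a toy pre‑tokenizer. This sketch is our own illustration of the idea, not any production tokenizer's actual rules:

```python
import re

def split_digits(text: str) -> list[str]:
    # Toy pre-tokenizer: emit each digit as its own token while keeping
    # words and punctuation whole, so "12" becomes ["1", "2"].
    return re.findall(r"\d|[A-Za-z]+|[^\w\s]", text)

print(split_digits("24 hours"))   # ['2', '4', 'hours']
```

Exposing numbers as digit sequences in this way forces the model to compose token embeddings to recover a number's value, which is exactly the pressure the paper credits with producing addition‑style signals.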
Practical Implications
- Prompt Engineering – Knowing that models already encode a clean mod‑T signal suggests that prompts asking for “odd/even” or “multiple‑of‑5” can be answered with minimal prompting, or even with a simple linear read‑out layer on top of the hidden states.
- Debugging Numeric Reasoning – If a model fails on a numeric task, checking its Fourier spectrum can quickly reveal whether the underlying representation is even capable of supporting modular reasoning.
- Model Compression & Distillation – Since the periodic feature is a low‑dimensional, interpretable signal, it could be preserved explicitly during distillation, yielding smaller models that retain numeric competence.
- Tokenizer Design – For applications that require strong arithmetic abilities (e.g., code generation, spreadsheet assistants), using tokenizers that expose multi‑token number structures can be a cheap way to boost performance.
- Data Augmentation – Adding synthetic co‑occurrence or addition examples to the pre‑training corpus is an effective, low‑cost method to induce geometric separability without redesigning the architecture.
- Safety & Auditing – Understanding that many models converge on the same numeric encoding helps auditors predict failure modes (e.g., systematic bias for numbers ending in certain digits) across different model families.
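The data‑augmentation idea above can be sketched as a small corpus generator. The function name and output format are our assumptions; the paper does not prescribe a specific recipe:

```python
import random

def addition_examples(count: int, max_n: int = 100, seed: int = 0) -> list[str]:
    # Generate multi-token addition lines to mix into a pretraining corpus,
    # following the finding that explicit arithmetic data boosts mod-T
    # separability. The "a + b = c" format is illustrative only.
    rng = random.Random(seed)
    lines = []
    for _ in range(count):
        a, b = rng.randrange(max_n), rng.randrange(max_n)
        lines.append(f"{a} + {b} = {a + b}")
    return lines

for line in addition_examples(3):
    print(line)
```

Mixing even a small fraction of such lines into the corpus is, per the paper's controlled experiments, enough to move a model from the "sparse only" tier into the linearly separable one.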
Limitations & Future Work
- The study focuses on English‑language corpora and Arabic numerals; it remains open how these findings translate to languages with different numeral systems or to non‑Arabic scripts.
- Probes were limited to mod‑T classification for T = 2, 5, 10; other numeric properties (e.g., magnitude ordering, prime detection) were not examined.
- The theoretical analysis assumes linear classifiers; non‑linear downstream heads could exploit the periodic features differently.
- Future research directions include: extending the analysis to multimodal models (e.g., vision‑language), exploring continual‑learning scenarios where numeric representations might drift, and designing explicit regularizers that enforce desirable periodic structures during training.
Authors
- Deqing Fu
- Tianyi Zhou
- Mikhail Belkin
- Vatsal Sharan
- Robin Jia
Paper Information
- arXiv ID: 2604.20817v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: April 22, 2026
- PDF: Download PDF