[Paper] Emergent Lexical Semantics in Neural Language Models: Testing Martin's Law on LLM-Generated Text

Published: November 26, 2025 at 07:31 AM EST
4 min read
Source: arXiv - 2511.21334v1

Overview

Kai Kugler’s paper investigates whether large language models (LLMs) spontaneously learn a classic linguistic pattern known as Martin’s Law, an inverse relationship between a word’s frequency rank and its number of senses: more frequent words tend to carry more senses (polysemy). By probing the internal representations of several Pythia models during training, the study uncovers a surprising, non‑monotonic emergence of this law, challenging the assumption that “more training = better linguistic fidelity.”

Key Contributions

  • First systematic test of Martin’s Law on LLM‑generated text.
  • Novel sense‑induction pipeline: DBSCAN clustering of contextualized token embeddings to approximate word senses.
  • Longitudinal analysis: 30 checkpoints across four Pythia models (70 M – 1 B parameters).
  • Discovery of a non‑monotonic trajectory: the law peaks early (around checkpoint 104) and then degrades.
  • Model‑size insights: Small models suffer catastrophic semantic collapse; larger models degrade gracefully.
  • Frequency‑specificity trade‑off: Remains stable (≈ –0.3 correlation) across all sizes, indicating a persistent tension between word frequency and contextual specificity.
  • Open methodology: Code and checkpoint data released for reproducibility and future benchmarking.

Methodology

  1. Model selection: Four open‑source Pythia models (70 M, 160 M, 410 M, and 1 B parameters), all trained on the same corpus, were analyzed.
  2. Sampling text: At each of 30 evenly spaced training steps, the model generated a large corpus of sentences (≈ 200 k tokens per checkpoint).
  3. Embedding extraction: For every token occurrence, the model’s final‑layer hidden state (a contextualized embedding) was recorded. (Steps 3–5 are sketched in code after this list.)
  4. Sense induction:
    • Embeddings for the same surface word were clustered using DBSCAN, a density‑based algorithm that automatically determines the number of clusters.
    • Each cluster is interpreted as a distinct “sense” of the word.
  5. Quantifying Martin’s Law:
    • Word frequency was computed from the generated corpus.
    • Polysemy count = number of DBSCAN clusters per word.
    • Pearson correlation (r) between log‑frequency and polysemy was calculated for each checkpoint.
  6. Control analyses: Randomized embeddings and shuffled token orders were used to confirm that observed correlations are not artifacts of the clustering procedure.
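Steps 3–5 are compact enough to sketch in code. The snippet below is a reconstruction from the description above, not the authors' released code: the model choice, the DBSCAN settings (eps, min_samples), and the grouping of occurrences by subword token rather than by surface word are all illustrative assumptions.

```python
from collections import defaultdict

import numpy as np
import torch
from scipy.stats import pearsonr
from sklearn.cluster import DBSCAN
from transformers import AutoModel, AutoTokenizer


def final_layer_embeddings(model, tokenizer, sentences):
    """Step 3: collect the final-layer hidden state of every token occurrence,
    grouped by token type (the paper works at the word level; grouping by
    subword token here is a simplification)."""
    occurrences = defaultdict(list)
    for sentence in sentences:
        enc = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_dim)
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
        for token, vec in zip(tokens, hidden):
            occurrences[token].append(vec.numpy())
    return occurrences


def polysemy_counts(occurrences, eps=0.5, min_samples=5):
    """Step 4: cluster each word's contextual embeddings with DBSCAN and treat
    each cluster as one induced sense. eps and min_samples are placeholders,
    not the paper's settings; DBSCAN's noise label (-1) is not counted."""
    counts = {}
    for word, vecs in occurrences.items():
        if len(vecs) < min_samples:
            continue  # too few occurrences to cluster meaningfully
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit(np.array(vecs)).labels_
        counts[word] = len(set(labels) - {-1})
    return counts


def martins_law_r(occurrences, counts):
    """Step 5: Pearson r between log frequency and induced sense count."""
    words = list(counts)
    log_freq = [np.log(len(occurrences[w])) for w in words]
    senses = [counts[w] for w in words]
    return pearsonr(log_freq, senses)[0]


# Illustrative usage on one checkpoint's generated text:
# tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
# model = AutoModel.from_pretrained("EleutherAI/pythia-160m").eval()
# occ = final_layer_embeddings(model, tokenizer, generated_sentences)
# print(martins_law_r(occ, polysemy_counts(occ)))
```

The step 6 control can reuse martins_law_r unchanged: replace each stored vector with random noise of the same shape and confirm that the correlation collapses toward zero.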

Results & Findings

| Model (params) | Peak r (Martin’s Law) | Checkpoint of peak | Behavior after peak |
| --- | --- | --- | --- |
| 70 M | 0.45 | 103 | Sharp drop → near-zero correlation (semantic collapse) |
| 160 M | 0.52 | 104 | Similar collapse, though less abrupt |
| 410 M | 0.61 | 104 | Gradual decline, still positive at final checkpoint |
| 1 B | 0.63 | 104 | Slow degradation, retains moderate correlation |
  • Non‑monotonic emergence: Correlation rises from near‑zero at early checkpoints, peaks around checkpoint 104, then declines—contrary to the expectation that linguistic regularities continuously improve.
  • Frequency‑specificity trade‑off: Across all models, the correlation between word frequency and contextual specificity stays around –0.3, indicating a stable balancing act that the model never fully resolves (one plausible way to compute this is sketched after this list).
  • Semantic collapse in small models: After the peak, the 70 M and 160 M models lose the ability to distinguish senses, effectively “flattening” polysemy. Larger models preserve a richer sense space longer.
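This summary does not spell out how the paper measures contextual specificity, so the sketch below assumes one plausible operationalization: the mean pairwise cosine similarity of a word's contextual embeddings, where a tighter embedding cloud reads as more specific. Under that assumption, frequent words appear in more varied contexts and score lower, which is consistent with the negative correlation reported above. It reuses the occurrences dictionary from the methodology sketch.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics.pairwise import cosine_similarity


def contextual_specificity(vecs):
    """One assumed specificity measure: mean off-diagonal cosine similarity
    among a word's contextual embeddings (higher = more uniform usage)."""
    sims = cosine_similarity(np.array(vecs))
    n = sims.shape[0]
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))  # drop self-similarities


def frequency_specificity_r(occurrences, min_occurrences=5):
    """Pearson r between log frequency and assumed specificity; the paper
    reports this staying near -0.3 across model sizes."""
    words = [w for w, v in occurrences.items() if len(v) >= min_occurrences]
    log_freq = [np.log(len(occurrences[w])) for w in words]
    spec = [contextual_specificity(occurrences[w]) for w in words]
    return pearsonr(log_freq, spec)[0]
```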

Practical Implications

  • Training schedules: Early‑stage checkpoints may be the sweet spot for applications that rely on nuanced word meanings (e.g., semantic search, sense‑aware translation). Continuing training beyond this window can degrade sense discrimination, especially for smaller models.
  • Model selection: When resources constrain model size, developers should be aware of the semantic collapse risk and possibly fine‑tune on sense‑rich downstream tasks to recover polysemy.
  • Evaluation metrics: Martin’s Law can serve as a diagnostic probe for emergent linguistic structure, complementing traditional perplexity or downstream benchmark scores (a checkpoint‑selection sketch follows this list).
  • Prompt engineering: Knowing that LLMs exhibit a temporary “optimal semantic window,” developers might schedule prompt‑based inference at checkpoints that align with peak polysemy if they have access to intermediate checkpoints (e.g., in research‑grade pipelines).
  • Safety & bias: A collapse in sense representation could lead to over‑generalization of frequent words, potentially amplifying bias or reducing interpretability. Monitoring polysemy could become part of model‑governance toolkits.
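Because Pythia publishes its intermediate checkpoints as Hugging Face revisions (revision="step<N>"), the Martin's-Law correlation can double as a cheap checkpoint-selection diagnostic, as the evaluation-metrics bullet above suggests. The sketch below reuses the helpers from the methodology section; the step list is illustrative rather than the paper's 30-checkpoint schedule, and it probes a fixed text sample instead of text generated by each checkpoint, a simplification of the paper's setup.

```python
from transformers import AutoModel, AutoTokenizer


def best_semantic_checkpoint(model_name, steps, sentences):
    """Score Martin's-Law r at each intermediate checkpoint and return the
    step with the peak. Reuses final_layer_embeddings, polysemy_counts, and
    martins_law_r from the methodology sketch."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    scores = {}
    for step in steps:
        model = AutoModel.from_pretrained(model_name, revision=f"step{step}").eval()
        occ = final_layer_embeddings(model, tokenizer, sentences)
        scores[step] = martins_law_r(occ, polysemy_counts(occ))
    return max(scores, key=scores.get), scores


# Illustrative usage -- step numbers chosen arbitrarily:
# peak_step, curve = best_semantic_checkpoint(
#     "EleutherAI/pythia-160m", [1000, 16000, 64000, 143000], sample_sentences)
```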

Limitations & Future Work

  • Sense approximation: DBSCAN clustering of embeddings is an indirect proxy for human‑annotated senses; the method may conflate subtle contextual shifts with genuine polysemy.
  • Single architecture: Only the Pythia family (decoder‑only transformers) was examined; results may differ for other architectures or training regimes (e.g., encoder‑decoder models).
  • Corpus dependency: The generated text mirrors the training data distribution; applying the methodology to domain‑specific corpora could yield different trajectories.

Future directions

  • Validate the approach against gold‑standard sense inventories (WordNet, BabelNet).
  • Extend analysis to multimodal models and instruction‑tuned LLMs.
  • Investigate interventions (e.g., auxiliary sense‑disambiguation objectives) that could sustain or improve Martin’s Law throughout training.