[Paper] Emergent Lexical Semantics in Neural Language Models: Testing Martin's Law on LLM-Generated Text

Published: November 26, 2025 at 07:31 AM EST
4 min read
Source: arXiv - 2511.21334v1

Overview

Kai Kugler’s paper investigates whether large language models (LLMs) spontaneously learn a classic linguistic pattern known as Martin’s Law, an inverse relationship between a word’s frequency rank and its number of senses: more frequent words tend to carry more senses (polysemy). By probing the internal representations of several Pythia models during training, the study uncovers a surprising, non‑monotonic emergence of this law, challenging the assumption that “more training = better linguistic fidelity.”

Key Contributions

  • First systematic test of Martin’s Law on LLM‑generated text.
  • Novel sense‑induction pipeline: DBSCAN clustering of contextualized token embeddings to approximate word senses.
  • Longitudinal analysis: 30 checkpoints across four Pythia models (70 M – 1 B parameters).
  • Discovery of a non‑monotonic trajectory: the law peaks early (around checkpoint 104) and then degrades.
  • Model‑size insights: Small models suffer catastrophic semantic collapse; larger models degrade gracefully.
  • Frequency‑specificity trade‑off: Remains stable (≈ –0.3 correlation) across all sizes, indicating a persistent tension between word frequency and contextual specificity.
  • Open methodology: Code and checkpoint data released for reproducibility and future benchmarking.

Methodology

  1. Model selection: Four open‑source Pythia models (70 M, 160 M, 410 M, and 1 B parameters), all trained on the same corpus, were analyzed.
  2. Sampling text: At each of 30 evenly spaced training steps, the model generated a large corpus of sentences (≈ 200 k tokens per checkpoint).
  3. Embedding extraction: For every token occurrence, the model’s final‑layer hidden state (a contextualized embedding) was recorded. (Steps 3–5 are sketched in code after this list.)
  4. Sense induction:
    • Embeddings for the same surface word were clustered using DBSCAN, a density‑based algorithm that automatically determines the number of clusters.
    • Each cluster is interpreted as a distinct “sense” of the word.
  5. Quantifying Martin’s Law:
    • Word frequency was computed from the generated corpus.
    • Polysemy count = number of DBSCAN clusters per word.
    • Pearson correlation (r) between log‑frequency and polysemy was calculated for each checkpoint.
  6. Control analyses: Randomized embeddings and shuffled token orders were used to confirm that observed correlations are not artifacts of the clustering procedure.
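Steps 3–5 are compact enough to sketch in code. The snippet below is a reconstruction from the description above, not the authors' released code: the model choice, the DBSCAN settings (eps, min_samples), and the grouping of occurrences by subword token rather than by surface word are all illustrative assumptions.

```python
from collections import defaultdict

import numpy as np
import torch
from scipy.stats import pearsonr
from sklearn.cluster import DBSCAN
from transformers import AutoModel, AutoTokenizer


def final_layer_embeddings(model, tokenizer, sentences):
    """Step 3: collect the final-layer hidden state of every token occurrence,
    grouped by token type (the paper works at the word level; grouping by
    subword token here is a simplification)."""
    occurrences = defaultdict(list)
    for sentence in sentences:
        enc = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_dim)
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
        for token, vec in zip(tokens, hidden):
            occurrences[token].append(vec.numpy())
    return occurrences


def polysemy_counts(occurrences, eps=0.5, min_samples=5):
    """Step 4: cluster each word's contextual embeddings with DBSCAN and treat
    each cluster as one induced sense. eps and min_samples are placeholders,
    not the paper's settings; DBSCAN's noise label (-1) is not counted."""
    counts = {}
    for word, vecs in occurrences.items():
        if len(vecs) < min_samples:
            continue  # too few occurrences to cluster meaningfully
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit(np.array(vecs)).labels_
        counts[word] = len(set(labels) - {-1})
    return counts


def martins_law_r(occurrences, counts):
    """Step 5: Pearson r between log frequency and induced sense count."""
    words = list(counts)
    log_freq = [np.log(len(occurrences[w])) for w in words]
    senses = [counts[w] for w in words]
    return pearsonr(log_freq, senses)[0]


# Illustrative usage on one checkpoint's generated text:
# tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
# model = AutoModel.from_pretrained("EleutherAI/pythia-160m").eval()
# occ = final_layer_embeddings(model, tokenizer, generated_sentences)
# print(martins_law_r(occ, polysemy_counts(occ)))
```

The step 6 control can reuse martins_law_r unchanged: replace each stored vector with random noise of the same shape and confirm that the correlation collapses toward zero.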

Results & Findings

| Model (params) | Peak r (Martin’s Law) | Checkpoint of peak | Behavior after peak |
| --- | --- | --- | --- |
| 70 M | 0.45 | 103 | Sharp drop → near-zero correlation (semantic collapse) |
| 160 M | 0.52 | 104 | Similar collapse, though less abrupt |
| 410 M | 0.61 | 104 | Gradual decline, still positive at final checkpoint |
| 1 B | 0.63 | 104 | Slow degradation, retains moderate correlation |
  • Non‑monotonic emergence: Correlation rises from near‑zero at early checkpoints, peaks around checkpoint 104, then declines—contrary to the expectation that linguistic regularities continuously improve.
  • Frequency‑specificity trade‑off: Across all models, the correlation between word frequency and contextual specificity stays around –0.3, indicating a stable balancing act that the model never fully resolves (one plausible way to compute this is sketched after this list).
  • Semantic collapse in small models: After the peak, the 70 M and 160 M models lose the ability to distinguish senses, effectively “flattening” polysemy. Larger models preserve a richer sense space longer.
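This summary does not spell out how the paper measures contextual specificity, so the sketch below assumes one plausible operationalization: the mean pairwise cosine similarity of a word's contextual embeddings, where a tighter embedding cloud reads as more specific. Under that assumption, frequent words appear in more varied contexts and score lower, which is consistent with the negative correlation reported above. It reuses the occurrences dictionary from the methodology sketch.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics.pairwise import cosine_similarity


def contextual_specificity(vecs):
    """One assumed specificity measure: mean off-diagonal cosine similarity
    among a word's contextual embeddings (higher = more uniform usage)."""
    sims = cosine_similarity(np.array(vecs))
    n = sims.shape[0]
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))  # drop self-similarities


def frequency_specificity_r(occurrences, min_occurrences=5):
    """Pearson r between log frequency and assumed specificity; the paper
    reports this staying near -0.3 across model sizes."""
    words = [w for w, v in occurrences.items() if len(v) >= min_occurrences]
    log_freq = [np.log(len(occurrences[w])) for w in words]
    spec = [contextual_specificity(occurrences[w]) for w in words]
    return pearsonr(log_freq, spec)[0]
```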

Practical Implications

  • Training schedules: Early‑stage checkpoints may be the sweet spot for applications that rely on nuanced word meanings (e.g., semantic search, sense‑aware translation). Continuing training beyond this window can degrade sense discrimination, especially for smaller models.
  • Model selection: When resources constrain model size, developers should be aware of the semantic collapse risk and possibly fine‑tune on sense‑rich downstream tasks to recover polysemy.
  • Evaluation metrics: Martin’s Law can serve as a diagnostic probe for emergent linguistic structure, complementing traditional perplexity or downstream benchmark scores (a checkpoint‑selection sketch follows this list).
  • Prompt engineering: Knowing that LLMs exhibit a temporary “optimal semantic window,” developers might schedule prompt‑based inference at checkpoints that align with peak polysemy if they have access to intermediate checkpoints (e.g., in research‑grade pipelines).
  • Safety & bias: A collapse in sense representation could lead to over‑generalization of frequent words, potentially amplifying bias or reducing interpretability. Monitoring polysemy could become part of model‑governance toolkits.
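Because Pythia publishes its intermediate checkpoints as Hugging Face revisions (revision="step<N>"), the Martin's-Law correlation can double as a cheap checkpoint-selection diagnostic, as the evaluation-metrics bullet above suggests. The sketch below reuses the helpers from the methodology section; the step list is illustrative rather than the paper's 30-checkpoint schedule, and it probes a fixed text sample instead of text generated by each checkpoint, a simplification of the paper's setup.

```python
from transformers import AutoModel, AutoTokenizer


def best_semantic_checkpoint(model_name, steps, sentences):
    """Score Martin's-Law r at each intermediate checkpoint and return the
    step with the peak. Reuses final_layer_embeddings, polysemy_counts, and
    martins_law_r from the methodology sketch."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    scores = {}
    for step in steps:
        model = AutoModel.from_pretrained(model_name, revision=f"step{step}").eval()
        occ = final_layer_embeddings(model, tokenizer, sentences)
        scores[step] = martins_law_r(occ, polysemy_counts(occ))
    return max(scores, key=scores.get), scores


# Illustrative usage -- step numbers chosen arbitrarily:
# peak_step, curve = best_semantic_checkpoint(
#     "EleutherAI/pythia-160m", [1000, 16000, 64000, 143000], sample_sentences)
```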

Limitations & Future Work

  • Sense approximation: DBSCAN clustering of embeddings is an indirect proxy for human‑annotated senses; the method may conflate subtle contextual shifts with genuine polysemy.
  • Single architecture: Only the Pythia family (decoder‑only transformers) was examined; results may differ for other architectures or training regimes (e.g., encoder‑decoder models).
  • Corpus dependency: The generated text mirrors the training data distribution; applying the methodology to domain‑specific corpora could yield different trajectories.

Future directions

  • Validate the approach against gold‑standard sense inventories (WordNet, BabelNet).
  • Extend analysis to multimodal models and instruction‑tuned LLMs.
  • Investigate interventions (e.g., auxiliary sense‑disambiguation objectives) that could sustain or improve Martin’s Law throughout training.