[Paper] Zipf Distributions from Two-Stage Symbolic Processes: Stability Under Stochastic Lexical Filtering

Published: November 25, 2025 at 11:59 PM EST
4 min read
Source: arXiv - 2511.21060v1

Overview

Vladimir Berman’s paper tackles a classic puzzle in computational linguistics: why word frequencies in natural language follow Zipf’s law (a rank‑frequency power‑law). Rather than invoking communication efficiency or cognitive constraints, the work shows that a purely geometric, two‑stage symbolic process can generate Zipf‑like distributions. The result is a simple, mathematically grounded model that reproduces the frequency patterns observed in English, Russian, and mixed‑genre corpora.

Key Contributions

  • Full Combinatorial Word Model (FCWM): Introduces a generative process that builds words from a finite alphabet plus a “blank” symbol, yielding a geometric distribution of word lengths (a minimal generator sketch follows this list).
  • Two‑stage stochastic filtering: Shows how a second stochastic step—lexical filtering that discards some generated strings—transforms the geometric length distribution into a power‑law rank‑frequency curve.
  • Closed‑form relationship: Derives an explicit formula linking the Zipf exponent to the alphabet size and the probability of the blank symbol.
  • Empirical validation: Provides extensive simulations and fits to real corpora (English, Russian, mixed‑genre) that match the theoretical predictions without any language‑specific tuning.
  • Conceptual shift: Argues that Zipf‑type laws can emerge from purely combinatorial constraints, challenging explanations that rely on communicative optimality.
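
As a concrete reading of the first two bullets, here is a minimal Python sketch of the Stage‑1 generator: letters are drawn uniformly from a k‑symbol alphabet and the word ends with probability p after each letter, so lengths are geometric. The function name, the uniform‑letter assumption, and the parameter values are illustrative choices, not taken from the paper.

```python
# Minimal sketch of FCWM Stage 1 as summarized above (an assumed reading,
# not the paper's code): emit letters from a k-symbol alphabet and end the
# word with probability p after each letter, giving geometric word lengths.
import random
import string

def generate_word(k=26, p=0.18, rng=random):
    """Emit at least one letter; after each letter, stop with probability p."""
    alphabet = string.ascii_lowercase[:k]
    letters = [rng.choice(alphabet)]
    while rng.random() >= p:              # continue with probability 1 - p
        letters.append(rng.choice(alphabet))
    return "".join(letters)

if __name__ == "__main__":
    rng = random.Random(0)
    print([generate_word(rng=rng) for _ in range(8)])
```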

Methodology

  1. Word generation (Stage 1):

    • Start with a finite alphabet A of size k and a special “blank” token □.

    • Produce a sequence of symbols by repeatedly drawing from A ∪ {□} with fixed probabilities.

    • The process stops when a blank is drawn, so the length of the generated string follows a geometric distribution:

      \[ P(\ell) = (1-p)^{\ell-1} p, \]

      where p is the probability of drawing the blank.

  2. Lexical filtering (Stage 2):

    • Not every generated string becomes a “word” in the lexicon. The model applies a stochastic filter that retains each string with probability proportional to an exponential “force” that depends on its length.
    • This filtering step introduces an exponential bias that, when combined with the geometric length distribution, yields a power‑law distribution over the ranks of the surviving strings.
  3. Analytical derivation:

    • By treating the two exponential factors (geometric length and filtering bias) as interacting forces, the author derives a rank‑frequency relation of the form

      \[ f(r) \propto r^{-\alpha}, \]

      where the exponent \(\alpha\) is a simple function of k and p.

  4. Simulation & empirical fitting:

    • Large‑scale Monte‑Carlo simulations generate synthetic corpora under various (k, p) settings.
    • The synthetic rank‑frequency curves are compared to real‑world corpora using standard goodness‑of‑fit metrics (Kolmogorov–Smirnov, \(R^{2}\)); a runnable sketch of the two stages and the slope fit follows this list.
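
To make the pipeline above tangible, the sketch below runs both stages on synthetic data and fits a log‑log slope to the surviving rank‑frequency curve. The exponential filter `q ** length` (one accept/reject decision per distinct string), the log‑spaced slope fit, and every parameter value are assumptions chosen for illustration; the paper's actual filter, estimator, and closed‑form exponent may differ.

```python
# Hedged end-to-end sketch of the two-stage process: Stage 1 produces strings
# with geometric lengths; Stage 2 applies an illustrative lexical filter that
# keeps each distinct string with probability q ** length; the surviving
# rank-frequency curve is then fit with a log-log least-squares slope.
from collections import Counter
import math
import random
import string

def simulate_corpus(n_draws=300_000, k=26, p=0.18, q=0.8, seed=0):
    """Count the generated strings that pass the length-dependent lexical filter."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase[:k]
    in_lexicon = {}                       # one filtering decision per distinct string
    counts = Counter()
    for _ in range(n_draws):
        letters = [rng.choice(alphabet)]          # Stage 1: at least one letter,
        while rng.random() >= p:                  # continue with probability 1 - p
            letters.append(rng.choice(alphabet))
        word = "".join(letters)
        if word not in in_lexicon:                # Stage 2: stochastic lexical filter
            in_lexicon[word] = rng.random() < q ** len(word)
        if in_lexicon[word]:
            counts[word] += 1
    return counts

def zipf_slope(counts, min_count=5, n_points=40):
    """Slope of log(frequency) vs. log(rank) at log-spaced ranks, ignoring
    types seen fewer than min_count times (the noisy tail)."""
    freqs = [f for f in sorted(counts.values(), reverse=True) if f >= min_count]
    max_rank = len(freqs)
    ranks = sorted({min(max_rank, round(math.exp(i * math.log(max_rank) / (n_points - 1))))
                    for i in range(n_points)})
    xs = [math.log(r) for r in ranks]
    ys = [math.log(freqs[r - 1]) for r in ranks]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den                      # roughly -alpha for a Zipf-like curve

if __name__ == "__main__":
    counts = simulate_corpus()
    print(f"fitted rank-frequency slope ≈ {zipf_slope(counts):.2f}")
```

With these toy settings the fitted slope typically lands a little above 1 in magnitude, in the same ballpark as the Zipf slopes discussed in the next section, though the exact value depends on k, p, q, the corpus size, and the fitting window.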

Results & Findings

  • Theoretical exponent matches data: For English (≈26 letters + space) and Russian (≈33 Cyrillic letters + space), the predicted \(\alpha\) values (≈1.0–1.2) align closely with the empirically observed Zipf slopes.
  • Robustness across genres: When mixing news, literature, and technical text, the model still captures the overall power‑law shape, indicating that the mechanism is genre‑agnostic.
  • Parameter sensitivity: Varying the blank probability p shifts the exponent smoothly; higher p (more frequent blanks) leads to steeper slopes, matching the intuition that shorter average word lengths produce a sharper frequency drop‑off. A quick numerical sweep illustrating this appears after this list.
  • No need for linguistic priors: The model reproduces Zipf’s law without any assumptions about meaning, syntax, or communicative cost, suggesting that the law may be a by‑product of combinatorial constraints.
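
As a quick, informal check of the parameter‑sensitivity point, the loop below reuses simulate_corpus and zipf_slope from the Methodology sketch (assumed to be in scope) and sweeps the blank probability. The printed baseline, 1 + ln(1/(1−p))/ln k, is the exponent of the classic random‑typing model; it is shown only as a familiar reference, not as the paper's closed‑form result, and with the Stage‑2 filter switched on the fitted values will generally sit a bit above it.

```python
# Informal sensitivity sweep over the blank probability p, reusing
# simulate_corpus and zipf_slope from the Methodology sketch (assumed in
# scope). The baseline column is the classic random-typing exponent,
# 1 + ln(1/(1-p)) / ln(k): a familiar reference, not the paper's formula.
import math

K = 26
for p in (0.10, 0.30, 0.50):
    counts = simulate_corpus(k=K, p=p, seed=1)
    fitted = -zipf_slope(counts)
    baseline = 1 + math.log(1 / (1 - p)) / math.log(K)
    print(f"p={p:.2f}  fitted alpha ≈ {fitted:.2f}  random-typing baseline ≈ {baseline:.2f}")
```

In a single modest‑sized run the upward drift with p can be partly masked by fitting noise and the low‑count tail, which is one reason the paper leans on large simulations and formal goodness‑of‑fit tests rather than a crude sweep like this.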

Practical Implications

  • Synthetic text generation: Developers building language models or test corpora can use the FCWM to generate realistic word‑frequency distributions without needing large real datasets (see the usage sketch after this list).
  • Vocabulary sizing for NLP pipelines: The explicit relationship between alphabet size, blank probability, and Zipf exponent can help estimate the expected vocabulary growth when expanding token sets (e.g., adding subword units).
  • Compression & storage optimization: Understanding that Zipf‑like skew can arise from simple combinatorial processes informs better entropy coding schemes for token streams, especially in low‑resource or domain‑specific settings.
  • Benchmark design: When evaluating language‑model robustness, synthetic benchmarks derived from FCWM can isolate the effect of frequency distribution from higher‑order linguistic structure.
  • Cross‑lingual transfer: Since the model abstracts away from language‑specific rules, it can serve as a neutral baseline for comparing frequency dynamics across languages, aiding multilingual tokenization strategies.
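
For the synthetic‑text and benchmark points above, a minimal usage pattern is to turn simulated counts into a frequency table and a shuffled token stream that carries FCWM‑style frequency statistics but no syntax or semantics. simulate_corpus from the Methodology sketch is assumed to be in scope; the helper below is an illustrative convenience, not something defined in the paper.

```python
# Illustrative helper: flatten simulated {word: count} data into a shuffled
# token stream for frequency-only benchmarks or tokenizer stress tests.
# Assumes simulate_corpus from the earlier sketch is available.
import random

def synthetic_token_stream(counts, seed=0):
    """Expand a word -> count table into a randomly ordered list of tokens."""
    rng = random.Random(seed)
    stream = [word for word, c in counts.items() for _ in range(c)]
    rng.shuffle(stream)
    return stream

counts = simulate_corpus(n_draws=50_000, seed=2)
stream = synthetic_token_stream(counts)
print(f"{len(counts)} word types, {len(stream)} tokens")
print("most frequent:", counts.most_common(5))
```

Because the stream carries only frequency structure, any gap between model behavior on it and on real text can be attributed to higher‑order linguistic properties rather than to the Zipfian skew itself.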

Limitations & Future Work

  • No semantic component: The model treats all generated strings as equally meaningful, which limits its ability to explain phenomena that depend on semantics (e.g., word‑sense disambiguation, topic modeling).
  • Fixed alphabet assumption: Real languages evolve their orthographies; extending the model to dynamic or hierarchical alphabets (e.g., Unicode grapheme clusters) could improve realism.
  • Lexical filtering simplification: The stochastic filter is a proxy for morphological and phonotactic constraints; future work could replace it with linguistically informed constraints to bridge the gap between pure combinatorics and actual word formation rules.
  • Empirical breadth: While English and Russian are covered, testing the model on languages whose writing systems depart from simple alphabets (e.g., Chinese characters, the Arabic abjad) would assess its universality.

Bottom line: Berman’s two‑stage symbolic process offers a clean, mathematically tractable explanation for Zipf’s law that resonates with developers looking for principled ways to model word‑frequency behavior, generate synthetic corpora, or reason about vocabulary dynamics in modern NLP systems.

Authors

  • Vladimir Berman

Paper Information

  • arXiv ID: 2511.21060v1
  • Categories: stat.ME, cs.CL, stat.ML
  • Published: November 26, 2025