[Paper] World Properties without World Models: Recovering Spatial and Temporal Structure from Co-occurrence Statistics in Static Word Embeddings

Published: March 4, 2026 at 12:37 PM EST
4 min read
Source: arXiv


Overview

The paper investigates whether the impressive ability of large language models (LLMs) to “know” where cities are or when historical figures lived actually stems from the models themselves, or simply from patterns already embedded in the raw text. By probing classic static word embeddings (GloVe and Word2Vec) with linear regression, the author shows that a surprising amount of geographic and temporal information can be extracted without any deep neural architecture—suggesting that much of the “world knowledge” is already latent in word co‑occurrence statistics.

Key Contributions

  • Demonstrates recoverability of world facts (city coordinates, birth years) from static embeddings using ridge‑regression probes.
  • Quantifies the signal strength: held‑out R² of 0.71–0.84 for geographic location and 0.48–0.52 for birth‑year prediction.
  • Identifies lexical gradients (e.g., country names, climate‑related terms) as the primary carriers of spatial/temporal information.
  • Shows that linear probe performance alone is insufficient evidence for “world‑model” representations in LLMs.
  • Provides a systematic analysis pipeline (semantic‑neighbor checks, subspace ablations) that can be reused for other probing studies.

Methodology

  1. Embeddings – The study uses two widely‑used static models: GloVe (trained on Common Crawl) and Word2Vec (trained on Google News).
  2. Target variables
    • Geography: latitude and longitude of 1,000+ world cities (ground‑truth from GeoNames).
    • Time: birth year of 2,000+ notable historical figures (sourced from Wikipedia infoboxes).
  3. Linear probing – For each target variable, a ridge‑regression model is trained on a random 80% split of the word vectors and evaluated on the held‑out 20%. The R² score measures how much of the target's variance the embedding space explains.
  4. Interpretability checks
    • Semantic‑neighbor analysis: nearest‑neighbor words to a city’s vector are inspected to see whether they form a geographic gradient (e.g., “Paris” close to “Berlin”, “Rome”).
    • Subspace ablation: dimensions most correlated with country names or climate terms are zeroed out to test how much the probe’s performance drops, revealing which lexical features drive the signal.
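
The semantic‑neighbor check above can be illustrated with a toy cosine‑similarity computation. The vectors and five‑word vocabulary below are fabricated for demonstration (a real run would load pre‑trained GloVe or Word2Vec vectors); the point is only the mechanics of ranking neighbors:

```python
# Toy semantic-neighbor check: European capitals share a common base vector,
# US cities share another, so nearest neighbors form a "geographic" cluster.
# All vectors here are synthetic stand-ins for real pre-trained embeddings.
import numpy as np

rng = np.random.default_rng(2)
base_eu = rng.normal(size=50)
base_us = rng.normal(size=50)
vocab = {
    "paris":   base_eu + 0.3 * rng.normal(size=50),
    "berlin":  base_eu + 0.3 * rng.normal(size=50),
    "rome":    base_eu + 0.3 * rng.normal(size=50),
    "chicago": base_us + 0.3 * rng.normal(size=50),
    "houston": base_us + 0.3 * rng.normal(size=50),
}

def nearest(word, k=2):
    """Return the k most cosine-similar vocabulary words to `word`."""
    q = vocab[word]
    sims = {w: np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))
            for w, v in vocab.items() if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:k]

print(nearest("paris"))  # the other European capitals rank first
```

In real embeddings the same pattern emerges from co‑occurrence alone: "Paris" ends up nearer to "Berlin" and "Rome" than to "Chicago", which is exactly the gradient the paper inspects.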

All steps are implemented with standard Python libraries (NumPy, scikit‑learn) and require only the pre‑trained static embeddings—no fine‑tuning or massive compute.
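
The probing pipeline above can be sketched in a few lines of scikit‑learn. Synthetic 300‑dimensional vectors stand in for real city embeddings, and the latitude/longitude targets are planted linearly in a few dimensions so the probe has something to recover (the paper's actual data preparation from GeoNames is not reproduced here):

```python
# Sketch of the ridge-regression probing setup: 80/20 split, held-out R^2.
# X stands in for 1,000 city embeddings; y for (latitude, longitude) targets.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))                       # fake 300-d embeddings
# Plant a linear signal in ~5% of the dimensions, plus noise.
W = rng.normal(size=(300, 2)) * (rng.random(300) < 0.05)[:, None]
y = X @ W + rng.normal(scale=0.1, size=(1000, 2))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
r2 = r2_score(y_te, probe.predict(X_te))
print(f"held-out R^2: {r2:.2f}")
```

On real embeddings the same two calls (`Ridge.fit`, `r2_score` on the held‑out split) yield the R² figures reported in the results table.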

Results & Findings

| Target | Embedding | Held‑out R² | Main driver of signal |
|---|---|---|---|
| City coordinates | GloVe | 0.84 | Country‑name gradients, climate vocab (e.g., “tundra”, “desert”) |
| City coordinates | Word2Vec | 0.71 | Same lexical gradients, slightly weaker |
| Birth year | GloVe | 0.52 | Historical‑period terms (e.g., “Renaissance”, “Industrial”) |
| Birth year | Word2Vec | 0.48 | Similar temporal vocab, lower magnitude |

Ablation experiments show that removing the dimensions aligned with country names drops geographic R² by ~30%, confirming that these lexical cues are the backbone of the recovered structure. Temporal probes are less sensitive to any single lexical group, indicating a more diffuse signal.
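
The ablation procedure can be sketched as: rank dimensions by correlation with a lexical cue, zero out the top‑ranked ones, and re‑fit the probe. The data below is synthetic (the "country‑name" signal is planted in dimensions 0–9 by construction), so only the mechanics, not the paper's numbers, carry over:

```python
# Sketch of subspace ablation: zero the dimensions most correlated with a
# lexical cue and measure the drop in held-out R^2. Synthetic stand-in data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 300))
y = X[:, :10].sum(axis=1) + rng.normal(scale=0.5, size=1000)  # signal in dims 0-9

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
base = Ridge(alpha=1.0).fit(X_tr, y_tr).score(X_te, y_te)

# Rank dimensions by |correlation| with the target, ablate the top 10.
corr = np.abs([np.corrcoef(X_tr[:, d], y_tr)[0, 1] for d in range(300)])
ablate = np.argsort(corr)[-10:]
X_tr_a, X_te_a = X_tr.copy(), X_te.copy()
X_tr_a[:, ablate] = 0.0
X_te_a[:, ablate] = 0.0
ablated = Ridge(alpha=1.0).fit(X_tr_a, y_tr).score(X_te_a, y_te)
print(f"R^2 before: {base:.2f}, after ablation: {ablated:.2f}")
```

A large drop after ablation, as in the paper's geographic probes, indicates the signal is concentrated in the ablated subspace; a small drop, as with the temporal probes, indicates a diffuse signal.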

Practical Implications

  • Feature engineering for downstream NLP – Simple static embeddings can serve as a cheap source of location or era cues for tasks like geotagging, historical text analysis, or recommendation systems without resorting to heavyweight LLMs.
  • Benchmark design – Researchers should treat linear probe performance on static embeddings as a baseline; claiming “world‑model” capabilities for LLMs requires stronger evidence (e.g., non‑linear probing, causal interventions).
  • Data‑driven lexicon building – The identified lexical gradients can be harvested to create domain‑specific gazetteers or temporal vocabularies for low‑resource languages where large LLMs are unavailable.
  • Model interpretability tools – The subspace‑ablation technique offers a lightweight way to diagnose which word groups a model relies on for a given prediction, useful for debugging bias (e.g., over‑reliance on country names).

Limitations & Future Work

  • Static embeddings are limited to the training corpus; biases or gaps in the source text directly affect the recoverable world knowledge.
  • Temporal resolution is coarse—the probe captures only broad birth‑year trends, not fine‑grained historical events.
  • Only linear probes were examined; non‑linear or attention‑based probes might uncover additional structure or confirm that LLMs truly go beyond co‑occurrence statistics.
  • Geographic scope is constrained to well‑documented cities; extending to rural or indigenous place names could test the limits of lexical gradients.

Future research could combine static‑embedding baselines with controlled LLM experiments, explore multilingual corpora, and develop probing methods that isolate genuine reasoning from statistical memorization.

Authors

  • Elan Barenholtz

Paper Information

  • arXiv ID: 2603.04317v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: March 4, 2026
