[Paper] World Properties without World Models: Recovering Spatial and Temporal Structure from Co-occurrence Statistics in Static Word Embeddings

Published: March 4, 2026 at 12:37 PM EST
4 min read
Source: arXiv


Overview

The paper investigates whether the impressive ability of large language models (LLMs) to “know” where cities are or when historical figures lived actually stems from the models themselves, or simply from patterns already embedded in the raw text. By probing classic static word embeddings (GloVe and Word2Vec) with linear regression, the author shows that a surprising amount of geographic and temporal information can be extracted without any deep neural architecture—suggesting that much of the “world knowledge” is already latent in word co‑occurrence statistics.

Key Contributions

  • Demonstrates recoverability of world facts (city coordinates, birth years) from static embeddings using ridge‑regression probes.
  • Quantifies the signal strength: held‑out R² of 0.71–0.84 for geographic location and 0.48–0.52 for birth‑year prediction.
  • Identifies lexical gradients (e.g., country names, climate‑related terms) as the primary carriers of spatial/temporal information.
  • Shows that linear probe performance alone is insufficient evidence for “world‑model” representations in LLMs.
  • Provides a systematic analysis pipeline (semantic‑neighbor checks, subspace ablations) that can be reused for other probing studies.

Methodology

  1. Embeddings – The study uses two widely‑used static models: GloVe (trained on Common Crawl) and Word2Vec (trained on Google News).
  2. Target variables
    • Geography: latitude and longitude of 1,000+ world cities (ground‑truth from GeoNames).
    • Time: birth year of 2,000+ notable historical figures (sourced from Wikipedia infoboxes).
  3. Linear probing – For each target variable, a ridge‑regression model is trained on a random 80% split of the word vectors and evaluated on the held‑out 20%. The R² score measures how much of the target's variance the embedding space explains.
  4. Interpretability checks
    • Semantic‑neighbor analysis: nearest‑neighbor words to a city’s vector are inspected to see whether they form a geographic gradient (e.g., “Paris” close to “Berlin”, “Rome”).
    • Subspace ablation: dimensions most correlated with country names or climate terms are zeroed out to test how much the probe’s performance drops, revealing which lexical features drive the signal.
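
The semantic‑neighbor check above can be illustrated with a toy cosine‑similarity computation. The vectors and five‑word vocabulary below are fabricated for demonstration (a real run would load pre‑trained GloVe or Word2Vec vectors); the point is only the mechanics of ranking neighbors:

```python
# Toy semantic-neighbor check: European capitals share a common base vector,
# US cities share another, so nearest neighbors form a "geographic" cluster.
# All vectors here are synthetic stand-ins for real pre-trained embeddings.
import numpy as np

rng = np.random.default_rng(2)
base_eu = rng.normal(size=50)
base_us = rng.normal(size=50)
vocab = {
    "paris":   base_eu + 0.3 * rng.normal(size=50),
    "berlin":  base_eu + 0.3 * rng.normal(size=50),
    "rome":    base_eu + 0.3 * rng.normal(size=50),
    "chicago": base_us + 0.3 * rng.normal(size=50),
    "houston": base_us + 0.3 * rng.normal(size=50),
}

def nearest(word, k=2):
    """Return the k most cosine-similar vocabulary words to `word`."""
    q = vocab[word]
    sims = {w: np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))
            for w, v in vocab.items() if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:k]

print(nearest("paris"))  # the other European capitals rank first
```

In real embeddings the same pattern emerges from co‑occurrence alone: "Paris" ends up nearer to "Berlin" and "Rome" than to "Chicago", which is exactly the gradient the paper inspects.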

All steps are implemented with standard Python libraries (NumPy, scikit‑learn) and require only the pre‑trained static embeddings—no fine‑tuning or massive compute.
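
The probing pipeline above can be sketched in a few lines of scikit‑learn. Synthetic 300‑dimensional vectors stand in for real city embeddings, and the latitude/longitude targets are planted linearly in a few dimensions so the probe has something to recover (the paper's actual data preparation from GeoNames is not reproduced here):

```python
# Sketch of the ridge-regression probing setup: 80/20 split, held-out R^2.
# X stands in for 1,000 city embeddings; y for (latitude, longitude) targets.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))                       # fake 300-d embeddings
# Plant a linear signal in ~5% of the dimensions, plus noise.
W = rng.normal(size=(300, 2)) * (rng.random(300) < 0.05)[:, None]
y = X @ W + rng.normal(scale=0.1, size=(1000, 2))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
r2 = r2_score(y_te, probe.predict(X_te))
print(f"held-out R^2: {r2:.2f}")
```

On real embeddings the same two calls (`Ridge.fit`, `r2_score` on the held‑out split) yield the R² figures reported in the results table.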

Results & Findings

| Target | Embedding | Held‑out R² | Main driver of signal |
|---|---|---|---|
| City coordinates | GloVe | 0.84 | Country‑name gradients, climate vocab (e.g., “tundra”, “desert”) |
| City coordinates | Word2Vec | 0.71 | Same lexical gradients, slightly weaker |
| Birth year | GloVe | 0.52 | Historical‑period terms (e.g., “Renaissance”, “Industrial”) |
| Birth year | Word2Vec | 0.48 | Similar temporal vocab, lower magnitude |

Ablation experiments show that removing the dimensions aligned with country names drops geographic R² by ~30%, confirming that these lexical cues are the backbone of the recovered structure. Temporal probes are less sensitive to any single lexical group, indicating a more diffuse signal.
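
The ablation procedure can be sketched as: rank dimensions by correlation with a lexical cue, zero out the top‑ranked ones, and re‑fit the probe. The data below is synthetic (the "country‑name" signal is planted in dimensions 0–9 by construction), so only the mechanics, not the paper's numbers, carry over:

```python
# Sketch of subspace ablation: zero the dimensions most correlated with a
# lexical cue and measure the drop in held-out R^2. Synthetic stand-in data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 300))
y = X[:, :10].sum(axis=1) + rng.normal(scale=0.5, size=1000)  # signal in dims 0-9

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
base = Ridge(alpha=1.0).fit(X_tr, y_tr).score(X_te, y_te)

# Rank dimensions by |correlation| with the target, ablate the top 10.
corr = np.abs([np.corrcoef(X_tr[:, d], y_tr)[0, 1] for d in range(300)])
ablate = np.argsort(corr)[-10:]
X_tr_a, X_te_a = X_tr.copy(), X_te.copy()
X_tr_a[:, ablate] = 0.0
X_te_a[:, ablate] = 0.0
ablated = Ridge(alpha=1.0).fit(X_tr_a, y_tr).score(X_te_a, y_te)
print(f"R^2 before: {base:.2f}, after ablation: {ablated:.2f}")
```

A large drop after ablation, as in the paper's geographic probes, indicates the signal is concentrated in the ablated subspace; a small drop, as with the temporal probes, indicates a diffuse signal.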

Practical Implications

  • Feature engineering for downstream NLP – Simple static embeddings can serve as a cheap source of location or era cues for tasks like geotagging, historical text analysis, or recommendation systems without resorting to heavyweight LLMs.
  • Benchmark design – Researchers should treat linear probe performance on static embeddings as a baseline; claiming “world‑model” capabilities for LLMs requires stronger evidence (e.g., non‑linear probing, causal interventions).
  • Data‑driven lexicon building – The identified lexical gradients can be harvested to create domain‑specific gazetteers or temporal vocabularies for low‑resource languages where large LLMs are unavailable.
  • Model interpretability tools – The subspace‑ablation technique offers a lightweight way to diagnose which word groups a model relies on for a given prediction, useful for debugging bias (e.g., over‑reliance on country names).

Limitations & Future Work

  • Static embeddings are limited to the training corpus; biases or gaps in the source text directly affect the recoverable world knowledge.
  • Temporal resolution is coarse—the probe captures only broad birth‑year trends, not fine‑grained historical events.
  • Only linear probes were examined; non‑linear or attention‑based probes might uncover additional structure or confirm that LLMs truly go beyond co‑occurrence statistics.
  • Geographic scope is constrained to well‑documented cities; extending to rural or indigenous place names could test the limits of lexical gradients.

Future research could combine static‑embedding baselines with controlled LLM experiments, explore multilingual corpora, and develop probing methods that isolate genuine reasoning from statistical memorization.

Authors

  • Elan Barenholtz

Paper Information

  • arXiv ID: 2603.04317v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: March 4, 2026
