[Paper] From sunblock to softblock: Analyzing the correlates of neology in published writing and on social media
Source: arXiv - 2602.13123v1
Overview
The paper “From sunblock to softblock: Analyzing the correlates of neology in published writing and on social media” investigates why new English words (neologisms) appear in different textual arenas—traditional print media versus Twitter. By extending earlier work that relied on static word embeddings, the authors incorporate contextual embeddings and compare the forces that drive word creation across these two very different communication channels.
Key Contributions
- Cross‑domain analysis: First systematic comparison of neologism drivers in published writing (newspapers, books) and social media (Twitter).
- Contextual embedding methodology: Introduces a pipeline that leverages modern contextual models (e.g., BERT‑style embeddings) alongside classic static vectors to detect and characterize emerging words.
- Replication of prior findings: Confirms that two previously identified factors—semantic novelty and topic popularity growth—correlate with neology in both domains.
- Domain‑specific nuance: Shows that topic popularity growth is a weaker predictor on Twitter, suggesting distinct formation mechanisms (e.g., meme‑driven coinage vs. editorial innovation).
- Open‑source resources: Releases the curated Twitter neologism dataset and code for reproducible embedding‑based analysis.
Methodology
Data collection
- Published writing: A historical corpus spanning several decades of newspapers, magazines, and books (the same source used in Ryskina et al., 2020).
- Twitter: A large‑scale dump of public tweets (≈ 200 M posts) filtered for English and time‑stamped to enable longitudinal tracking.
Neologism identification
- Built a candidate list of words that are absent from a baseline lexicon (e.g., WordNet) and first appear in the corpus after a given cutoff year.
- Applied frequency thresholds and manual spot‑checks to prune noise (typos, hashtags, usernames).
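The candidate-selection step can be sketched in a few lines. This is an illustrative reconstruction, not the authors' released code: the lexicon, corpus counts, threshold value, and the `neologism_candidates` helper are all hypothetical stand-ins for the pipeline described above.

```python
import re

# Hypothetical baseline lexicon and per-year corpus counts (illustrative only).
lexicon = {"sun", "block", "sunblock", "tweet"}
counts_by_year = {
    2019: {"softblock": 2, "sunblock": 40, "xqzvt": 1, "#trend": 9},
    2020: {"softblock": 25, "sunblock": 38, "xqzvt": 1},
}

MIN_FREQ = 5                        # assumed frequency threshold to prune noise
WORD_RE = re.compile(r"^[a-z]+$")   # crude filter for hashtags, usernames, typos

def neologism_candidates(counts_by_year, lexicon, cutoff_year, min_freq=MIN_FREQ):
    """Words absent from the baseline lexicon that clear the frequency
    threshold at or after the cutoff year."""
    candidates = set()
    for year, counts in counts_by_year.items():
        if year < cutoff_year:
            continue
        for word, freq in counts.items():
            if word in lexicon or freq < min_freq:
                continue
            if not WORD_RE.match(word):  # drop non-wordlike strings
                continue
            candidates.add(word)
    return candidates

print(neologism_candidates(counts_by_year, lexicon, cutoff_year=2020))
# {'softblock'}
```

In practice the manual spot-checks mentioned above would follow this automatic pass, since regex and frequency filters alone cannot distinguish a genuine coinage from a persistent typo.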
Embedding extraction
- Static embeddings: Trained word2vec on each corpus slice (yearly).
- Contextual embeddings: Fine‑tuned a BERT‑base model on the same slices and extracted token‑level representations for each candidate word in its surrounding context.
Feature engineering
- Semantic novelty: Measured cosine distance between a candidate’s embedding and the centroid of its nearest semantic neighbors from the prior year.
- Topic popularity growth: Tracked the rise of the most‑associated topics (via LDA) over time; computed the slope of topic frequency leading up to the word’s first appearance.
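The two features can be computed straightforwardly once embeddings and topic frequencies are in hand. The sketch below assumes NumPy arrays for embeddings and a per-year topic-frequency series; the function names are illustrative, not the paper's.

```python
import numpy as np

def semantic_novelty(candidate_vec, neighbor_vecs):
    """Cosine distance between a candidate's embedding and the centroid
    of its nearest semantic neighbors from the prior year."""
    centroid = np.mean(neighbor_vecs, axis=0)
    cos = np.dot(candidate_vec, centroid) / (
        np.linalg.norm(candidate_vec) * np.linalg.norm(centroid))
    return 1.0 - cos  # 0 = identical direction, up to 2 = opposite

def topic_growth_slope(topic_freqs):
    """Least-squares slope of topic frequency over the years leading up
    to the word's first appearance."""
    t = np.arange(len(topic_freqs))
    slope, _intercept = np.polyfit(t, topic_freqs, 1)
    return slope

# Toy usage: a candidate far from its neighborhood, with a rising topic.
novelty = semantic_novelty(np.array([0.0, 1.0]),
                           np.array([[1.0, 0.0], [1.0, 0.1]]))
growth = topic_growth_slope([0.01, 0.03, 0.07, 0.12])
```

A positive `topic_growth_slope` indicates the associated topic was gaining frequency before the word appeared, which is the signal the regression below treats as a predictor.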
Statistical analysis
- Ran logistic regression models predicting whether a candidate becomes a “stable” neologism (survives ≥ 2 years) using the two features.
- Compared coefficients across the two domains to assess relative importance.
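The regression setup can be sketched on synthetic data. Everything below is a stand-in: the data is randomly generated, the "true" weights are invented for illustration, and a plain gradient-descent fit replaces whatever statistics package the authors used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: two features per candidate word
# (semantic novelty, topic-growth slope) and a binary stability label.
n = 500
X = rng.normal(size=(n, 2))
# Labels driven mostly by the first feature -- purely illustrative weights.
logits = 1.2 * X[:, 0] + 0.3 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(float)

def fit_logreg(X, y, lr=0.1, steps=2000):
    """Logistic regression via batch gradient descent (no intercept, for brevity)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))          # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)      # gradient of the log-loss
    return w

beta = fit_logreg(X, y)
# beta[0] should dominate beta[1], mirroring how a stronger driver
# shows up as a larger coefficient in the paper's comparison.
```

Comparing the fitted `beta` vectors across two corpora is then a matter of fitting the same model on each domain's candidates, as in the results table below.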
Results & Findings
| Domain | Semantic Novelty (β) | Topic Popularity Growth (β) | Overall predictive power (AUC) |
|---|---|---|---|
| Published writing | +0.42 (p < 0.001) | +0.31 (p < 0.01) | 0.78 |
| Twitter | +0.38 (p < 0.001) | +0.12 (p = 0.08) | 0.71 |
- Semantic novelty is a strong, consistent driver in both settings: words that are semantically distant from existing vocabulary are more likely to stick.
- Topic popularity growth matters for print media (where editorial cycles align with emerging public issues) but is only marginal on Twitter, where rapid meme cycles and user‑generated humor dominate.
- Contextual embeddings improve detection of subtle neologisms (e.g., “softblock”) that static vectors miss, especially on Twitter where usage contexts are highly variable.
Practical Implications
- NLP product roadmaps: Language models that need up‑to‑date vocabularies (e.g., chatbots, content moderation tools) can prioritize monitoring semantic novelty signals rather than just trending topics, especially for fast‑moving platforms.
- Lexicography & brand monitoring: Companies can flag emerging brand‑related terms earlier by tracking contextual novelty scores, enabling proactive trademark checks or marketing campaigns.
- Social‑media analytics: Tools that surface emerging slang or jargon can weight semantic distance higher than raw hashtag volume, reducing false positives from fleeting memes.
- Curriculum & language‑learning apps: Understanding that new words in formal writing tend to align with rising topics can help educators curate reading lists that expose learners to the most “useful” neologisms.
Limitations & Future Work
- Lexicon bias: The baseline dictionary may already contain informal or domain‑specific terms, potentially under‑estimating neologism rates on platforms like Twitter.
- Temporal granularity: Yearly slices smooth over rapid bursts of Twitter activity; finer granularity (e.g., weekly) could reveal additional dynamics.
- Language scope: The study focuses exclusively on English; cross‑lingual replication would test whether the observed patterns hold in typologically diverse languages.
- Causal inference: Correlation does not imply causation; future work could experiment with controlled interventions (e.g., seeding topics) to test the hypothesized formation mechanisms.
Bottom line: By marrying modern contextual embeddings with classic linguistic theory, this research shows that the “why” behind new word creation is surprisingly consistent across print and social media—yet the “how” diverges, offering actionable insights for anyone building language‑aware technology.
Authors
- Maria Ryskina
- Matthew R. Gormley
- Kyle Mahowald
- David R. Mortensen
- Taylor Berg‑Kirkpatrick
- Vivek Kulkarni
Paper Information
- arXiv ID: 2602.13123v1
- Categories: cs.CL
- Published: February 13, 2026