[Paper] From sunblock to softblock: Analyzing the correlates of neology in published writing and on social media
Source: arXiv - 2602.13123v1
Overview
The paper “From sunblock to softblock: Analyzing the correlates of neology in published writing and on social media” investigates why new English words (neologisms) appear in different textual arenas—traditional print media versus Twitter. By extending earlier work that relied on static word embeddings, the authors incorporate contextual embeddings and compare the forces that drive word creation across these two very different communication channels.
Key Contributions
- Cross‑domain analysis: First systematic comparison of neologism drivers in published writing (newspapers, books) and social media (Twitter).
- Contextual embedding methodology: Introduces a pipeline that leverages modern contextual models (e.g., BERT‑style embeddings) alongside classic static vectors to detect and characterize emerging words.
- Replication of prior findings: Confirms that two previously identified factors—semantic novelty and topic popularity growth—correlate with neology in both domains.
- Domain‑specific nuance: Shows that topic popularity growth is a weaker predictor on Twitter, suggesting distinct formation mechanisms (e.g., meme‑driven coinage vs. editorial innovation).
- Open‑source resources: Releases the curated Twitter neologism dataset and code for reproducible embedding‑based analysis.
Methodology
Data collection
- Published writing: A historical corpus spanning several decades of newspapers, magazines, and books (the same source used in Ryskina et al., 2020).
- Twitter: A large‑scale dump of public tweets (≈ 200 M posts) filtered for English and time‑stamped to enable longitudinal tracking.
Neologism identification
- Built a candidate list of words that are absent from a baseline lexicon (e.g., WordNet) and first appear in the corpus after a given cutoff year.
- Applied frequency thresholds and manual spot‑checks to prune noise (typos, hashtags, usernames).
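The candidate-selection step can be sketched in a few lines. This is an illustrative reconstruction, not the authors' released code: the lexicon, corpus counts, threshold value, and the `neologism_candidates` helper are all hypothetical stand-ins for the pipeline described above.

```python
import re

# Hypothetical baseline lexicon and per-year corpus counts (illustrative only).
lexicon = {"sun", "block", "sunblock", "tweet"}
counts_by_year = {
    2019: {"softblock": 2, "sunblock": 40, "xqzvt": 1, "#trend": 9},
    2020: {"softblock": 25, "sunblock": 38, "xqzvt": 1},
}

MIN_FREQ = 5                        # assumed frequency threshold to prune noise
WORD_RE = re.compile(r"^[a-z]+$")   # crude filter for hashtags, usernames, typos

def neologism_candidates(counts_by_year, lexicon, cutoff_year, min_freq=MIN_FREQ):
    """Words absent from the baseline lexicon that clear the frequency
    threshold at or after the cutoff year."""
    candidates = set()
    for year, counts in counts_by_year.items():
        if year < cutoff_year:
            continue
        for word, freq in counts.items():
            if word in lexicon or freq < min_freq:
                continue
            if not WORD_RE.match(word):  # drop non-wordlike strings
                continue
            candidates.add(word)
    return candidates

print(neologism_candidates(counts_by_year, lexicon, cutoff_year=2020))
# {'softblock'}
```

In practice the manual spot-checks mentioned above would follow this automatic pass, since regex and frequency filters alone cannot distinguish a genuine coinage from a persistent typo.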
Embedding extraction
- Static embeddings: Trained word2vec on each corpus slice (yearly).
- Contextual embeddings: Fine‑tuned a BERT‑base model on the same slices and extracted token‑level representations for each candidate word in its surrounding context.
Feature engineering
- Semantic novelty: Measured cosine distance between a candidate’s embedding and the centroid of its nearest semantic neighbors from the prior year.
- Topic popularity growth: Tracked the rise of the most‑associated topics (via LDA) over time; computed the slope of topic frequency leading up to the word’s first appearance.
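The two features can be computed straightforwardly once embeddings and topic frequencies are in hand. The sketch below assumes NumPy arrays for embeddings and a per-year topic-frequency series; the function names are illustrative, not the paper's.

```python
import numpy as np

def semantic_novelty(candidate_vec, neighbor_vecs):
    """Cosine distance between a candidate's embedding and the centroid
    of its nearest semantic neighbors from the prior year."""
    centroid = np.mean(neighbor_vecs, axis=0)
    cos = np.dot(candidate_vec, centroid) / (
        np.linalg.norm(candidate_vec) * np.linalg.norm(centroid))
    return 1.0 - cos  # 0 = identical direction, up to 2 = opposite

def topic_growth_slope(topic_freqs):
    """Least-squares slope of topic frequency over the years leading up
    to the word's first appearance."""
    t = np.arange(len(topic_freqs))
    slope, _intercept = np.polyfit(t, topic_freqs, 1)
    return slope

# Toy usage: a candidate far from its neighborhood, with a rising topic.
novelty = semantic_novelty(np.array([0.0, 1.0]),
                           np.array([[1.0, 0.0], [1.0, 0.1]]))
growth = topic_growth_slope([0.01, 0.03, 0.07, 0.12])
```

A positive `topic_growth_slope` indicates the associated topic was gaining frequency before the word appeared, which is the signal the regression below treats as a predictor.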
Statistical analysis
- Ran logistic regression models predicting whether a candidate becomes a “stable” neologism (survives ≥ 2 years) using the two features.
- Compared coefficients across the two domains to assess relative importance.
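The regression setup can be sketched on synthetic data. Everything below is a stand-in: the data is randomly generated, the "true" weights are invented for illustration, and a plain gradient-descent fit replaces whatever statistics package the authors used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: two features per candidate word
# (semantic novelty, topic-growth slope) and a binary stability label.
n = 500
X = rng.normal(size=(n, 2))
# Labels driven mostly by the first feature -- purely illustrative weights.
logits = 1.2 * X[:, 0] + 0.3 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(float)

def fit_logreg(X, y, lr=0.1, steps=2000):
    """Logistic regression via batch gradient descent (no intercept, for brevity)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))          # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)      # gradient of the log-loss
    return w

beta = fit_logreg(X, y)
# beta[0] should dominate beta[1], mirroring how a stronger driver
# shows up as a larger coefficient in the paper's comparison.
```

Comparing the fitted `beta` vectors across two corpora is then a matter of fitting the same model on each domain's candidates, as in the results table below.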
Results & Findings
| Domain | Semantic Novelty (β) | Topic Popularity Growth (β) | Overall predictive power (AUC) |
|---|---|---|---|
| Published writing | +0.42 (p < 0.001) | +0.31 (p < 0.01) | 0.78 |
| Twitter | +0.38 (p < 0.001) | +0.12 (p = 0.08) | 0.71 |
- Semantic novelty is a strong, consistent driver in both settings: words that are semantically distant from existing vocabulary are more likely to stick.
- Topic popularity growth matters for print media (where editorial cycles align with emerging public issues) but is only marginal on Twitter, where rapid meme cycles and user‑generated humor dominate.
- Contextual embeddings improve detection of subtle neologisms (e.g., “softblock”) that static vectors miss, especially on Twitter where usage contexts are highly variable.
Practical Implications
- NLP product roadmaps: Language models that need up‑to‑date vocabularies (e.g., chatbots, content moderation tools) can prioritize monitoring semantic novelty signals rather than just trending topics, especially for fast‑moving platforms.
- Lexicography & brand monitoring: Companies can flag emerging brand‑related terms earlier by tracking contextual novelty scores, enabling proactive trademark checks or marketing campaigns.
- Social‑media analytics: Tools that surface emerging slang or jargon can weight semantic distance higher than raw hashtag volume, reducing false positives from fleeting memes.
- Curriculum & language‑learning apps: Understanding that new words in formal writing tend to align with rising topics can help educators curate reading lists that expose learners to the most “useful” neologisms.
Limitations & Future Work
- Lexicon bias: The baseline dictionary may already contain informal or domain‑specific terms, potentially under‑estimating neologism rates on platforms like Twitter.
- Temporal granularity: Yearly slices smooth over rapid bursts of Twitter activity; finer granularity (e.g., weekly) could reveal additional dynamics.
- Language scope: The study focuses exclusively on English; cross‑lingual replication would test whether the observed patterns hold in typologically diverse languages.
- Causal inference: Correlation does not imply causation; future work could experiment with controlled interventions (e.g., seeding topics) to test the hypothesized formation mechanisms.
Bottom line: By marrying modern contextual embeddings with classic linguistic theory, this research shows that the “why” behind new word creation is surprisingly consistent across print and social media—yet the “how” diverges, offering actionable insights for anyone building language‑aware technology.
Authors
- Maria Ryskina
- Matthew R. Gormley
- Kyle Mahowald
- David R. Mortensen
- Taylor Berg‑Kirkpatrick
- Vivek Kulkarni
Paper Information
- arXiv ID: 2602.13123v1
- Categories: cs.CL
- Published: February 13, 2026