[Paper] Accurate and Efficient Statistical Testing for Word Semantic Breadth
Source: arXiv - 2605.08048v1
Overview
The paper tackles a subtle but critical problem in modern NLP: how to reliably compare the semantic breadth of two words using contextualized embeddings. While dispersion‑based statistics can capture how widely a word is used across contexts, naïve hypothesis testing can mistake shifts in meaning direction for genuine breadth differences, leading to false‑positive findings. The author introduces a Householder‑aligned permutation test that cleanly separates dispersion from directional effects, delivering statistically sound and computationally efficient comparisons.
Key Contributions
- Novel statistical test: A permutation test that first aligns the mean vectors of two word token clouds via a single Householder reflection, isolating pure dispersion differences.
- GPU‑friendly implementation: Batches the permutation and linear‑algebra steps to run efficiently on modern GPUs, achieving a 23× speedup over a CPU baseline.
- Empirical validation: Demonstrates a 32.5% relative reduction in the Type‑I error rate (false positives) while maintaining power to detect real breadth differences.
- Practical toolkit: Provides open‑source code that can be dropped into existing embedding pipelines for quick semantic‑breadth analysis.
Methodology
- Token cloud construction – For each target word, collect all its contextualized token embeddings (e.g., from BERT) across a corpus, forming a high‑dimensional point cloud.
- Compute dispersion – Use a simple statistic such as average pairwise distance or covariance trace to quantify how spread out the cloud is.
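- Householder alignment – Compute a Householder matrix and apply it to one cloud so that its mean vector lines up with the mean vector of the other cloud. Because a Householder reflection is orthogonal, it removes the systematic shift in direction while preserving all pairwise distances, and hence the internal geometry, of the reflected cloud.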
- Permutation test – Randomly shuffle token embeddings between the two aligned clouds many times, recompute the dispersion difference for each shuffle, and build a non‑parametric null distribution.
- p‑value extraction – The proportion of shuffled differences that are at least as large as the observed difference yields a calibrated p‑value, free from directional confounds.
All steps are expressed as matrix operations that can be batched and executed on a GPU, dramatically cutting runtime.
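The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the author's released code: the function names (`dispersion`, `householder_align`, `permutation_test`) are hypothetical, the dispersion statistic is taken to be the covariance trace, the reflection matches mean directions only (any difference in mean norm remains), and the permutations run in a plain CPU loop rather than in GPU batches.

```python
import numpy as np


def dispersion(X):
    """Covariance trace: summed per-dimension sample variance of the cloud."""
    centered = X - X.mean(axis=0)
    return float((centered ** 2).sum() / (len(X) - 1))


def householder_align(X, Y, eps=1e-12):
    """Reflect Y so that its mean direction matches the mean direction of X.

    Assumption: only the direction of the mean is aligned. A reflection is
    orthogonal, so it cannot change the norm of Y's mean.
    """
    a = X.mean(axis=0)
    b = Y.mean(axis=0)
    a_hat = a / np.linalg.norm(a)
    b_hat = b / np.linalg.norm(b)
    v = b_hat - a_hat
    norm_v = np.linalg.norm(v)
    if norm_v < eps:  # directions already coincide, nothing to reflect
        return Y.copy()
    v = v / norm_v
    # Apply H = I - 2 v v^T to every row without forming the d x d matrix.
    return Y - 2.0 * np.outer(Y @ v, v)


def permutation_test(X, Y, n_perm=2000, seed=0):
    """Two-sided permutation test on the dispersion difference after alignment."""
    rng = np.random.default_rng(seed)
    Y_aligned = householder_align(X, Y)
    observed = abs(dispersion(X) - dispersion(Y_aligned))

    pooled = np.vstack([X, Y_aligned])
    n = len(X)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        diff = abs(dispersion(pooled[idx[:n]]) - dispersion(pooled[idx[n:]]))
        count += int(diff >= observed)

    # Add-one estimator (an assumption here): keeps the p-value above zero.
    p_value = (count + 1) / (n_perm + 1)
    return observed, p_value


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Synthetic stand-ins for token clouds: same spread, different mean direction.
    X = rng.normal(size=(300, 32)) + 4.0 * np.eye(32)[0]
    Y = rng.normal(size=(300, 32)) + 4.0 * np.eye(32)[1]
    observed, p_value = permutation_test(X, Y, n_perm=500)
    print(f"observed difference: {observed:.4f}, p-value: {p_value:.4f}")
```

In practice `X` and `Y` would be the contextualized embeddings of the two target words rather than synthetic Gaussians. Applying the reflection row by row avoids building the full d × d Householder matrix, which is the kind of operation that batches naturally if the loop is later moved to a GPU array library.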
Results & Findings
- Type‑I error control: On synthetic datasets where the true breadth is identical but the means differ, the aligned test cuts the false‑positive rate from ~0.10 to ~0.067, a relative reduction of about 32.5%.
- Statistical power: When genuine breadth differences are introduced, the test retains comparable detection rates to the naïve approach, confirming that alignment does not over‑correct.
- Performance: Processing 10,000 token embeddings with 10,000 permutations runs in ~0.8 s on an NVIDIA RTX 3090, versus ~18 s on a 16‑core CPU.
- Real‑world case study: Applied to domain‑specific vocabularies (e.g., medical vs. legal jargon), the method surfaces words whose breadth truly varies across domains, aiding dictionary curation.
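The headline numbers can be sanity-checked with simple arithmetic. The snippet below uses only the rounded figures quoted in this summary (the variable names are illustrative), so it should reproduce the reported values only approximately.

```python
# Rounded figures as quoted in the summary above.
naive_rate = 0.10      # false-positive rate of the naive test
aligned_rate = 0.067   # false-positive rate of the aligned test
cpu_seconds = 18.0     # 16-core CPU runtime
gpu_seconds = 0.8      # RTX 3090 runtime

# Relative (not absolute) reduction in the false-positive rate.
relative_reduction = (naive_rate - aligned_rate) / naive_rate
speedup = cpu_seconds / gpu_seconds

print(f"relative Type-I reduction: {relative_reduction:.1%}")
print(f"speedup: {speedup:.1f}x")
```

The rounded inputs give roughly 33% and 22.5×, in line with the reported 32.5% and 23× once rounding of the underlying rates and timings is taken into account.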
Practical Implications
- Thesaurus and dictionary building: Lexicographers can automatically flag words that need finer sense splits or merging, based on statistically sound breadth comparisons.
- Domain adaptation: NLP engineers can identify which terms exhibit broader usage in a target domain, informing vocabulary selection for specialized language models.
- Bias and fairness audits: By measuring how evenly a word’s meaning spreads across demographic contexts, teams can spot subtle representation gaps.
- Feature engineering: Dispersion metrics, now reliably testable, can be added as features in downstream tasks such as word sense disambiguation or semantic similarity scoring.
- Scalable research: The GPU implementation makes it feasible to run thousands of breadth comparisons across large corpora, opening doors for large‑scale linguistic analyses.
Limitations & Future Work
- Dependence on embedding quality: The test inherits any biases or noise present in the underlying contextual model; poor embeddings could mask true breadth differences.
- Assumption of isotropy: Aligning only the means may not fully account for more complex shape differences (e.g., anisotropic covariances) that could still affect dispersion estimates.
- Permutation budget: While GPU batching speeds up computation, very high permutation counts (needed for extremely low p‑values) can still be costly.
- Future directions suggested by the author include extending the alignment to match higher‑order moments, integrating adaptive permutation schemes, and evaluating the method on multilingual embeddings.
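The permutation-budget limitation follows from how permutation p-values are estimated. Assuming the common add-one estimator (count + 1) / (B + 1), which this summary does not confirm the paper uses, the smallest p-value attainable with B permutations is 1 / (B + 1). The helper name below is hypothetical.

```python
def smallest_p_value(n_perm):
    """Smallest p-value the add-one estimator can return with n_perm shuffles."""
    return 1 / (n_perm + 1)


for n_perm in (99, 9_999, 999_999):
    print(f"{n_perm:>7} permutations -> minimum p-value {smallest_p_value(n_perm):.0e}")
```

Under this assumption, resolving p-values near 1e-6 requires on the order of a million permutations per word pair, which is why very small significance thresholds remain expensive even with batching.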
Authors
- Yo Ehara
Paper Information
- arXiv ID: 2605.08048v1
- Categories: cs.CL
- Published: May 8, 2026