[Paper] Accurate and Efficient Statistical Testing for Word Semantic Breadth

Published: May 8, 2026 at 01:38 PM EDT
Source: arXiv - 2605.08048v1

Overview

The paper tackles a subtle but critical problem in modern NLP: how to reliably compare the semantic breadth of two words using contextualized embeddings. While dispersion‑based statistics can capture how widely a word is used across contexts, naïve hypothesis testing can mistake shifts in meaning direction for genuine breadth differences, leading to false‑positive findings. The author introduces a Householder‑aligned permutation test that cleanly separates dispersion from directional effects, delivering statistically sound and computationally efficient comparisons.

Key Contributions

  • Novel statistical test: A permutation test that first aligns the mean vectors of two word token clouds via a single Householder reflection, isolating pure dispersion differences.
  • GPU‑friendly implementation: Batches the permutation and linear‑algebra steps to run efficiently on modern GPUs, achieving a 23× speedup over a CPU baseline.
  • Empirical validation: Demonstrates a 32.5% relative reduction in the Type‑I error rate (false positives) while maintaining power to detect real breadth differences.
  • Practical toolkit: Provides open‑source code that can be dropped into existing embedding pipelines for quick semantic‑breadth analysis.

Methodology

  1. Token cloud construction – For each target word, collect all its contextualized token embeddings (e.g., from BERT) across a corpus, forming a high‑dimensional point cloud.
  2. Compute dispersion – Use a simple statistic such as average pairwise distance or covariance trace to quantify how spread out the cloud is.
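  3. Householder alignment – Compute a Householder matrix that reflects one cloud so that its mean vector lines up with the other cloud’s mean. A Householder reflection is orthogonal, so it preserves vector norms and all pairwise distances: the systematic shift in direction is removed while the internal geometry of each cloud is preserved (the two means coincide exactly only when they have equal length).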
  4. Permutation test – Randomly shuffle token embeddings between the two aligned clouds many times, recompute the dispersion difference for each shuffle, and build a non‑parametric null distribution.
  5. p‑value extraction – The proportion of shuffled differences that exceed the observed difference yields a calibrated p‑value, free from directional confounds.

All steps are expressed as matrix operations that can be batched and executed on a GPU, dramatically cutting runtime.
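
The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the author's released code: the function names (`dispersion`, `householder_align`, `aligned_permutation_test`) are hypothetical, the covariance trace is chosen as the dispersion statistic, the p‑value is two‑sided with the common add‑one correction, and NumPy stands in for a GPU array library. The alignment matches the directions of the two mean vectors and assumes both means are non‑zero.

```python
import numpy as np


def dispersion(X):
    """Mean squared distance to the centroid (trace of the biased covariance).

    Accepts arrays of shape (..., n_tokens, dim), so a whole batch of
    shuffles can be scored in one call.
    """
    centered = X - X.mean(axis=-2, keepdims=True)
    return (centered ** 2).sum(axis=-1).mean(axis=-1)


def householder_align(X, Y, eps=1e-12):
    """Reflect cloud X so that its mean direction matches that of cloud Y."""
    u = X.mean(axis=0)
    v = Y.mean(axis=0)
    u_hat = u / np.linalg.norm(u)  # assumes a non-zero mean
    v_hat = v / np.linalg.norm(v)
    w = u_hat - v_hat
    norm_w = np.linalg.norm(w)
    if norm_w < eps:  # directions already agree, nothing to reflect
        return X.copy()
    w = w / norm_w
    # Apply H = I - 2 w w^T to every row without materializing H.
    return X - 2.0 * np.outer(X @ w, w)


def aligned_permutation_test(X, Y, n_perm=2000, seed=0):
    """Return (observed dispersion difference, two-sided permutation p-value)."""
    rng = np.random.default_rng(seed)
    Xa = householder_align(X, Y)
    observed = dispersion(Xa) - dispersion(Y)

    pooled = np.concatenate([Xa, Y], axis=0)
    n_x = len(Xa)
    # Batched shuffles: argsort of random keys yields one permutation per row.
    perms = np.argsort(rng.random((n_perm, len(pooled))), axis=1)
    shuffled = pooled[perms]  # shape (n_perm, n_x + n_y, dim)
    null = dispersion(shuffled[:, :n_x]) - dispersion(shuffled[:, n_x:])

    # Add-one correction keeps the estimate strictly positive (an assumption
    # of this sketch, not a detail taken from the paper).
    p_value = (1 + np.sum(np.abs(null) >= abs(observed))) / (n_perm + 1)
    return float(observed), float(p_value)


# Toy data: equal spread, but the means point in different directions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
X[:, 0] += 5.0
Y = rng.normal(size=(200, 16))
Y[:, 1] += 5.0
diff, p = aligned_permutation_test(X, Y, n_perm=500)
print(f"dispersion difference = {diff:.3f}, p = {p:.3f}")
```

Materializing every shuffle at once, as done here, trades memory for speed; for large clouds the permutations would need to be processed in chunks.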

Results & Findings

  • Type‑I error control: On synthetic datasets where the true breadth is identical but means differ, the aligned test cuts false‑positive rates from ~0.10 to ~0.067 (≈32.5% reduction).
  • Statistical power: When genuine breadth differences are introduced, the test retains comparable detection rates to the naïve approach, confirming that alignment does not over‑correct.
  • Performance: Processing 10,000 token embeddings with 10,000 permutations runs in ~0.8 s on an NVIDIA RTX 3090, versus ~18 s on a 16‑core CPU.
  • Real‑world case study: Applied to domain‑specific vocabularies (e.g., medical vs. legal jargon), the method surfaces words whose breadth truly varies across domains, aiding dictionary curation.
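
The headline percentages can be cross‑checked against the rounded values quoted in the list above. The short calculation below assumes the approximate error rates (~0.10 and ~0.067) and timings (~0.8 s and ~18 s) exactly as stated in this summary; the variable names are illustrative.

```python
# Rounded figures as quoted in this summary (approximate values).
naive_fpr = 0.10
aligned_fpr = 0.067
gpu_seconds = 0.8
cpu_seconds = 18.0

# Relative reduction in the false-positive rate.
relative_reduction = (naive_fpr - aligned_fpr) / naive_fpr

# Wall-clock speedup of the GPU run over the CPU baseline.
speedup = cpu_seconds / gpu_seconds

print(f"relative Type-I error reduction: {relative_reduction:.1%}")
print(f"GPU speedup: {speedup:.1f}x")
```

With these rounded inputs the relative reduction comes out near 33% and the speedup near 22.5×, which agrees with the reported 32.5% and 23× up to rounding of the quoted values.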

Practical Implications

  • Thesaurus and dictionary building: Lexicographers can automatically flag words that need finer sense splits or merging, based on statistically sound breadth comparisons.
  • Domain adaptation: NLP engineers can identify which terms exhibit broader usage in a target domain, informing vocabulary selection for specialized language models.
  • Bias and fairness audits: By measuring how evenly a word’s meaning spreads across demographic contexts, teams can spot subtle representation gaps.
  • Feature engineering: Dispersion metrics, now reliably testable, can be added as features in downstream tasks such as word sense disambiguation or semantic similarity scoring.
  • Scalable research: The GPU implementation makes it feasible to run thousands of breadth comparisons across large corpora, opening doors for large‑scale linguistic analyses.

Limitations & Future Work

  • Dependence on embedding quality: The test inherits any biases or noise present in the underlying contextual model; poor embeddings could mask true breadth differences.
  • Assumption of isotropy: Aligning only the means may not fully account for more complex shape differences (e.g., anisotropic covariances) that could still affect dispersion estimates.
  • Permutation budget: While GPU batching speeds up computation, very high permutation counts (needed for extremely low p‑values) can still be costly.
  • Future directions suggested by the author include extending the alignment to match higher‑order moments, integrating adaptive permutation schemes, and evaluating the method on multilingual embeddings.
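
The permutation‑budget point can be made concrete with a little arithmetic. The sketch below assumes the common add‑one p‑value estimator, for which the smallest attainable p‑value with B permutations is 1/(B + 1), and assumes that a batched implementation stores one 64‑bit index per token per permutation. Both function names are hypothetical, and the example sizes are for illustration only.

```python
def smallest_p_value(n_perm):
    """Smallest p-value an add-one permutation estimate can report."""
    return 1.0 / (n_perm + 1)


def index_memory_gib(n_perm, n_tokens, bytes_per_index=8):
    """Memory for a fully materialized (n_perm x n_tokens) index matrix, in GiB."""
    return n_perm * n_tokens * bytes_per_index / 2**30


for n_perm in (999, 9_999, 999_999):
    print(
        f"B = {n_perm:>7,}: min p = {smallest_p_value(n_perm):.0e}, "
        f"indices for 10,000 tokens = {index_memory_gib(n_perm, 10_000):.2f} GiB"
    )
```

Resolving p‑values near 1e‑6 therefore takes on the order of a million permutations, at which point a single batch no longer fits in typical GPU memory and the shuffles have to be processed in chunks.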

Authors

  • Yo Ehara

Paper Information

  • arXiv ID: 2605.08048v1
  • Categories: cs.CL
  • Published: May 8, 2026