[Paper] TaxoBell: Gaussian Box Embeddings for Self-Supervised Taxonomy Expansion

Published: January 14, 2026 at 12:08 PM EST

Source: arXiv - 2601.09633v1

Overview

TaxoBell introduces a novel way to grow taxonomies automatically by representing concepts as Gaussian‑parameterized boxes rather than ordinary point vectors. By marrying box geometry with multivariate Gaussian distributions, the model captures both hierarchical “is‑a” relations and the uncertainty that real‑world concepts often exhibit, delivering a sizable boost in taxonomy‑expansion performance.

Key Contributions

  • Gaussian Box Embeddings: A unified representation that maps each box to a Gaussian (mean + covariance), enabling containment (hypernym‑hyponym) and uncertainty modeling.
  • Stable Energy‑Based Training: An energy function that avoids gradient explosions at box intersections, ensuring reliable convergence.
  • Handling Polysemy & Ambiguity: Covariance matrices naturally encode semantic spread, allowing a single node to reflect multiple senses.
  • State‑of‑the‑Art Empirical Gains: Outperforms eight recent taxonomy‑expansion baselines by ~19 % MRR and ~25 % Recall@k on five benchmark datasets.
  • Comprehensive Analysis: Includes error breakdowns, ablation studies, and visualizations that illustrate how Gaussian uncertainty improves hierarchical reasoning.

Methodology

  1. Embedding Space: Each taxonomy node is assigned a box in a high‑dimensional Euclidean space. The box’s lower‑left and upper‑right corners are derived from a Gaussian’s mean vector (center) and covariance matrix (shape).
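As an illustration of step 1, a diagonal Gaussian can be mapped to box corners placed a fixed number of standard deviations from the mean. The helper name and the choice k = 2 are assumptions for this sketch; the paper's exact mean/covariance-to-corner mapping may differ.

```python
def gaussian_to_box(mu, sigma, k=2.0):
    # Map a diagonal Gaussian (mean mu, per-dimension std sigma) to an
    # axis-aligned box whose corners sit k standard deviations from the
    # mean in each dimension. k is an illustrative choice, not the paper's.
    lo = [m - k * s for m, s in zip(mu, sigma)]
    hi = [m + k * s for m, s in zip(mu, sigma)]
    return lo, hi
```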

  2. Containment as Hierarchy: A hypernym’s box fully contains its hyponyms’ boxes. The probability that a point sampled from a child Gaussian lies inside the parent box is used as a containment score.

  3. Energy Function:

    \[ \mathcal{E}(c, p) = -\log \Pr\big[\, \mathbf{x} \sim \mathcal{N}(\mu_c, \Sigma_c) \in \text{Box}(p) \,\big] \]

    where \(c\) is a child and \(p\) a candidate parent. Minimizing this energy pushes child boxes inside parent boxes while respecting uncertainty.
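For the diagonal-covariance case the implementation uses, this energy can be sketched directly: with independent dimensions, the containment probability factorizes into a product of 1-D CDF differences. The function names and the clamping constant below are illustrative, not taken from the paper's code.

```python
import math

def normal_cdf(x, mu, sigma):
    # CDF of a 1-D Gaussian N(mu, sigma^2), via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def containment_prob(mu, sigma, box_lo, box_hi):
    # Pr[x ~ N(mu, diag(sigma^2)) lies inside the axis-aligned box].
    # With a diagonal covariance the dimensions are independent, so the
    # joint probability is a product of per-dimension CDF differences.
    p = 1.0
    for m, s, lo, hi in zip(mu, sigma, box_lo, box_hi):
        p *= max(normal_cdf(hi, m, s) - normal_cdf(lo, m, s), 1e-12)
    return p

def energy(mu, sigma, box_lo, box_hi):
    # E(c, p) = -log Pr[x ~ N(mu_c, Sigma_c) in Box(p)]
    return -math.log(containment_prob(mu, sigma, box_lo, box_hi))
```

A child Gaussian well inside a wide parent box gets near-zero energy; shrinking the parent box raises the energy, which is what training penalizes.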

  4. Self‑Supervised Signal: The model starts from a seed taxonomy and treats existing parent‑child links as positive pairs; all other pairs are treated as negatives. No external labels are required.
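A minimal sketch of this pair construction, assuming uniform random sampling of non-parents as negatives (the paper's exact negative-sampling scheme may differ):

```python
import random

def make_training_pairs(edges, nodes, num_neg=5, seed=0):
    # Positives are existing (parent, child) links from the seed taxonomy;
    # negatives pair each child with randomly sampled nodes that are not
    # among its known parents. No external labels are needed.
    rng = random.Random(seed)
    parents = {}
    for p, c in edges:
        parents.setdefault(c, set()).add(p)
    pairs = []
    for p, c in edges:
        pairs.append((p, c, 1))  # positive pair
        candidates = [n for n in nodes
                      if n != c and n not in parents.get(c, set())]
        for n in rng.sample(candidates, min(num_neg, len(candidates))):
            pairs.append((n, c, 0))  # negative pair
    return pairs
```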

  5. Optimization: Stochastic gradient descent with a soft‑intersection trick (using smooth approximations of min/max) yields stable gradients even when boxes barely touch. Covariance matrices are constrained to stay positive‑definite via a Cholesky parameterization.
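The two tricks in step 5 can be illustrated as follows. `soft_overlap`, the temperature `tau`, and the 2-D `cholesky_cov` helper are illustrative assumptions, not the authors' code: the first replaces the hard intersection length (whose gradient vanishes for disjoint boxes) with a softplus, and the second builds a positive-definite covariance from unconstrained parameters.

```python
import math

def softplus(x):
    # Numerically stable softplus, a smooth version of max(0, x).
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def soft_overlap(lo_a, hi_a, lo_b, hi_b, tau=1.0):
    # Smooth length of the 1-D intersection of [lo_a, hi_a] and [lo_b, hi_b].
    # The hard form max(0, min(hi_a, hi_b) - max(lo_a, lo_b)) has zero
    # gradient once boxes are disjoint; softplus keeps gradients alive.
    hard = min(hi_a, hi_b) - max(lo_a, lo_b)
    return tau * softplus(hard / tau)

def cholesky_cov(raw_diag, raw_offdiag):
    # 2-D example of the Cholesky parameterization: Sigma = L @ L^T with
    # L lower triangular, exponentiating the diagonal so Sigma stays
    # positive-definite for any unconstrained inputs.
    l11, l22 = math.exp(raw_diag[0]), math.exp(raw_diag[1])
    l21 = raw_offdiag
    return [[l11 * l11,        l11 * l21],
            [l11 * l21, l21 * l21 + l22 * l22]]
```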

Results & Findings

  Dataset               MRR (TaxoBell)   Δ vs. best baseline   Recall@5   Δ vs. best baseline
  DBpedia‑Animals       0.71             +0.19                 0.84       +0.26
  WordNet‑Nouns         0.68             +0.18                 0.81       +0.24
  E‑Commerce (Amazon)   0.73             +0.20                 0.86       +0.27
  PubMed‑MeSH           0.66             +0.17                 0.78       +0.22
  OpenCyc               0.69             +0.19                 0.82       +0.25
  • Uncertainty matters: Nodes with high covariance (e.g., “apple” covering both fruit and company) correctly attach to multiple plausible parents, reducing false negatives.
  • Ablation: Removing the covariance term drops MRR by ~7 %; replacing the Gaussian‑box mapping with plain boxes loses ~10 % of Recall@k.
  • Error analysis: Most remaining errors stem from extremely sparse concepts where contextual clues are insufficient, not from the embedding geometry.

Practical Implications

  • E‑commerce catalog automation: Retail platforms can ingest new product titles and instantly place them into the correct category hierarchy, cutting manual curation time by weeks.
  • Semantic search & recommendation: Search engines can leverage the learned containment scores to expand query concepts on‑the‑fly, improving recall without sacrificing precision.
  • Knowledge‑graph maintenance: Enterprises maintaining large ontologies (e.g., biomedical vocabularies) can use TaxoBell to suggest new “is‑a” links, flagging ambiguous terms for human review.
  • API‑friendly implementation: The authors release a PyTorch library that exposes an embed(term) function returning a (mean, cov) pair and a score(child, parent) containment function, making integration into existing pipelines straightforward.
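A sketch of how such an interface might be used. The class below is a mock stand-in for the released library: its name, method signatures, and toy score (a covariance-weighted distance between means) are assumptions for illustration, not the actual API.

```python
class TaxoBellLike:
    """Mock of an embed/score interface; not the released library."""

    def __init__(self):
        self._store = {}  # term -> (mean, diagonal covariance)

    def add(self, term, mean, cov):
        self._store[term] = (mean, cov)

    def embed(self, term):
        # Return the (mean, cov) pair for a term.
        return self._store[term]

    def score(self, child, parent):
        # Toy attachment score: higher when the child's mean lies close
        # to the parent's mean relative to the parent's variances.
        cm, _ = self._store[child]
        pm, pv = self._store[parent]
        return -sum((c - p) ** 2 / v for c, p, v in zip(cm, pm, pv))
```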

Limitations & Future Work

  • Scalability of full covariance: Storing a dense \(d \times d\) covariance for each node can be memory‑heavy; the current implementation uses diagonal covariances, which may limit expressiveness for highly correlated dimensions.
  • Dependence on seed taxonomy quality: Noisy or incomplete seed hierarchies can propagate errors; future work could incorporate noise‑robust loss functions or external textual cues.
  • Cross‑lingual extension: The current experiments are monolingual; extending Gaussian box embeddings to multilingual taxonomies is an open research direction.

TaxoBell demonstrates that marrying geometric containment with probabilistic uncertainty yields a powerful, developer‑friendly tool for scaling taxonomies in real‑world systems.

Authors

  • Sahil Mishra
  • Srinitish Srinivasan
  • Srikanta Bedathur
  • Tanmoy Chakraborty

Paper Information

  • arXiv ID: 2601.09633v1
  • Categories: cs.CL
  • Published: January 14, 2026