[Paper] LoST: Level of Semantics Tokenization for 3D Shapes

Published: March 18, 2026 at 01:56 PM EDT
4 min read
Source: arXiv - 2603.17995v1

Overview

The paper introduces LoST (Level‑of‑Semantics Tokenization), a new way to break down 3‑D shapes into discrete tokens that are ordered by their semantic importance rather than just geometric detail. By doing so, early tokens already capture the “big picture” of an object, while later tokens add finer geometric nuances. This semantic‑first tokenization dramatically improves the efficiency and quality of autoregressive (AR) 3‑D generative models.

Key Contributions

  • Semantic‑driven token ordering: Tokens are arranged from coarse, semantically meaningful components to fine‑grained geometry, enabling early decoding of plausible shapes.
  • Relational Inter‑Distance Alignment (RIDA): A novel loss that aligns the relational structure of a shape’s latent space with that of DINO‑derived semantic features, ensuring semantic coherence across tokens.
  • State‑of‑the‑art reconstruction: LoST outperforms prior level‑of‑detail (LoD) tokenizers on both geometric fidelity (e.g., Chamfer Distance) and semantic consistency metrics.
  • Token efficiency: Achieves comparable or better results using only 0.1 %–10 % of the tokens required by existing AR 3‑D models.
  • Downstream utility: Demonstrates that the learned tokens support tasks such as semantic shape retrieval without additional fine‑tuning.

Methodology

  1. Semantic Feature Extraction – Each 3‑D shape is rendered from multiple views and processed by a pretrained DINO vision transformer to obtain a high‑level semantic descriptor.
  2. Latent Space Construction – A variational auto‑encoder (VAE) encodes the raw mesh into a latent vector.
  3. RIDA Loss – The pairwise distances between latent vectors of different shapes are forced to match the pairwise distances between their DINO semantic descriptors. This aligns the geometry‑latent space with the semantic space, encouraging the VAE to preserve semantic relationships.
  4. Token Sequencing – The latent vector is quantized into a sequence of discrete tokens. Tokens are sorted by semantic salience (derived from the RIDA‑aligned latent space), so the first few tokens already reconstruct a coarse, semantically correct shape.
  5. Autoregressive Generation – An AR transformer predicts the token sequence. Because early tokens carry most of the semantic content, the model can generate recognizable shapes after only a few steps, refining details as more tokens are sampled.
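The RIDA step (step 3) can be sketched as a relational alignment loss: penalize mismatch between the pairwise-distance matrix of the latent vectors and that of the DINO descriptors. This is a minimal sketch under stated assumptions — the function name `rida_loss`, the Euclidean metric, and the mean-normalization (which makes the loss invariant to the overall scale of either embedding) are illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X, shape (n, d)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.maximum(d2, 0.0))  # clamp tiny negatives from round-off

def rida_loss(latents, semantic_feats):
    """Align the relational structure of the latent space with that of the
    DINO semantic space: both distance matrices are normalized by their mean,
    then compared with a mean-squared error."""
    Dz = pairwise_distances(latents)
    Ds = pairwise_distances(semantic_feats)
    Dz = Dz / (Dz.mean() + 1e-8)
    Ds = Ds / (Ds.mean() + 1e-8)
    return float(np.mean((Dz - Ds) ** 2))
```

Because of the normalization, a semantic space that is merely a rescaled copy of the latent space incurs zero loss; only changes in the *relative* arrangement of shapes are penalized.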

Results & Findings

Compared with LoD‑based baselines, LoST reports:

  • Chamfer Distance (lower is better): ~30 % improvement
  • Semantic Consistency (higher is better): ~45 % improvement
  • Tokens per shape: 0.1 %–10 % of prior AR models
  • Generation speed: ~5× faster, owing to the much shorter token sequences

Qualitatively, shapes generated after just 5–10 tokens already resemble the target class (e.g., a chair’s backrest and seat), whereas LoD‑based methods need dozens of tokens before the object becomes recognizable. The authors also show that using the LoST token embeddings for nearest‑neighbor search yields more semantically accurate retrieval results than raw geometry‑based descriptors.
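The nearest‑neighbor retrieval experiment reduces to a similarity search over token embeddings. The sketch below uses cosine similarity as the metric; `retrieve` and that metric choice are assumptions for illustration, not the paper's exact protocol:

```python
import numpy as np

def retrieve(query_emb, db_embs, k=3):
    """Return the indices of the k database embeddings most similar to the
    query, ranked by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q                    # cosine similarity of each row vs. query
    return np.argsort(-sims)[:k]     # indices sorted by descending similarity
```

In the paper's setting, the database rows would be LoST token embeddings of library assets, so the ranking reflects semantic rather than purely geometric similarity.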

Practical Implications

  • Faster 3‑D content pipelines – Game studios and AR/VR developers can generate high‑quality assets on‑the‑fly with far fewer compute cycles, reducing cloud costs.
  • Progressive streaming – Because early tokens convey a usable coarse model, applications can stream a low‑resolution but semantically correct shape first, then refine it client‑side as more tokens arrive.
  • Semantic search & indexing – Asset libraries can index LoST tokens for rapid, meaning‑aware retrieval, improving designers’ workflow when looking for “modern office chairs” versus “vintage stools.”
  • Compact storage – Storing only the token sequence (instead of full meshes) can shrink 3‑D model databases by orders of magnitude, beneficial for mobile or edge devices.
  • Better autoregressive generative tools – Artists using text‑to‑3‑D or sketch‑to‑3‑D systems get semantically coherent outputs sooner, since the model fixes the high‑level shape early in the autoregressive generation process.
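The progressive‑streaming idea amounts to decoding successively longer token prefixes: because tokens are semantically ordered, every prefix already decodes to a plausible coarse shape. In this sketch, `progressive_decode`, the checkpoint schedule, and `decode_fn` are all hypothetical stand‑ins for a real LoST decoder:

```python
def progressive_decode(tokens, decode_fn, checkpoints=(5, 10, 50)):
    """Yield (prefix_length, reconstruction) pairs for increasingly long
    token prefixes. decode_fn maps a token prefix to a shape; with
    semantically ordered tokens, each yield refines the previous one."""
    for k in checkpoints:
        yield k, decode_fn(tokens[:k])
```

A client could render the first yield immediately and swap in refinements as more tokens arrive over the network.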

Limitations & Future Work

  • Dependence on 2‑D semantic features – RIDA leverages DINO features extracted from rendered views; any bias or failure in the 2‑D encoder propagates to the 3‑D tokenization.
  • Scalability to highly complex scenes – The current experiments focus on single objects; extending LoST to whole scenes with multiple interacting entities remains an open challenge.
  • Resolution of fine details – While the token count is drastically reduced, ultra‑fine geometric nuances (e.g., intricate carvings) may still require additional tokens or a hybrid approach.
  • Generalization across domains – The method was evaluated on common shape datasets (e.g., ShapeNet). Future work could explore domain adaptation to CAD models, medical scans, or point‑cloud‑only data.

Overall, LoST opens a promising path toward more semantically aware and efficient 3‑D generative pipelines, bridging the gap between high‑level understanding and low‑level geometry.

Authors

  • Niladri Shekhar Dutt
  • Zifan Shi
  • Paul Guerrero
  • Chun‑Hao Paul Huang
  • Duygu Ceylan
  • Niloy J. Mitra
  • Xuelin Chen

Paper Information

  • arXiv ID: 2603.17995v1
  • Categories: cs.CV, cs.GR, cs.LG
  • Published: March 18, 2026