[Paper] Language and Geometry Grounded Sparse Voxel Representations for Holistic Scene Understanding
Source: arXiv - 2602.15734v1
Overview
A new paper from Guile Wu et al. tackles a long‑standing gap in 3‑D scene understanding: most open‑vocabulary methods pull language cues from 2‑D vision models but ignore how those cues should interact with the scene’s actual geometry. By grounding sparse voxel representations in both language and geometry, the authors deliver a unified model that simultaneously reasons about appearance, semantics, and 3‑D structure—leading to more accurate scene reconstruction and richer, language‑driven queries.
Key Contributions
- Sparse‑voxel primitive framework that hosts four complementary fields: appearance, density, semantic feature, and confidence.
- Feature‑modulation module that tightly couples appearance, density, and semantic features, ensuring they reinforce each other during learning.
- Dual‑distillation pipeline:
- Language distillation from a 2‑D foundation model (e.g., CLIP) into the 3‑D feature field.
- Geometry distillation from a geometry‑focused foundation model using depth‑correlation and pattern‑consistency regularizers.
- Unified training objective that balances visual fidelity, semantic alignment, and geometric correctness.
- State‑of‑the‑art results on holistic scene understanding benchmarks, outperforming prior methods in both semantic segmentation and reconstruction quality.
Methodology
- Sparse Voxel Representation – The scene is partitioned into a sparse 3‑D grid of voxels. Each voxel stores:
  - Appearance (RGB color)
  - Density (occupancy for volume rendering)
  - Feature (a high‑dimensional semantic embedding)
  - Confidence (how reliable the voxel’s information is)
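The four per‑voxel fields can be sketched as a simple container. Field names and dimensions below are illustrative assumptions, not the paper's actual data layout:

```python
import torch

class SparseVoxelGrid:
    """Sparse set of active voxels, each carrying the four fields
    described above (a hypothetical layout, not the paper's)."""
    def __init__(self, coords: torch.Tensor, feat_dim: int = 16):
        n = coords.shape[0]
        self.coords = coords                      # (N, 3) integer voxel indices
        self.appearance = torch.zeros(n, 3)       # RGB color per voxel
        self.density = torch.zeros(n, 1)          # occupancy for volume rendering
        self.feature = torch.zeros(n, feat_dim)   # semantic embedding
        self.confidence = torch.ones(n, 1)        # reliability estimate in [0, 1]

# Example: a grid with four active voxels
coords = torch.tensor([[0, 0, 0], [1, 0, 0], [0, 1, 0], [2, 2, 2]])
grid = SparseVoxelGrid(coords, feat_dim=16)
```

Only occupied voxels are stored, which is what keeps memory usage low relative to a dense grid.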
- Feature Modulation – A lightweight MLP uses the appearance and density values as gates that modulate the semantic feature vector. This encourages the three fields to evolve together rather than in isolation.
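A minimal sketch of this gating idea, assuming a sigmoid gate produced from the concatenated appearance and density values (the paper's exact architecture is not specified here):

```python
import torch
import torch.nn as nn

class FeatureModulation(nn.Module):
    """Gate the semantic feature with appearance and density cues.
    A sketch of the coupling idea; layer sizes are assumptions."""
    def __init__(self, feat_dim: int = 16):
        super().__init__()
        # Map (RGB + density) -> a per-channel gate for the feature vector
        self.gate = nn.Sequential(
            nn.Linear(3 + 1, feat_dim),
            nn.Sigmoid(),  # gate values in (0, 1)
        )

    def forward(self, appearance, density, feature):
        g = self.gate(torch.cat([appearance, density], dim=-1))
        return feature * g  # modulated semantic feature

mod = FeatureModulation(feat_dim=16)
out = mod(torch.rand(4, 3), torch.rand(4, 1), torch.rand(4, 16))
```

Because gradients flow through the gate into all three fields, updates to one field influence the others during training.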
- Language Distillation – Images of the scene are fed to a pretrained 2‑D vision‑language model (e.g., CLIP). The resulting text‑aligned embeddings are projected onto the voxel feature field via a contrastive loss, teaching voxels to carry open‑vocabulary semantics.
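One common way to realize such contrastive alignment is an InfoNCE‑style loss over matched voxel/teacher embedding pairs. The sketch below is an assumption about the loss form, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def language_distillation_loss(voxel_feats, teacher_feats, temperature=0.07):
    """InfoNCE-style loss aligning voxel features with the 2-D teacher's
    embeddings; matched pairs sit on the diagonal of the similarity matrix."""
    v = F.normalize(voxel_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    logits = v @ t.T / temperature           # pairwise cosine similarities
    targets = torch.arange(v.shape[0])       # i-th voxel matches i-th teacher embed
    return F.cross_entropy(logits, targets)

# Example with random features distilled toward a copy of themselves
v = torch.randn(6, 16)
loss = language_distillation_loss(v, v.clone())
```

Because the teacher embeddings live in CLIP's text‑aligned space, voxels trained this way can later be compared against arbitrary text prompts.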
- Geometry Distillation – A separate geometry foundation model provides depth maps and surface normal cues. Two regularizers align the voxel‑derived depth (via volume rendering) with the teacher depth (depth‑correlation) and enforce consistent local patterns (pattern‑consistency), transferring geometric priors into the voxel features.
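The depth‑correlation regularizer can be read as maximizing the Pearson correlation between rendered and teacher depth, which tolerates the teacher's unknown scale and shift. This is one plausible formulation, not necessarily the paper's:

```python
import torch

def depth_correlation_loss(rendered_depth, teacher_depth, eps=1e-8):
    """Encourage rendered depth to correlate with the teacher depth map.
    Correlation (rather than absolute error) is invariant to the teacher's
    global scale and offset; a hypothetical reading of 'depth correlation'."""
    r = rendered_depth.flatten() - rendered_depth.mean()
    t = teacher_depth.flatten() - teacher_depth.mean()
    corr = (r * t).sum() / (r.norm() * t.norm() + eps)
    return 1.0 - corr  # 0 when perfectly correlated

# Example: a map is perfectly correlated with any affine rescaling of itself
d = torch.rand(8, 8)
```

A pattern‑consistency term would act similarly on local neighborhoods rather than the whole map; its exact form is not detailed in this summary.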
- Training Loop – The model optimizes a combined loss: rendering photometric error, semantic contrastive loss, depth‑correlation loss, pattern‑consistency loss, and a confidence‑weighted sparsity term that prunes irrelevant voxels.
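A sketch of how the confidence‑weighted sparsity term and the combined objective might look; the weights and functional forms are illustrative assumptions, not values from the paper:

```python
import torch

def confidence_sparsity_loss(density, confidence):
    """Push density toward zero where confidence is low, pruning voxels
    that carry little reliable information (a hypothetical form)."""
    return ((1.0 - confidence) * density.abs()).mean()

def total_loss(l_photo, l_sem, l_depth, l_pattern, l_sparse,
               w=(1.0, 0.5, 0.2, 0.2, 0.01)):
    """Weighted sum of the five terms; weights are illustrative only."""
    terms = (l_photo, l_sem, l_depth, l_pattern, l_sparse)
    return sum(wi * ti for wi, ti in zip(w, terms))
```

In practice the loss weights would be tuned per benchmark; nothing here reflects the authors' actual hyper‑parameters.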
Results & Findings
- Semantic Accuracy – On the ScanNet‑200 benchmark, the method improves mean IoU by ~4 % over the previous best open‑vocabulary approach.
- Reconstruction Quality – PSNR and Chamfer‑L1 distance improve by roughly 7 %, indicating tighter alignment with the true scene shape.
- Ablation Studies – Removing the geometry distillation drops semantic IoU by 2 % and reconstruction PSNR by 1.5 dB, confirming the synergistic effect of geometry and language.
- Efficiency – Sparse voxel storage keeps memory usage comparable to dense NeRF‑style models while delivering faster inference (≈2× speedup on a single RTX 4090).
Practical Implications
- Enhanced AR/VR Content Creation – Developers can generate semantically rich 3‑D assets from ordinary RGB‑D scans, enabling natural‑language search (“find the red chair”) directly inside virtual environments.
- Robotics & Autonomous Navigation – Robots can query the map with language (“where is the nearest exit?”) while still relying on accurate geometry for path planning.
- Asset Management for Game Engines – Game studios can ingest scanned environments and instantly obtain both high‑quality meshes and searchable semantic tags, cutting down manual labeling time.
- Cross‑modal Retrieval – The unified feature field makes it straightforward to index scenes for multimodal retrieval (e.g., “show me all rooms with a window facing east”).
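Such language queries reduce to a similarity search over the voxel feature field. A toy sketch, assuming a text embedding from a CLIP‑style encoder; `query_voxels` is a hypothetical helper, not part of the paper:

```python
import torch
import torch.nn.functional as F

def query_voxels(text_embedding, voxel_feats, top_k=5):
    """Rank voxels by cosine similarity to a text embedding.
    Real systems would aggregate matches per object or region."""
    sims = F.normalize(voxel_feats, dim=-1) @ F.normalize(text_embedding, dim=-1)
    return sims.topk(min(top_k, sims.shape[0]))

# Example: with one-hot features, the query picks out the matching voxel
feats = torch.eye(4)
text = torch.tensor([0.0, 1.0, 0.0, 0.0])
vals, idx = query_voxels(text, feats, top_k=2)
```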
Limitations & Future Work
- Dependence on 2‑D Teacher Quality – The semantic richness is bounded by the capabilities of the underlying 2‑D vision‑language model; rare or domain‑specific concepts may still be missed.
- Sparse Voxel Resolution Trade‑off – While memory‑efficient, very fine geometric details (e.g., thin wires) can be lost unless the voxel grid is heavily up‑sampled, which impacts speed.
- Limited Real‑World Evaluation – Experiments focus on indoor benchmarks; extending to large‑scale outdoor scenes or dynamic environments remains an open challenge.
- Future Directions – The authors suggest integrating temporal cues for dynamic scenes, exploring larger multimodal teachers (e.g., video‑language models), and developing adaptive voxel sparsity schemes that allocate resolution where semantics or geometry demand it.
Authors
- Guile Wu
- David Huang
- Bingbing Liu
- Dongfeng Bai
Paper Information
- arXiv ID: 2602.15734v1
- Categories: cs.CV
- Published: February 17, 2026