[Paper] Native and Compact Structured Latents for 3D Generation
Source: arXiv - 2512.14692v1
Overview
The paper introduces O‑Voxel, a novel sparse voxel format that natively stores both geometry and rich surface attributes (e.g., material parameters) for 3D objects. By coupling O‑Voxel with an aggressively compressive variational auto‑encoder (the Sparse Compression VAE) and a 4‑billion‑parameter flow‑matching generator, the authors achieve state‑of‑the‑art realism on assets with complex, non‑manifold topologies while keeping inference fast enough for practical use.
Key Contributions
- O‑Voxel representation – an “omni‑voxel” data structure that simultaneously encodes occupancy, surface normals, and physically‑based rendering (PBR) material maps in a sparse format.
- Sparse Compression VAE – a VAE that aggressively compresses the high‑dimensional O‑Voxel grid into a compact latent vector without sacrificing detail.
- Large‑scale flow‑matching generator – a 4 B‑parameter model trained on multiple public 3D asset collections, capable of unconditional 3D generation at inference speeds comparable to lightweight voxel decoders.
- Demonstrated superiority – quantitative and qualitative evaluations show markedly higher geometry fidelity and material realism than prior voxel‑based or implicit‑field generators.
- Open‑source pipeline – the authors release code, pretrained weights, and tools for converting existing mesh/point‑cloud datasets into O‑Voxel, facilitating reproducibility.
Methodology
- Data preparation – Raw meshes and point clouds are voxelized into a sparse 3D grid. Each occupied cell stores a small set of channels: binary occupancy, surface normal, albedo, roughness, metallic, and emissivity. The sparsity is exploited via a hash‑based octree that materializes only active voxels (a minimal container sketch follows this list).
- Sparse Compression VAE
  - Encoder: A series of sparse 3D convolutions (implemented with MinkowskiEngine) compresses the O‑Voxel grid into a latent vector (≈128‑dim).
  - Decoder: Mirrors the encoder, reconstructing the full O‑Voxel from the latent code; a learned quantization step encourages compactness.
  - Training loss: Combines a standard VAE KL term, a per‑channel reconstruction loss (L2 for continuous attributes, BCE for occupancy), and a perceptual geometry loss that penalizes surface deviation (see the loss sketch after this list).
- Flow‑matching generator – Instead of a GAN, the authors adopt a continuous normalizing flow formulation: the model learns to map a simple Gaussian distribution to the VAE's compressed latent space via a time‑dependent neural ODE, which yields stable training at massive scale (see the flow‑matching sketch after this list).
- Inference – Sample a Gaussian vector, run the flow‑matching network to obtain a latent code, decode with the VAE, and finally convert the O‑Voxel back to a mesh (e.g., marching cubes) with PBR material textures ready for real‑time rendering.
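The following is a minimal sketch of a sparse omni‑voxel container, assuming a flat coordinate‑to‑features hash map in place of the paper's hash‑based octree; the channel widths and ordering mirror the attribute list above but are assumptions, not the paper's exact layout.

```python
import numpy as np

# Minimal stand-in for the O-Voxel container: only occupied cells are stored,
# each with the per-channel surface attributes listed above. The paper uses a
# hash-based octree; a flat coordinate hash map is the simplest substitute.
CHANNELS = {
    "normal": 3,     # unit surface normal
    "albedo": 3,     # PBR base color
    "roughness": 1,
    "metallic": 1,
    "emissive": 1,
}
FEAT_DIM = sum(CHANNELS.values())  # 9 attribute floats per occupied voxel

class SparseVoxelGrid:
    """Occupancy is implicit: a cell is occupied iff its key is present."""

    def __init__(self, resolution: int):
        self.resolution = resolution
        self.cells: dict[tuple[int, int, int], np.ndarray] = {}

    def set_voxel(self, ijk: tuple[int, int, int], feats: np.ndarray) -> None:
        assert feats.shape == (FEAT_DIM,)
        self.cells[ijk] = feats.astype(np.float32)

    def to_arrays(self):
        """Export as (N, 3) int coords plus (N, F) float features, the
        layout sparse-conv libraries such as MinkowskiEngine consume."""
        coords = np.array(list(self.cells.keys()), dtype=np.int32)
        feats = np.stack(list(self.cells.values()))
        return coords, feats
```

MinkowskiEngine's sparse tensors take exactly this coordinates‑plus‑features pair, which is why an encoder like the one described above can operate directly on the exported arrays.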
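For the training objective, here is a condensed PyTorch sketch of the composite loss described above; the weighting coefficients and tensor shapes are assumptions, and the perceptual geometry term is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def vae_loss(occ_logits, occ_target,   # (N,) occupancy logits, {0,1} floats
             attr_pred, attr_target,   # (M, F) attributes on occupied cells
             mu, logvar,               # (B, D) latent Gaussian parameters
             beta=1e-3, lam_attr=1.0): # illustrative weights, not the paper's
    # BCE on occupancy, L2 on continuous attribute channels
    recon_occ = F.binary_cross_entropy_with_logits(occ_logits, occ_target)
    recon_attr = F.mse_loss(attr_pred, attr_target)
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), averaged over the batch
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_occ + lam_attr * recon_attr + beta * kl
```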
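And a sketch of the flow‑matching step, assuming a rectified‑flow (straight‑path) interpolant with Euler integration at inference; the paper's exact interpolant, ODE solver, and step count are not specified here, and `model(z, t)` stands in for the 4 B‑parameter velocity network.

```python
import torch

def flow_matching_loss(model, z_data):
    """Train the velocity field on straight noise->data paths in latent space."""
    z_noise = torch.randn_like(z_data)
    t = torch.rand(z_data.shape[0], 1, device=z_data.device)
    z_t = (1 - t) * z_noise + t * z_data   # linear interpolant at time t
    v_target = z_data - z_noise            # constant target velocity
    return torch.mean((model(z_t, t) - v_target) ** 2)

@torch.no_grad()
def sample_latent(model, dim, steps=50, device="cpu"):
    """Euler integration of the learned ODE from Gaussian noise to a latent."""
    z = torch.randn(1, dim, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1, 1), i * dt, device=device)
        z = z + model(z, t) * dt
    return z  # decode with the Sparse Compression VAE, then mesh
```

Decoding the sampled latent with the VAE and running marching cubes on the occupancy channel completes the inference path described above.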
Results & Findings
- Geometry quality – Chamfer Distance (CD, sketched after this list) improves by ~35 % over the best prior voxel‑GAN on the ShapeNetCore benchmark; the method also handles open surfaces and non‑manifold edges that earlier implicit methods cannot represent.
- Material realism – Measured via a learned material similarity metric, O‑Voxel assets achieve a 0.22 reduction in error compared to baseline neural‑SDF approaches that only output color.
- Compression – The Sparse Compression VAE reduces storage from ~10 MB per high‑resolution O‑Voxel to ~200 KB per latent vector (≈50× compression) while preserving visual fidelity.
- Speed – End‑to‑end generation (sampling + decoding) runs at ~30 ms on a single RTX 4090, comparable to lightweight point‑cloud generators and far faster than full implicit field solvers (≈300 ms).
- Scalability – Training on 2 M diverse assets (chairs, vehicles, characters) demonstrates that the flow‑matching model does not suffer from mode collapse and can synthesize novel categories not seen during training.
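For reference, here is Chamfer Distance as commonly defined for point sets, in a brute‑force sketch; real evaluations typically use KD‑trees or GPU batching, and whether the paper reports squared or unsquared distances is not stated here.

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a: (N, 3), b: (M, 3) point sets sampled from two surfaces.
    Sum of mean nearest-neighbour squared distances in both directions."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # (N, M)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```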
Practical Implications
- Game & VR asset pipelines – Developers can generate high‑quality, physically‑based 3‑D assets on‑the‑fly, dramatically reducing manual modeling time for background props or procedural worlds.
- AR content creation – The compact latent vectors enable streaming of 3‑D assets over bandwidth‑limited networks; the decoder can run on edge GPUs to reconstruct full‑fidelity models in real time.
- Digital twins & simulation – Accurate geometry + material parameters make the generated assets suitable for physics‑based simulation (e.g., lighting, collision) without a separate material authoring step.
- Data augmentation for downstream tasks – Synthetic O‑Voxel assets can be converted back to meshes/point clouds to enrich training data for detection, segmentation, or pose estimation models.
- Tooling integration – Because O‑Voxel is a sparse voxel grid, it plugs directly into existing voxel‑based engines (e.g., Unity voxel‑terrain plugins, NVIDIA Omniverse) and can be converted to standard formats (OBJ/glTF) with minimal loss, as in the export sketch below.
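A minimal sketch of that last conversion step, assuming `scikit-image` for marching cubes and `trimesh` for export; the resolution, the 0.5 iso‑level, and the omission of PBR material baking are all simplifications rather than the paper's pipeline.

```python
import numpy as np
from skimage import measure
import trimesh

def ovoxel_to_mesh_file(coords: np.ndarray, resolution: int, path: str):
    """coords: (N, 3) int indices of occupied voxels."""
    # Densify the sparse occupancy into a binary volume for marching cubes.
    occ = np.zeros((resolution,) * 3, dtype=np.float32)
    occ[coords[:, 0], coords[:, 1], coords[:, 2]] = 1.0
    verts, faces, normals, _ = measure.marching_cubes(occ, level=0.5)
    mesh = trimesh.Trimesh(vertices=verts, faces=faces, vertex_normals=normals)
    mesh.export(path)  # '.glb' / '.obj' inferred from the file extension

# Example: ovoxel_to_mesh_file(coords, 256, "asset.glb")
```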
Limitations & Future Work
- Resolution trade‑off – While sparsity mitigates memory, extremely fine details (sub‑millimeter) still require higher voxel resolutions, which can increase inference time.
- Material scope – The current channel set covers basic PBR parameters; more exotic effects (subsurface scattering, anisotropy) are not yet encoded.
- Conditional generation – The model is primarily unconditional; extending it to accept textual prompts or semantic sketches would broaden applicability.
- Cross‑modal consistency – Aligning generated geometry with corresponding texture atlases or animation rigs remains an open challenge.
The authors suggest exploring hierarchical O‑Voxel structures, richer material encodings, and multimodal conditioning as next steps.
Authors
- Jianfeng Xiang
- Xiaoxue Chen
- Sicheng Xu
- Ruicheng Wang
- Zelong Lv
- Yu Deng
- Hongyuan Zhu
- Yue Dong
- Hao Zhao
- Nicholas Jing Yuan
- Jiaolong Yang
Paper Information
- arXiv ID: 2512.14692v1
- Categories: cs.CV, cs.AI
- Published: December 16, 2025
- PDF: https://arxiv.org/pdf/2512.14692v1