[Paper] Native and Compact Structured Latents for 3D Generation
Source: arXiv - 2512.14692v1
Overview
The paper introduces O‑Voxel, a novel sparse voxel format that natively stores both geometry and rich surface attributes (e.g., material parameters) for 3D objects. By coupling O‑Voxel with an aggressively compressive variational auto‑encoder (the Sparse Compression VAE) and a 4‑billion‑parameter flow‑matching generator, the authors achieve state‑of‑the‑art realism on assets with complex, non‑manifold topologies while keeping inference fast enough for practical use.
Key Contributions
- O‑Voxel representation – an “omni‑voxel” data structure that simultaneously encodes occupancy, surface normals, and physically‑based rendering (PBR) material maps in a sparse format.
- Sparse Compression VAE – a VAE that aggressively compresses the high‑dimensional O‑Voxel grid into a compact latent vector without sacrificing detail.
- Large‑scale flow‑matching generator – a 4 B‑parameter model trained on multiple public 3D asset collections, capable of unconditional 3D generation at inference speeds comparable to lightweight voxel decoders.
- Demonstrated superiority – quantitative and qualitative evaluations show markedly higher geometry fidelity and material realism than prior voxel‑based or implicit‑field generators.
- Open‑source pipeline – the authors release code, pretrained weights, and tools for converting existing mesh/point‑cloud datasets into O‑Voxel, facilitating reproducibility.
Methodology
- Data preparation – Raw meshes and point clouds are voxelized into a sparse 3D grid. Each occupied cell stores a small set of channels: binary occupancy, surface normal, albedo, roughness, metallic, and emissivity. The sparsity is exploited via a hash‑based octree that materializes only active voxels (a minimal container sketch follows this list).
- Sparse Compression VAE
  - Encoder: A series of sparse 3D convolutions (implemented with MinkowskiEngine) compresses the O‑Voxel grid into a latent vector (≈128‑dim).
  - Decoder: Mirrors the encoder, reconstructing the full O‑Voxel from the latent code; a learned quantization step encourages compactness.
  - Training loss: Combines a standard VAE KL term, a per‑channel reconstruction loss (L2 for continuous attributes, BCE for occupancy), and a perceptual geometry loss that penalizes surface deviation (see the loss sketch after this list).
- Flow‑matching generator – Instead of a GAN, the authors adopt a continuous normalizing flow formulation: the model learns to map a simple Gaussian distribution to the VAE's compressed latent space via a time‑dependent neural ODE, which yields stable training at massive scale (see the flow‑matching sketch after this list).
- Inference – Sample a Gaussian vector, run the flow‑matching network to obtain a latent code, decode with the VAE, and finally convert the O‑Voxel back to a mesh (e.g., marching cubes) with PBR material textures ready for real‑time rendering.
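The following is a minimal sketch of a sparse omni‑voxel container, assuming a flat coordinate‑to‑features hash map in place of the paper's hash‑based octree; the channel widths and ordering mirror the attribute list above but are assumptions, not the paper's exact layout.

```python
import numpy as np

# Minimal stand-in for the O-Voxel container: only occupied cells are stored,
# each with the per-channel surface attributes listed above. The paper uses a
# hash-based octree; a flat coordinate hash map is the simplest substitute.
CHANNELS = {
    "normal": 3,     # unit surface normal
    "albedo": 3,     # PBR base color
    "roughness": 1,
    "metallic": 1,
    "emissive": 1,
}
FEAT_DIM = sum(CHANNELS.values())  # 9 attribute floats per occupied voxel

class SparseVoxelGrid:
    """Occupancy is implicit: a cell is occupied iff its key is present."""

    def __init__(self, resolution: int):
        self.resolution = resolution
        self.cells: dict[tuple[int, int, int], np.ndarray] = {}

    def set_voxel(self, ijk: tuple[int, int, int], feats: np.ndarray) -> None:
        assert feats.shape == (FEAT_DIM,)
        self.cells[ijk] = feats.astype(np.float32)

    def to_arrays(self):
        """Export as (N, 3) int coords plus (N, F) float features, the
        layout sparse-conv libraries such as MinkowskiEngine consume."""
        coords = np.array(list(self.cells.keys()), dtype=np.int32)
        feats = np.stack(list(self.cells.values()))
        return coords, feats
```

MinkowskiEngine's sparse tensors take exactly this coordinates‑plus‑features pair, which is why an encoder like the one described above can operate directly on the exported arrays.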
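For the training objective, here is a condensed PyTorch sketch of the composite loss described above; the weighting coefficients and tensor shapes are assumptions, and the perceptual geometry term is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def vae_loss(occ_logits, occ_target,   # (N,) occupancy logits, {0,1} floats
             attr_pred, attr_target,   # (M, F) attributes on occupied cells
             mu, logvar,               # (B, D) latent Gaussian parameters
             beta=1e-3, lam_attr=1.0): # illustrative weights, not the paper's
    # BCE on occupancy, L2 on continuous attribute channels
    recon_occ = F.binary_cross_entropy_with_logits(occ_logits, occ_target)
    recon_attr = F.mse_loss(attr_pred, attr_target)
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), averaged over the batch
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_occ + lam_attr * recon_attr + beta * kl
```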
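And a sketch of the flow‑matching step, assuming a rectified‑flow (straight‑path) interpolant with Euler integration at inference; the paper's exact interpolant, ODE solver, and step count are not specified here, and `model(z, t)` stands in for the 4 B‑parameter velocity network.

```python
import torch

def flow_matching_loss(model, z_data):
    """Train the velocity field on straight noise->data paths in latent space."""
    z_noise = torch.randn_like(z_data)
    t = torch.rand(z_data.shape[0], 1, device=z_data.device)
    z_t = (1 - t) * z_noise + t * z_data   # linear interpolant at time t
    v_target = z_data - z_noise            # constant target velocity
    return torch.mean((model(z_t, t) - v_target) ** 2)

@torch.no_grad()
def sample_latent(model, dim, steps=50, device="cpu"):
    """Euler integration of the learned ODE from Gaussian noise to a latent."""
    z = torch.randn(1, dim, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1, 1), i * dt, device=device)
        z = z + model(z, t) * dt
    return z  # decode with the Sparse Compression VAE, then mesh
```

Decoding the sampled latent with the VAE and running marching cubes on the occupancy channel completes the inference path described above.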
Results & Findings
- Geometry quality – Chamfer Distance (CD, sketched after this list) improves by ~35 % over the best prior voxel‑GAN on the ShapeNetCore benchmark; the method also handles open surfaces and non‑manifold edges that earlier implicit methods cannot represent.
- Material realism – Measured via a learned material similarity metric, O‑Voxel assets achieve a 0.22 reduction in error compared to baseline neural‑SDF approaches that only output color.
- Compression – The Sparse Compression VAE reduces storage from ~10 MB per high‑resolution O‑Voxel to ~200 KB per latent vector (≈50× compression) while preserving visual fidelity.
- Speed – End‑to‑end generation (sampling + decoding) runs at ~30 ms on a single RTX 4090, comparable to lightweight point‑cloud generators and far faster than full implicit field solvers (≈300 ms).
- Scalability – Training on 2 M diverse assets (chairs, vehicles, characters) demonstrates that the flow‑matching model does not suffer from mode collapse and can synthesize novel categories not seen during training.
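For reference, here is Chamfer Distance as commonly defined for point sets, in a brute‑force sketch; real evaluations typically use KD‑trees or GPU batching, and whether the paper reports squared or unsquared distances is not stated here.

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a: (N, 3), b: (M, 3) point sets sampled from two surfaces.
    Sum of mean nearest-neighbour squared distances in both directions."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # (N, M)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```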
Practical Implications
- Game & VR asset pipelines – Developers can generate high‑quality, physically‑based 3‑D assets on‑the‑fly, dramatically reducing manual modeling time for background props or procedural worlds.
- AR content creation – The compact latent vectors enable streaming of 3‑D assets over bandwidth‑limited networks; the decoder can run on edge GPUs to reconstruct full‑fidelity models in real time.
- Digital twins & simulation – Accurate geometry + material parameters make the generated assets suitable for physics‑based simulation (e.g., lighting, collision) without a separate material authoring step.
- Data augmentation for downstream tasks – Synthetic O‑Voxel assets can be converted back to meshes/point clouds to enrich training data for detection, segmentation, or pose estimation models.
- Tooling integration – Because O‑Voxel is a sparse voxel grid, it plugs directly into existing voxel‑based engines (e.g., Unity voxel‑terrain plugins, NVIDIA Omniverse) and can be converted to standard formats (OBJ/glTF) with minimal loss, as in the export sketch below.
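A minimal sketch of that last conversion step, assuming `scikit-image` for marching cubes and `trimesh` for export; the resolution, the 0.5 iso‑level, and the omission of PBR material baking are all simplifications rather than the paper's pipeline.

```python
import numpy as np
from skimage import measure
import trimesh

def ovoxel_to_mesh_file(coords: np.ndarray, resolution: int, path: str):
    """coords: (N, 3) int indices of occupied voxels."""
    # Densify the sparse occupancy into a binary volume for marching cubes.
    occ = np.zeros((resolution,) * 3, dtype=np.float32)
    occ[coords[:, 0], coords[:, 1], coords[:, 2]] = 1.0
    verts, faces, normals, _ = measure.marching_cubes(occ, level=0.5)
    mesh = trimesh.Trimesh(vertices=verts, faces=faces, vertex_normals=normals)
    mesh.export(path)  # '.glb' / '.obj' inferred from the file extension

# Example: ovoxel_to_mesh_file(coords, 256, "asset.glb")
```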
Limitations & Future Work
- Resolution trade‑off – While sparsity mitigates memory, extremely fine details (sub‑millimeter) still require higher voxel resolutions, which can increase inference time.
- Material scope – The current channel set covers basic PBR parameters; more exotic effects (subsurface scattering, anisotropy) are not yet encoded.
- Conditional generation – The model is primarily unconditional; extending it to accept textual prompts or semantic sketches would broaden applicability.
- Cross‑modal consistency – Aligning generated geometry with corresponding texture atlases or animation rigs remains an open challenge.
The authors suggest exploring hierarchical O‑Voxel structures, richer material encodings, and multimodal conditioning as next steps.
Authors
- Jianfeng Xiang
- Xiaoxue Chen
- Sicheng Xu
- Ruicheng Wang
- Zelong Lv
- Yu Deng
- Hongyuan Zhu
- Yue Dong
- Hao Zhao
- Nicholas Jing Yuan
- Jiaolong Yang
Paper Information
- arXiv ID: 2512.14692v1
- Categories: cs.CV, cs.AI
- Published: December 16, 2025
- PDF: https://arxiv.org/pdf/2512.14692v1