[Paper] TexSpot: 3D Texture Enhancement with Spatially-uniform Point Latent Representation
Source: arXiv - 2602.12157v1
Overview
TexSpot tackles a long‑standing pain point in 3‑D graphics: generating high‑quality, view‑consistent textures for arbitrary meshes. By introducing a new “Texlet” representation that blends the flexibility of point‑based textures with the compactness of UV maps, the authors build a diffusion‑based enhancer that can polish textures produced by existing multi‑view pipelines while preserving geometric fidelity.
Key Contributions
- Texlet representation: A spatially‑uniform latent token for each surface point that stores a local 2‑D texture patch, learned via a joint 2‑D/3‑D encoder pipeline.
- Cascaded 3‑D‑to‑2‑D decoder: Reconstructs high‑resolution texture patches from Texlet latents, enabling a compact yet expressive texture space.
- Diffusion transformer for enhancement: Trains a diffusion model conditioned on Texlets to refine textures generated by any multi‑view diffusion method, improving consistency across viewpoints.
- Comprehensive evaluation: Demonstrates superior visual fidelity, geometric consistency, and robustness compared with state‑of‑the‑art 3‑D texture generation and enhancement techniques.
Methodology
1. Texlet Construction
- Sample a uniform set of points on the mesh surface.
- For each point, extract a small 2‑D texture patch (e.g., 32×32 pixels) from the initial texture.
- Encode each patch with a lightweight 2‑D CNN encoder to obtain a local latent vector.
- Feed all local vectors into a shared 3‑D encoder (e.g., PointNet++ style) that injects global shape context, producing the final Texlet latent for that point.
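The construction steps above can be sketched in a toy form. This is not the authors' implementation: the 2‑D CNN is replaced with a linear projection and the PointNet‑style encoder with a shared layer plus max‑pooling, purely to keep the sketch runnable; all dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_patch(patch, W):
    """Stand-in for the lightweight 2-D CNN encoder: flatten + linear map.
    (A linear projection is used here only to keep the sketch self-contained.)"""
    return np.tanh(patch.reshape(-1) @ W)

def texlet_latents(patches, W_local, W_global):
    # Per-point local latents, one per 32x32 texture patch.
    local = np.stack([encode_patch(p, W_local) for p in patches])   # (N, d)
    # PointNet-style global context: shared transform + max-pool over points.
    global_feat = np.max(np.tanh(local @ W_global), axis=0)         # (d,)
    # Final Texlet = local latent fused with broadcast global shape context.
    n, d = local.shape
    return np.concatenate([local, np.broadcast_to(global_feat, (n, d))], axis=1)

N, d = 256, 64
patches = rng.standard_normal((N, 32, 32, 3))           # toy RGB patches
W_local = rng.standard_normal((32 * 32 * 3, d)) * 0.01
W_global = rng.standard_normal((d, d)) * 0.1
texlets = texlet_latents(patches, W_local, W_global)
print(texlets.shape)  # (256, 128)
```

The key structural idea survives the simplification: each point keeps its own local texture code, while a pooled global feature injects mesh-level context into every Texlet.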
2. 3‑D‑to‑2‑D Decoding
- A cascade of decoders first expands the global latent into a coarse 2‑D feature map, then refines it into the full‑resolution texture patch.
- This design keeps memory usage low while allowing the model to reconstruct fine‑grained details.
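A minimal sketch of the coarse-to-fine idea, under assumed dimensions (8×8 coarse stage, 32×32 output): the learned decoders are replaced with linear maps and nearest-neighbour upsampling, so this illustrates only the cascade structure, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

def decode_texlet(latent, W_coarse, W_refine, out_res=32):
    """Two-stage cascade: latent -> coarse 8x8 feature map -> refined 32x32 patch."""
    coarse = np.tanh(latent @ W_coarse).reshape(8, 8, 3)        # coarse stage
    # Nearest-neighbour upsample to full resolution, then a per-pixel
    # refinement (stand-in for the learned refinement decoder).
    up = coarse.repeat(out_res // 8, axis=0).repeat(out_res // 8, axis=1)
    return np.tanh(up @ W_refine)                               # (32, 32, 3)

d = 128
latent = rng.standard_normal(d)
W_coarse = rng.standard_normal((d, 8 * 8 * 3)) * 0.05
W_refine = rng.standard_normal((3, 3)) * 0.5
patch = decode_texlet(latent, W_coarse, W_refine)
print(patch.shape)  # (32, 32, 3)
```

The memory argument follows from the structure: the expensive full-resolution tensor exists only at the final stage, while everything upstream operates on the compact latent or the coarse map.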
3. Diffusion‑Based Enhancement
- A transformer‑style diffusion model receives the noisy Texlet latents and learns to denoise them conditioned on the underlying geometry.
- The diffusion process iteratively refines the latent space, which is then decoded back into high‑quality texture patches.
- Because the diffusion operates on the compact Texlet space, the method is fast and scalable to high‑resolution meshes.
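The denoising loop can be sketched with a standard DDPM-style ancestral sampler over the Texlet latents. The schedule and update rule below are the generic DDPM formulation, not taken from the paper; the geometry-conditioned transformer is replaced by a trivial placeholder function so the loop runs end to end.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy DDPM noise schedule over the compact Texlet latent space.
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(x_t, t, geometry_feat):
    """Placeholder for the geometry-conditioned diffusion transformer:
    predicts the noise component of x_t (here: a fixed linear guess)."""
    return 0.1 * (x_t - geometry_feat)

def enhance(x_T, geometry_feat):
    x = x_T
    for t in reversed(range(T)):
        eps = denoiser(x, t, geometry_feat)
        # DDPM posterior-mean update (ancestral sampling step).
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x

latents = rng.standard_normal((256, 128))   # noisy Texlets (N points, d dims)
geom = rng.standard_normal(128)             # per-mesh geometry conditioning
clean = enhance(latents, geom)
print(clean.shape)  # (256, 128)
```

Because each step touches only an (N, d) latent array rather than a full-resolution texture, the per-step cost is independent of the final texture resolution, which is the efficiency claim the bullet above makes.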
4. Training & Integration
- The system is trained end‑to‑end on a curated dataset of meshes with ground‑truth textures.
- At inference, TexSpot can be plugged after any multi‑view diffusion generator (e.g., DreamFusion‑style pipelines) to boost the final texture quality.
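The plug-in integration described above amounts to a three-stage post-processor. The sketch below is purely illustrative: every class, method, and name is hypothetical (the paper does not publish this API), and the stubs exist only to make the control flow concrete.

```python
class StubEnhancer:
    """Hypothetical TexSpot-style enhancer interface (names are illustrative)."""
    def encode(self, mesh, texture):
        # Texture -> per-point Texlet latents.
        return [("texlet", p) for p in texture]
    def denoise(self, texlets):
        # Latent-space diffusion refinement (identity here).
        return texlets
    def decode(self, mesh, texlets):
        # Texlet latents -> refined texture patches.
        return [p for _, p in texlets]

def texture_pipeline(mesh, generate, enhancer):
    coarse = generate(mesh)                  # any multi-view texture generator
    texlets = enhancer.encode(mesh, coarse)  # lift into Texlet space
    return enhancer.decode(mesh, enhancer.denoise(texlets))

out = texture_pipeline("mesh", lambda m: [1, 2, 3], StubEnhancer())
print(out)  # [1, 2, 3]
```

The design point is that the generator and the enhancer only share the coarse texture, so any multi-view pipeline can sit in the `generate` slot without retraining TexSpot.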
Results & Findings
- Visual fidelity: User studies and PSNR/SSIM metrics show a 15‑20 % improvement over the best prior point‑based and UV‑based methods.
- View consistency: Renderings from drastically different camera angles exhibit far fewer seams and color shifts, confirming the spatial uniformity of Texlets.
- Resolution scalability: TexSpot successfully generates textures up to 4K resolution without exploding memory, thanks to the latent compression.
- Robustness: The diffusion enhancer tolerates noisy or incomplete initial textures (e.g., from low‑sample multi‑view diffusion) and still converges to clean results.
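For reference, the PSNR figure quoted above follows the standard definition, 10·log10(MAX²/MSE); the snippet below computes it with that textbook formula (this is not the authors' evaluation code).

```python
import numpy as np

def psnr(ref, test, max_val=1.0):
    """Peak signal-to-noise ratio between two images/textures in [0, max_val]."""
    mse = np.mean((ref - test) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.zeros((8, 8))
noisy = ref + 0.1                    # uniform error of 0.1 -> MSE = 0.01
print(round(psnr(ref, noisy), 1))    # 20.0
```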
Practical Implications
- Game & VR asset pipelines: Artists can feed coarse textures from rapid prototyping tools into TexSpot to obtain production‑grade, view‑consistent textures without manual UV unwrapping.
- 3‑D content marketplaces: Automated up‑scaling of user‑submitted meshes becomes feasible, reducing the need for manual retouching.
- AR/VR streaming: Because TexSpot works on a compact latent representation, it can be integrated into edge‑computing scenarios where bandwidth is limited but high‑quality textures are required.
- Cross‑modal generation: The Texlet space could serve as a bridge for text‑to‑3‑D pipelines, enabling language‑driven texture refinement without re‑training a full diffusion model for each new asset.
Limitations & Future Work
- Dependence on point density: Extremely sparse point samplings still limit the finest texture details; adaptive sampling strategies could mitigate this.
- Training data bias: The model is trained on synthetic datasets with relatively clean geometry; performance on noisy real‑world scans may degrade.
- Real‑time constraints: While more efficient than full‑resolution diffusion, the iterative diffusion steps still add latency, suggesting future work on accelerated denoising (e.g., distilled diffusion or GAN‑based shortcuts).
- Extension to dynamic meshes: Current formulation assumes static geometry; extending Texlets to handle deformable or animated surfaces is an open direction.
Authors
- Ziteng Lu
- Yushuang Wu
- Chongjie Ye
- Yuda Qiu
- Jing Shao
- Xiaoyang Guo
- Jiaqing Zhou
- Tianlei Hu
- Kun Zhou
- Xiaoguang Han
Paper Information
- arXiv ID: 2602.12157v1
- Categories: cs.CV, cs.GR
- Published: February 12, 2026