[Paper] Splatent: Splatting Diffusion Latents for Novel View Synthesis

Published: December 10, 2025 at 01:57 PM EST
4 min read
Source: arXiv - 2512.09923v1

Overview

The paper introduces Splatent, a diffusion‑based post‑processing pipeline that sharpens the output of 3D Gaussian Splatting (3DGS) when the underlying representation lives in the latent space of a pretrained VAE. By moving the fine‑detail recovery step from 3D back to the original 2D image views, the authors achieve far better texture fidelity without sacrificing the speed and scalability of latent‑space radiance fields.

Key Contributions

  • Latent‑space diffusion on top of 3DGS: A novel framework that treats the VAE latent field as a canvas for a diffusion model, preserving the compactness of latent radiance fields while adding high‑frequency detail.
  • Multi‑view attention for 2D detail recovery: Instead of trying to reconstruct missing texture in 3‑D, the method aggregates information from all input views using attention, then injects the recovered detail back into the latent field.
  • State‑of‑the‑art results on standard benchmarks: Splatent outperforms prior VAE‑latent radiance field methods on PSNR, SSIM, and LPIPS, setting a new best‑in‑class mark for sparse‑view novel view synthesis.
  • Plug‑and‑play compatibility: The approach can be attached to existing feed‑forward 3DGS pipelines and consistently improves visual quality with minimal extra compute (see the wrapper sketch after this list).
  • Preservation of pretrained VAE quality: No fine‑tuning of the VAE is required, avoiding the typical trade‑off between multi‑view consistency and reconstruction fidelity.
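To make the plug‑and‑play claim concrete, the sketch below wraps a hypothetical latent‑3DGS render function with a single refinement pass. The names (SplatentRefiner, refine_view, render_fn), the tensor shapes, and the tiny convolutional stand‑in for the diffusion model are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class SplatentRefiner(nn.Module):
    """Hypothetical drop-in refiner: maps a coarse latent render plus
    pooled source-view latents to a sharpened latent render."""
    def __init__(self, latent_dim: int = 4):
        super().__init__()
        # Toy stand-in for the paper's conditional diffusion model.
        self.net = nn.Conv2d(latent_dim * 2, latent_dim, kernel_size=3, padding=1)

    def forward(self, coarse: torch.Tensor, source_latents: torch.Tensor) -> torch.Tensor:
        # coarse: (B, C, H, W); source_latents: (B, V, C, H, W)
        cond = source_latents.mean(dim=1)                      # crude multi-view pooling
        residual = self.net(torch.cat([coarse, cond], dim=1))  # predict missing detail
        return coarse + residual                               # refined latent render


def refine_view(render_fn, refiner, camera, source_latents):
    """Attach the refiner to any existing latent-3DGS render function."""
    coarse = render_fn(camera)               # unchanged base pipeline
    return refiner(coarse, source_latents)   # one extra inference pass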

Methodology

  1. Base 3DGS in latent space – A pretrained VAE encodes the input images into low‑dimensional latent maps, and 3D Gaussian splatting is run in this latent space to obtain a coarse radiance field that can be rendered quickly from any viewpoint.
  2. Diffusion enhancement module – A conditional diffusion model takes the rendered coarse view (still in latent space) and a set of neighboring source views as conditioning.
  3. Multi‑view attention – The conditioning uses a transformer‑style attention block that lets the diffusion model query texture cues from all available views, effectively “borrowing” high‑frequency information that was lost during VAE compression (a minimal sketch of this block follows the list).
  4. Latent update & re‑render – The diffusion step predicts a residual latent map, which is added to the original latent field. The updated latent field is then splatted again, yielding a high‑detail novel view.
  5. Training – The diffusion model is trained on synthetic pairs of (coarse latent render, ground‑truth latent) generated from existing multi‑view datasets. The VAE remains frozen throughout.
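A minimal sketch of the multi‑view attention in step 3, assuming a single cross‑attention layer in which the coarse (or noisy) target‑view latent queries tokens from all source‑view latents. Module names, channel counts, and the residual form are illustrative guesses rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiViewCrossAttention(nn.Module):
    """Sketch: the target-view latent queries texture cues from all source views."""
    def __init__(self, channels: int = 4, dim: int = 128, heads: int = 4):
        super().__init__()
        self.to_tokens = nn.Conv2d(channels, dim, kernel_size=1)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_latent = nn.Conv2d(dim, channels, kernel_size=1)

    def forward(self, target: torch.Tensor, sources: torch.Tensor) -> torch.Tensor:
        # target:  (B, C, H, W)    coarse/noisy latent render of the novel view
        # sources: (B, V, C, H, W) latents of the V input views
        B, V, C, H, W = sources.shape
        q = self.to_tokens(target).flatten(2).transpose(1, 2)            # (B, HW, D)
        kv = self.to_tokens(sources.reshape(B * V, C, H, W))
        kv = kv.flatten(2).transpose(1, 2).reshape(B, V * H * W, -1)     # (B, V*HW, D)
        out, _ = self.attn(q, kv, kv)                                    # borrow detail
        out = out.transpose(1, 2).reshape(B, -1, H, W)
        return target + self.to_latent(out)                              # residual update
```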

The pipeline can be visualized as: Input images → VAE encoder → latent 3DGS → coarse render → diffusion + attention → refined latent → 3DGS render.
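In code, that flow might look like the schematic below. Everything here is a placeholder under assumed interfaces: vae, fit_gaussians, splat_render, update_field, and denoiser stand in for the frozen VAE, the latent Gaussian fitting, the splatting renderer, the latent‑field update, and the conditional diffusion model, and a single denoising call replaces the full sampler.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def splatent_novel_view(images, cameras, target_camera,
                        vae, fit_gaussians, splat_render, update_field, denoiser):
    """Schematic inference pass mirroring the pipeline sketched above.
    images: list of (1, 3, H, W) tensors; all callables are hypothetical placeholders."""
    # 1. Encode the input views with the frozen, pretrained VAE -> (1, V, C, h, w).
    source_latents = torch.stack([vae.encode(im) for im in images], dim=1)

    # 2. Fit a coarse latent Gaussian field and splat it into the target view.
    field = fit_gaussians(source_latents, cameras)
    coarse = splat_render(field, target_camera)          # (1, C, h, w), still latent

    # 3. Diffusion + multi-view attention predicts a residual latent map
    #    (one denoising call shown; the real model runs a full sampler).
    residual = denoiser(coarse, source_latents)

    # 4. Inject the residual back into the latent field, re-splat, and decode.
    field = update_field(field, coarse + residual, target_camera)
    return vae.decode(splat_render(field, target_camera))


def training_step(coarse_latent, gt_latent, source_latents, denoiser, optimizer):
    """Step 5 in miniature: only the denoiser is optimized, the VAE stays frozen.
    A plain latent-space L2 loss stands in for the diffusion noise-prediction loss."""
    pred = coarse_latent + denoiser(coarse_latent, source_latents)
    loss = F.mse_loss(pred, gt_latent)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because supervision lives entirely in latent space and the VAE receives no gradients, the pretrained VAE's reconstruction quality is preserved by construction, which is the trade‑off the paper highlights.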

Results & Findings

Dataset                      PSNR ↑    SSIM ↑    LPIPS ↓
NeRF‑Synthetic (8 views)     31.2      0.94      0.07
Tanks & Temples (sparse)     28.5      0.91      0.09
ScanNet (4‑view)             29.8      0.92      0.08

  • Texture fidelity: Visual comparisons show crisp edges and restored fine patterns (e.g., fabric weave, brick mortar) that were blurred in the baseline latent‑3DGS.
  • Speed: The diffusion step adds ~0.2 s per view on an RTX 4090, still far faster than full‑resolution NeRF training (hours).
  • Robustness to sparsity: Even with as few as 3 input views, Splatent recovers details that other latent‑field methods completely miss.

Overall, Splatent achieves a ~1.5 dB PSNR gain over the strongest prior latent‑radiance approach while keeping the same memory footprint.

Practical Implications

  • Rapid prototyping of AR/VR assets: Developers can generate high‑quality 3D assets from a handful of photos without waiting for lengthy NeRF‑style optimization.
  • Integration with existing pipelines: Since the diffusion module is a drop‑in post‑processor, studios using Gaussian‑Splatting for real‑time rendering can upgrade texture quality with a single extra inference pass.
  • Edge‑device feasibility: The latent representation stays compact, enabling on‑device inference for mobile or embedded AR headsets; only the diffusion step may need to be offloaded to a server (a hypothetical split is sketched after this list).
  • Improved downstream tasks: Better texture reconstruction benefits tasks such as photorealistic relighting, texture‑aware collision detection, and data‑augmentation for computer‑vision models.
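One hypothetical way to realize that edge‑device split: keep the compact latent field and splatting on the headset, and ship only the diffusion refinement to a server. refine_remote below is a placeholder for whatever server call would host the denoiser; nothing in the paper prescribes this deployment.

```python
import torch

def render_on_device(field, camera, splat_render, refine_remote=None):
    """Render the compact latent field locally; optionally offload the
    diffusion refinement to a server before decoding."""
    coarse = splat_render(field, camera)           # cheap, runs on the headset
    if refine_remote is None:
        return coarse                              # latent-only preview
    residual = refine_remote(coarse.cpu())         # heavy diffusion step, off-device
    return coarse + residual.to(coarse.device)     # refined latent, ready to decode
```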

Limitations & Future Work

  • Dependence on view coverage: While Splatent works with very sparse inputs, extreme view gaps (e.g., back of an object never seen) still lead to hallucinations, a known diffusion risk.
  • Computational overhead of diffusion: Although modest, the extra diffusion pass can be a bottleneck for real‑time streaming scenarios; future work could explore lightweight diffusion or distillation.
  • Fixed VAE latent dimensionality: The method assumes a pretrained VAE; exploring joint optimization of VAE and diffusion could further push quality.
  • Generalization to non‑photorealistic domains: The current training data is mostly indoor/outdoor photography; extending to medical imaging or scientific visualization remains an open direction.

Authors

  • Or Hirschorn
  • Omer Sela
  • Inbar Huberman‑Spiegelglas
  • Netalee Efrat
  • Eli Alshan
  • Ianir Ideses
  • Frederic Devernay
  • Yochai Zvik
  • Lior Fritz

Paper Information

  • arXiv ID: 2512.09923v1
  • Categories: cs.CV
  • Published: December 10, 2025