[Paper] ZipSplat: Fewer Gaussians, Better Splats

Published: June 3, 2026 at 01:04 PM EDT

Source: arXiv - 2606.05102v1

Overview

ZipSplat introduces a token‑based, feed‑forward pipeline for 3D Gaussian splatting that decouples the number of Gaussians from the input image resolution. By clustering visual tokens into a compact set of scene tokens, the method can represent a scene with far fewer Gaussians while preserving—or even improving—rendering quality, all without needing ground‑truth camera poses or intrinsics.

Key Contributions

  • Token‑driven Gaussian placement: Replaces the naïve “one‑Gaussian‑per‑pixel” strategy with a clustering step that allocates Gaussians where the scene actually needs them.
  • Single model for the whole quality‑efficiency curve: Because clustering is performed at inference time, the same trained network can be run with different numbers of clusters, trading speed for fidelity without retraining.
  • Pose‑free training: The backbone learns dense visual tokens directly from multi‑view images, eliminating the need for calibrated camera parameters.
  • State‑of‑the‑art results with ~6× fewer Gaussians: Sets new benchmarks on DL3DV and RealEstate10K, and generalizes zero‑shot to Mip‑NeRF360 and ScanNet++.
  • Lightweight decoding: A small MLP turns each scene token into a small group of Gaussians with unrestricted 3D positions, keeping inference fast.
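The decoupling described above comes down to simple counting, sketched here in Python. The function names and all numbers (view count, resolutions, token budget, Gaussians per token) are illustrative assumptions, not values from the paper.

```python
def pixel_aligned_gaussians(num_views: int, height: int, width: int) -> int:
    """One Gaussian per input pixel: the count grows with image resolution."""
    return num_views * height * width


def token_based_gaussians(num_scene_tokens: int, gaussians_per_token: int) -> int:
    """The count depends only on the clustering budget, not on resolution."""
    return num_scene_tokens * gaussians_per_token


# Illustrative numbers only (not taken from the paper).
low_res = pixel_aligned_gaussians(num_views=4, height=256, width=256)
high_res = pixel_aligned_gaussians(num_views=4, height=512, width=512)
budget = token_based_gaussians(num_scene_tokens=8192, gaussians_per_token=4)

print(low_res, high_res, budget)  # 262144 1048576 32768
```

Doubling the resolution quadruples the pixel-aligned count, while the token-based count stays fixed until the budget itself is changed.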

Methodology

  1. Multi‑view feature extraction: A shared CNN (or transformer) processes all input images simultaneously, producing dense per‑pixel visual tokens that encode color, texture, and coarse geometry cues.
  2. K‑means clustering (inference‑time): The dense token map is flattened and clustered into N scene tokens (the user‑controlled budget). This step compresses redundant information—e.g., flat walls—while preserving detail‑rich regions.
  3. Cross‑ and self‑attention refinement: The scene tokens attend to each other and to the original visual tokens, allowing global context (e.g., object boundaries) to be incorporated.
  4. Gaussian decoding: A lightweight MLP takes each refined scene token and predicts a small set of 3D Gaussians (position, covariance, color, opacity). The Gaussians are unconstrained—they can sit anywhere in space, not just on a pixel grid.
  5. Rendering: Standard splatting renders the Gaussian cloud from arbitrary viewpoints, yielding a novel‑view image.
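Steps 2 to 4 above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the visual tokens and MLP weights are random stand-ins, the attention has a single head and no learned projections, and the 14-value Gaussian layout (position, log-scale, quaternion, color, opacity) follows common 3D Gaussian splatting conventions rather than anything stated in the post. Rendering (step 5) is omitted.

```python
import numpy as np


def kmeans(tokens, num_clusters, num_iters=10, seed=0):
    """Plain Lloyd's k-means; returns centroids and the last assignments."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen tokens.
    init = rng.choice(len(tokens), size=num_clusters, replace=False)
    centroids = tokens[init].copy()
    for _ in range(num_iters):
        # Squared distances, shape (num_tokens, num_clusters).
        d2 = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        for k in range(num_clusters):
            members = tokens[assign == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return centroids, assign


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, context):
    """Single-head attention with a residual, no learned projections (simplification)."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return queries + softmax(scores) @ context


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def decode_gaussians(scene_tokens, w1, w2, gaussians_per_token):
    """Two-layer MLP: each scene token yields a small group of Gaussians."""
    hidden = np.maximum(scene_tokens @ w1, 0.0)  # ReLU
    raw = (hidden @ w2).reshape(len(scene_tokens) * gaussians_per_token, 14)
    quat = raw[:, 6:10]
    return {
        "position": raw[:, 0:3],            # unconstrained xyz
        "scale": np.exp(raw[:, 3:6]),       # positive axis scales
        "rotation": quat / (np.linalg.norm(quat, axis=1, keepdims=True) + 1e-8),
        "color": sigmoid(raw[:, 10:13]),    # RGB in (0, 1)
        "opacity": sigmoid(raw[:, 13]),     # in (0, 1)
    }


rng = np.random.default_rng(0)
dim, hidden_dim, per_token = 16, 32, 4
visual_tokens = rng.normal(size=(1024, dim))       # stand-in for backbone output
w1 = rng.normal(size=(dim, hidden_dim)) * 0.1      # random stand-in weights
w2 = rng.normal(size=(hidden_dim, per_token * 14)) * 0.1

scene_tokens, _ = kmeans(visual_tokens, num_clusters=32)   # step 2
scene_tokens = attend(scene_tokens, visual_tokens)          # step 3: cross-attention
scene_tokens = attend(scene_tokens, scene_tokens)           # step 3: self-attention
gaussians = decode_gaussians(scene_tokens, w1, w2, per_token)  # step 4

print(gaussians["position"].shape)  # (128, 3)
```

With 32 scene tokens and 4 Gaussians per token, the output holds 128 Gaussians regardless of how many visual tokens went in.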

Because the number of clusters is chosen at inference time rather than fixed during training, developers can change it to meet memory or latency constraints without retraining.
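A small sketch of why one set of weights can serve several budgets: a per-token decoder is applied independently to each scene token, so it accepts any number of them. The decoder here is a single random linear layer, and the scene tokens are random stand-ins for clustered tokens; both are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, per_token, params = 16, 4, 14

# Fixed decoder weights, standing in for a trained model (random here).
weights = rng.normal(size=(dim, per_token * params))


def decode(scene_tokens):
    """Apply the same weights to every token; one row per output Gaussian."""
    return (scene_tokens @ weights).reshape(-1, params)


counts = {}
for budget in (64, 256, 1024):
    scene_tokens = rng.normal(size=(budget, dim))  # stand-in for clustered tokens
    counts[budget] = decode(scene_tokens).shape[0]

print(counts)  # {64: 256, 256: 1024, 1024: 4096}
```

Only the cluster count changes between runs; the weights are never touched.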

Results & Findings

Dataset                 | PSNR (dB)                           | Gaussians (× fewer) | Relative gain vs. pixel‑aligned
DL3DV                   | +2.1 over best pose‑free baseline   | ~6× fewer           | New SOTA
RealEstate10K           | +1.2 over best pose‑free baseline   | ~6× fewer           | New SOTA
Mip‑NeRF360 (zero‑shot) | Competitive / superior to baselines | Same model          | Demonstrates strong generalization
ScanNet++ (zero‑shot)   | Competitive                         | Same model          | Shows robustness to indoor scans

Key takeaways

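  • Quality holds up with fewer Gaussians: the reported results use ~6× fewer Gaussians than pixel‑aligned baselines without a drop in quality, and in many cases quality improves because the representation focuses on geometrically important regions.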
  • Inference flexibility: By varying the number of clusters, developers can trade off rendering speed vs. visual fidelity on the fly.
  • No pose requirement simplifies data collection pipelines for AR/VR or robotics where accurate calibration is hard to obtain.

Practical Implications

  • Faster, lighter 3D assets for AR/VR: Generate high‑quality Gaussian splats that fit comfortably on mobile GPUs, enabling real‑time view synthesis with a fraction of the memory footprint of traditional NeRF‑style pipelines.
  • Simplified capture pipelines: ZipSplat works without known camera poses, allowing hobbyist photogrammetry apps to skip calibration steps and lower the barrier for user‑generated 3D content.
  • Scalable cloud rendering: Cloud services can allocate fewer Gaussians per scene when serving many concurrent users, reducing bandwidth and compute costs while preserving visual quality.
  • Dynamic level‑of‑detail (LOD): The clustering budget can be adjusted per frame or per device, making adaptive LOD strategies straightforward to implement in games or simulations.
  • Cross‑domain transfer: Zero‑shot performance on unseen datasets suggests a single pre‑trained ZipSplat model could be shipped with SDKs, handling a wide variety of indoor/outdoor environments out of the box.
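The level-of-detail idea above could be wired up as a budget picker like the one below. This is a hypothetical helper, not part of any ZipSplat release: the function name, the candidate budgets, and the 14-float Gaussian layout (xyz, scale, quaternion, RGB, opacity, stored as 32-bit floats) are all assumptions.

```python
FLOATS_PER_GAUSSIAN = 14  # assumption: xyz(3) + scale(3) + quat(4) + rgb(3) + opacity(1)
BYTES_PER_FLOAT = 4       # assumption: float32 storage


def pick_cluster_budget(memory_bytes, gaussians_per_token, candidates):
    """Return the largest candidate cluster count whose Gaussians fit in memory."""
    bytes_per_token = gaussians_per_token * FLOATS_PER_GAUSSIAN * BYTES_PER_FLOAT
    fitting = [n for n in candidates if n * bytes_per_token <= memory_bytes]
    if not fitting:
        raise ValueError("no candidate budget fits in the given memory")
    return max(fitting)


candidates = [1024, 4096, 16384, 65536]  # illustrative budgets
mobile = pick_cluster_budget(2 * 1024**2, gaussians_per_token=4, candidates=candidates)
desktop = pick_cluster_budget(64 * 1024**2, gaussians_per_token=4, candidates=candidates)

print(mobile, desktop)  # 4096 65536
```

A real deployment would probably also account for latency and for any extra per-Gaussian attributes such as spherical-harmonic coefficients.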

Limitations & Future Work

  • Clustering overhead: Although lightweight, the K‑means step adds a non‑trivial CPU cost at inference, which may be a bottleneck for ultra‑low‑latency applications.
  • Fixed token resolution: The current backbone produces tokens at a single resolution; exploring multi‑scale tokens could further improve detail capture on very large scenes.
  • Handling extreme view extrapolation: Like most splatting methods, ZipSplat may struggle with viewpoints far outside the training camera frustum, where Gaussian density becomes sparse.
  • Future directions: The authors suggest integrating learned clustering (e.g., differentiable pooling) to eliminate the explicit K‑means step, and extending the framework to support dynamic scenes or temporal consistency for video‑based capture.
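To make the "learned clustering" direction concrete, here is one generic form of differentiable pooling: a softmax soft assignment followed by a weighted mean. It is a sketch of the general technique only, not the authors' proposal; the assignment weights are random stand-ins for parameters that would be learned.

```python
import numpy as np


def soft_pool(tokens, assign_weights):
    """Soft-assign tokens to clusters; return weighted-mean scene tokens and the assignment."""
    logits = tokens @ assign_weights                   # (num_tokens, num_clusters)
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)          # each token's weights sum to 1
    mass = probs.sum(axis=0)                           # soft cluster sizes
    pooled = (probs.T @ tokens) / (mass[:, None] + 1e-8)
    return pooled, probs


rng = np.random.default_rng(0)
tokens = rng.normal(size=(512, 16))
assign_weights = rng.normal(size=(16, 32))  # would be learned; random here
scene_tokens, probs = soft_pool(tokens, assign_weights)

print(scene_tokens.shape)  # (32, 16)
```

Every operation here is differentiable, unlike the hard argmin in K‑means. Note that a fixed projection like this ties the cluster count to the weight shape, so a practical design would need some other way to keep the budget adjustable at inference time.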

Authors

  • Alexander Veicht
  • Sunghwan Hong
  • Dániel Baráth
  • Marc Pollefeys

Paper Information

  • arXiv ID: 2606.05102v1
  • Categories: cs.CV
  • Published: June 3, 2026