[Paper] ZipSplat: Fewer Gaussians, Better Splats
Source: arXiv - 2606.05102v1
Overview
ZipSplat introduces a token‑based, feed‑forward pipeline for 3D Gaussian splatting that decouples the number of Gaussians from the input image resolution. By clustering visual tokens into a compact set of scene tokens, the method represents a scene with far fewer Gaussians while preserving, or even improving, rendering quality. It does so without ground‑truth camera poses or intrinsics.
Key Contributions
- Token‑driven Gaussian placement: Replaces the naïve “one‑Gaussian‑per‑pixel” strategy with a clustering step that allocates Gaussians where the scene actually needs them.
- Single model for the whole quality‑efficiency curve: Because clustering is performed at inference time, the same trained network can be run with different numbers of clusters, trading speed for fidelity without retraining.
- Pose‑free training: The backbone learns dense visual tokens directly from multi‑view images, eliminating the need for calibrated camera parameters.
- State‑of‑the‑art results with ~6× fewer Gaussians: Sets new benchmarks on DL3DV and RealEstate10K, and generalizes zero‑shot to Mip‑NeRF360 and ScanNet++.
- Lightweight decoding: A small MLP turns each scene token into a small group of Gaussians with unrestricted 3D positions, keeping inference fast.
Methodology
- Multi‑view feature extraction: A shared CNN (or transformer) processes all input images simultaneously, producing dense per‑pixel visual tokens that encode color, texture, and coarse geometry cues.
- K‑means clustering (inference‑time): The dense token map is flattened and clustered into N scene tokens (the user‑controlled budget). This step compresses redundant information—e.g., flat walls—while preserving detail‑rich regions.
- Cross‑ and self‑attention refinement: The scene tokens attend to each other and to the original visual tokens, allowing global context (e.g., object boundaries) to be incorporated.
- Gaussian decoding: A lightweight MLP takes each refined scene token and predicts a small set of 3D Gaussians (position, covariance, color, opacity). The Gaussians are unconstrained—they can sit anywhere in space, not just on a pixel grid.
- Rendering: Standard splatting renders the Gaussian cloud from arbitrary viewpoints, yielding a novel‑view image.
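The refinement and decoding steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' architecture: the attention has a single head and no learned projections, the MLP weights are random, and the 14‑value Gaussian layout (position, log‑scale, rotation quaternion, RGB, opacity) along with names such as `attend` and `decode_gaussians` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attend(queries, context):
    """Single-head attention without learned projections (assumption), with a residual."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])  # (N, M)
    return queries + softmax(scores, axis=-1) @ context       # (N, D)

def quat_to_rot(q):
    """Convert (G, 4) quaternions in (w, x, y, z) order to (G, 3, 3) rotation matrices."""
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)
    w, x, y, z = np.moveaxis(q, -1, 0)
    return np.stack([
        np.stack([1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)], axis=-1),
        np.stack([2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)], axis=-1),
        np.stack([2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)], axis=-1),
    ], axis=-2)

def decode_gaussians(tokens, w1, b1, w2, b2):
    """Two-layer MLP: each token yields a small group of Gaussians, 14 raw values each."""
    hidden = np.maximum(tokens @ w1 + b1, 0.0)       # ReLU
    raw = (hidden @ w2 + b2).reshape(-1, 14)         # (N * per_token, 14)
    scale = np.exp(raw[:, 3:6])                      # strictly positive axis lengths
    rot = quat_to_rot(raw[:, 6:10])
    m = rot * scale[:, None, :]                      # R @ diag(scale)
    return {
        "pos": raw[:, 0:3],                          # unconstrained 3D positions
        "cov": m @ np.swapaxes(m, -1, -2),           # R diag(scale^2) R^T
        "color": sigmoid(raw[:, 10:13]),
        "opacity": sigmoid(raw[:, 13]),
    }

# Hypothetical sizes: 8 scene tokens, 128 visual tokens, 4 Gaussians per token.
rng = np.random.default_rng(0)
N, M, D, H, K = 8, 128, 32, 64, 4
scene = rng.normal(size=(N, D))
visual = rng.normal(size=(M, D))
w1, b1 = rng.normal(size=(D, H)) * 0.1, np.zeros(H)
w2, b2 = rng.normal(size=(H, K * 14)) * 0.1, np.zeros(K * 14)

refined = attend(scene, visual)      # cross-attention to the visual tokens
refined = attend(refined, refined)   # self-attention among scene tokens
g = decode_gaussians(refined, w1, b1, w2, b2)
print(g["pos"].shape, g["cov"].shape)
```

Building each covariance as R diag(s²) Rᵀ keeps it symmetric and positive definite by construction, which is the usual parameterization in Gaussian splatting; whether ZipSplat uses exactly this form is not stated in this summary.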
Because the clustering step has no learned parameters, developers can change the number of clusters at inference time to meet memory or latency constraints, without retraining.
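As a rough illustration of that budget knob, the sketch below clusters the same set of tokens with three different cluster counts using plain Lloyd's K‑means. The random `visual_tokens` array stands in for backbone output, and the Euclidean distance, random initialization, and iteration count are assumptions rather than details confirmed by the paper.

```python
import numpy as np

def kmeans_tokens(tokens, num_clusters, iters=10, seed=0):
    """Cluster (M, D) visual tokens into (num_clusters, D) scene tokens (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen distinct tokens.
    start = rng.choice(len(tokens), size=num_clusters, replace=False)
    centroids = tokens[start].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every token to every centroid: (M, N).
        d2 = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for k in range(num_clusters):
            members = tokens[assign == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties out
                centroids[k] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    d2 = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, d2.argmin(axis=1)

# Hypothetical stand-in for backbone output: 2 views of 16x16 tokens, 32 channels.
rng = np.random.default_rng(0)
visual_tokens = rng.normal(size=(2 * 16 * 16, 32)).astype(np.float32)

# Same tokens, three budgets: only the cluster count changes.
for budget in (16, 64, 256):
    scene_tokens, assignment = kmeans_tokens(visual_tokens, budget)
    print(budget, scene_tokens.shape, assignment.shape)
```

Only `num_clusters` varies between runs, which mirrors the claim that one trained model covers the whole quality‑efficiency curve.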
Results & Findings
| Dataset | PSNR (dB) | Gaussian count / model | Outcome |
|---|---|---|---|
| DL3DV | +2.1 over best pose‑free baseline | ~6× fewer | New SOTA |
| RealEstate10K | +1.2 over best pose‑free baseline | ~6× fewer | New SOTA |
| Mip‑NeRF360 (zero‑shot) | Competitive / superior to baselines | Same model | Demonstrates strong generalization |
| ScanNet++ (zero‑shot) | Competitive | Same model | Shows robustness to indoor scans |
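To make the "~6× fewer" figure concrete: a pixel‑aligned model emits one Gaussian per input pixel, so its count grows with the number of views and the image resolution. The short calculation below uses an illustrative view count and resolution, which are assumptions and not the paper's evaluation settings.

```python
def pixel_aligned_count(num_views, height, width):
    """One Gaussian per input pixel across all views."""
    return num_views * height * width

def reduced_count(baseline, reduction_factor):
    """Gaussian count after dividing the baseline by a reduction factor."""
    return round(baseline / reduction_factor)

# Illustrative setting (assumption): 4 input views at 256x256.
baseline = pixel_aligned_count(num_views=4, height=256, width=256)
compact = reduced_count(baseline, reduction_factor=6)
print(baseline, compact)  # 262144 43691
```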
Key takeaways
- Quality does not degrade at the reported ~6× reduction in Gaussians; in many cases it improves because the representation focuses on geometrically important regions.
- Inference flexibility: By varying the number of clusters, developers can trade off rendering speed vs. visual fidelity on the fly.
- No pose requirement simplifies data collection pipelines for AR/VR or robotics where accurate calibration is hard to obtain.
Practical Implications
- Faster, lighter 3D assets for AR/VR: Generate high‑quality Gaussian splats compact enough to target mobile GPUs, enabling real‑time view synthesis with a fraction of the memory footprint of pixel‑aligned splatting pipelines.
- Simplified capture pipelines: ZipSplat works without known camera poses, allowing hobbyist photogrammetry apps to skip calibration steps and lower the barrier for user‑generated 3D content.
- Scalable cloud rendering: Cloud services can allocate fewer Gaussians per scene when serving many concurrent users, reducing bandwidth and compute costs while preserving visual quality.
- Dynamic level‑of‑detail (LOD): The clustering budget can be adjusted per frame or per device, making adaptive LOD strategies straightforward to implement in games or simulations.
- Cross‑domain transfer: Zero‑shot performance on unseen datasets suggests a single pre‑trained ZipSplat model could be shipped with SDKs, handling a wide variety of indoor/outdoor environments out of the box.
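One way the adjustable budget could feed an LOD policy is to pick the largest cluster count whose Gaussians fit a device's memory allowance. The sketch below is hypothetical throughout: the 14‑float Gaussian layout, the Gaussians‑per‑token value, and the candidate budgets are assumptions for illustration, not values from the paper.

```python
# Assumed layout: position (3), scale (3), rotation quaternion (4), RGB (3), opacity (1).
FLOATS_PER_GAUSSIAN = 3 + 3 + 4 + 3 + 1
BYTES_PER_FLOAT = 4  # float32

def max_scene_tokens(memory_budget_bytes, gaussians_per_token):
    """Largest number of scene tokens whose decoded Gaussians fit in the budget."""
    bytes_per_token = gaussians_per_token * FLOATS_PER_GAUSSIAN * BYTES_PER_FLOAT
    return memory_budget_bytes // bytes_per_token

def pick_budget(memory_budget_bytes, gaussians_per_token, candidates):
    """Choose the largest candidate cluster count that fits the memory budget."""
    limit = max_scene_tokens(memory_budget_bytes, gaussians_per_token)
    fitting = [c for c in candidates if c <= limit]
    if not fitting:
        raise ValueError("no candidate budget fits in the given memory")
    return max(fitting)

# Hypothetical device: 8 MiB reserved for Gaussians, 4 Gaussians per scene token.
budget = pick_budget(8 * 1024 * 1024, 4, candidates=(1024, 4096, 16384, 65536))
print(budget)  # 16384
```

The same function could be called per device class or per frame; the clustering cost of switching budgets at runtime is a separate consideration.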
Limitations & Future Work
- Clustering overhead: The K‑means step runs at inference time and adds a non‑trivial cost, which may be a bottleneck for ultra‑low‑latency applications.
- Single token resolution: The current backbone produces tokens at a single resolution; exploring multi‑scale tokens could further improve detail capture on very large scenes.
- Handling extreme view extrapolation: Like most splatting methods, ZipSplat may struggle with viewpoints far outside the frusta of the input views, where Gaussian density becomes sparse.
- Future directions: The authors suggest integrating learned clustering (e.g., differentiable pooling) to eliminate the explicit K‑means step, and extending the framework to support dynamic scenes or temporal consistency for video‑based capture.
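As one possible reading of "learned clustering", hard K‑means assignments can be relaxed into softmax weights so that the pooled scene tokens are a differentiable function of the inputs. The sketch below shows only that relaxation in NumPy; it is not the authors' proposal, and the `temperature` parameter and `soft_pool` name are hypothetical.

```python
import numpy as np

def soft_pool(tokens, centroids, temperature=1.0):
    """Soft assignment of (M, D) tokens to (N, D) centroids, then weighted pooling."""
    # Squared Euclidean distance from every token to every centroid: (M, N).
    d2 = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)      # each token's weights sum to 1
    # Weighted mean of tokens per cluster; epsilon guards against empty clusters.
    pooled = weights.T @ tokens / (weights.sum(axis=0)[:, None] + 1e-8)
    return pooled, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 8))
centroids = tokens[:4].copy()   # in a learned variant these would be trainable
pooled, weights = soft_pool(tokens, centroids)
print(pooled.shape, weights.shape)
```

As `temperature` approaches zero the weights approach hard assignments, so this reduces to a single K‑means update step in the limit.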
Authors
- Alexander Veicht
- Sunghwan Hong
- Dániel Baráth
- Marc Pollefeys
Paper Information
- arXiv ID: 2606.05102v1
- Categories: cs.CV
- Published: June 3, 2026