[Paper] ZipSplat: Fewer Gaussians, Better Splats

Published: June 3, 2026 at 01:04 PM EDT

Source: arXiv - 2606.05102v1

Overview

ZipSplat introduces a token‑based, feed‑forward pipeline for 3D Gaussian splatting that decouples the number of Gaussians from the input image resolution. By clustering visual tokens into a compact set of scene tokens, the method represents a scene with far fewer Gaussians while preserving, and on the reported benchmarks improving, rendering quality. It does not need ground‑truth camera poses or intrinsics.

Key Contributions

Token‑driven Gaussian placement: Replaces the naïve “one‑Gaussian‑per‑pixel” strategy with a clustering step that allocates Gaussians where the scene actually needs them.
Single model for the whole quality‑efficiency curve: Because clustering is performed at inference time, the same trained network can be run with different numbers of clusters, trading speed for fidelity without retraining.
Pose‑free training: The backbone learns dense visual tokens directly from multi‑view images, eliminating the need for calibrated camera parameters.
State‑of‑the‑art results with ~6× fewer Gaussians: Sets new benchmarks on DL3DV and RealEstate10K, and generalizes zero‑shot to Mip‑NeRF360 and ScanNet++.
Lightweight decoding: A small MLP turns each scene token into a small group of Gaussians with unrestricted 3D positions, keeping inference fast.
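
To make the decoupling concrete, the following back‑of‑the‑envelope sketch compares the Gaussian count of a one‑Gaussian‑per‑pixel predictor with that of a token‑based one. The view count, resolutions, token budget, and group size are hypothetical values chosen for illustration, not figures from the paper.

```python
def pixel_aligned_gaussians(num_views: int, height: int, width: int) -> int:
    """Gaussian count when every input pixel gets exactly one Gaussian."""
    return num_views * height * width


def token_gaussians(num_scene_tokens: int, gaussians_per_token: int) -> int:
    """Gaussian count when each scene token decodes to a fixed-size group."""
    return num_scene_tokens * gaussians_per_token


# Hypothetical settings, used only to show how each count scales.
for height, width in [(256, 256), (512, 512)]:
    dense = pixel_aligned_gaussians(num_views=4, height=height, width=width)
    compact = token_gaussians(num_scene_tokens=4096, gaussians_per_token=8)
    print(f"{height}x{width}: pixel-aligned={dense}, token-based={compact}")
```

Doubling the resolution quadruples the pixel‑aligned count, while the token‑based count depends only on the chosen budget.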

Methodology

Multi‑view feature extraction: A shared CNN (or transformer) processes all input images simultaneously, producing dense per‑pixel visual tokens that encode color, texture, and coarse geometry cues.
K‑means clustering (inference‑time): The dense token map is flattened and clustered into N scene tokens, where N is the user‑controlled budget. This step compresses redundant regions (e.g., flat walls) while preserving detail‑rich ones.
Cross‑ and self‑attention refinement: The scene tokens attend to each other and to the original visual tokens, allowing global context (e.g., object boundaries) to be incorporated.
Gaussian decoding: A lightweight MLP takes each refined scene token and predicts a small set of 3D Gaussians, each described by position, covariance, color, and opacity. The Gaussians are unconstrained: they can sit anywhere in space, not just on a pixel grid.
Rendering: Standard splatting renders the Gaussian cloud from arbitrary viewpoints, yielding a novel‑view image.
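
The Gaussian decoding step can be sketched as a tiny two‑layer MLP that maps one scene token to a group of Gaussians. This is a minimal illustration, not the authors' architecture: the layer sizes, the random weights, and the 14‑value parameter layout are all assumptions, and the covariance is expressed through a scale vector and a rotation quaternion, a common convention in Gaussian splatting that the summary does not confirm for ZipSplat.

```python
import math
import random

# Assumed layout per Gaussian (not taken from the paper):
# 3 position + 3 log-scale + 4 rotation quaternion + 1 opacity + 3 RGB.
PARAMS_PER_GAUSSIAN = 14


def make_layer(rng, n_in, n_out):
    """Random dense layer: one weight row per output unit, zero bias."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Apply a dense layer to the input vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_token(token, hidden_layer, output_layer, gaussians_per_token):
    """Two-layer MLP that maps one scene token to a group of Gaussians."""
    hidden = [max(0.0, h) for h in dense(token, hidden_layer)]  # ReLU
    raw = dense(hidden, output_layer)
    gaussians = []
    for g in range(gaussians_per_token):
        p = raw[g * PARAMS_PER_GAUSSIAN:(g + 1) * PARAMS_PER_GAUSSIAN]
        quat = p[6:10]
        norm = math.sqrt(sum(q * q for q in quat)) or 1.0  # avoid division by zero
        gaussians.append({
            "position": p[0:3],                       # unconstrained 3D location
            "scale": [math.exp(s) for s in p[3:6]],   # strictly positive extents
            "rotation": [q / norm for q in quat],     # normalized quaternion
            "opacity": sigmoid(p[10]),                # squashed into (0, 1)
            "color": [sigmoid(c) for c in p[11:14]],  # RGB squashed into (0, 1)
        })
    return gaussians


rng = random.Random(0)
token_dim, hidden_dim, group_size = 16, 32, 4  # hypothetical sizes
hidden_layer = make_layer(rng, token_dim, hidden_dim)
output_layer = make_layer(rng, hidden_dim, group_size * PARAMS_PER_GAUSSIAN)
token = [rng.gauss(0.0, 1.0) for _ in range(token_dim)]
gaussians = decode_token(token, hidden_layer, output_layer, group_size)
print(len(gaussians), "Gaussians decoded from one scene token")
```

Because the position is a raw network output rather than a depth along a pixel ray, nothing ties a Gaussian to the image grid.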

Because the clustering step is separate from the learned network, developers can simply change the number of clusters to meet memory or latency constraints.
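
As a rough illustration of how the cluster count acts as a budget knob, here is a plain Lloyd's‑algorithm K‑means in standard‑library Python. It is a sketch under stated assumptions: the 2D toy tokens, the random initialization, and the fixed iteration count are illustrative choices, and a real system would presumably run a batched GPU implementation on high‑dimensional tokens.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(tokens, num_clusters, num_iters=10, seed=0):
    """Cluster token vectors; return (centroids, cluster index per token)."""
    rng = random.Random(seed)
    # Initialize centroids from distinct randomly chosen tokens.
    centroids = [list(t) for t in rng.sample(tokens, num_clusters)]
    assignments = [0] * len(tokens)
    for _ in range(num_iters):
        # Assignment step: each token joins its nearest centroid.
        assignments = [
            min(range(num_clusters), key=lambda k: squared_distance(t, centroids[k]))
            for t in tokens
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(num_clusters):
            members = [t for t, a in zip(tokens, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                dim = len(members[0])
                centroids[k] = [
                    sum(m[d] for m in members) / len(members) for d in range(dim)
                ]
    return centroids, assignments


# Toy "visual tokens": three blobs in 2D (hypothetical data).
rng = random.Random(1)
tokens = [
    [rng.gauss(cx, 0.2), rng.gauss(cy, 0.2)]
    for cx, cy in [(0, 0), (5, 5), (0, 5)]
    for _ in range(20)
]

# The same tokens can be compressed to different budgets without retraining.
for budget in (2, 3, 6):
    scene_tokens, _ = kmeans(tokens, budget)
    print(f"budget={budget}: {len(scene_tokens)} scene tokens")
```

Only `num_clusters` changes between runs; the tokens, and by analogy the trained network that produced them, stay the same.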

Results & Findings

Dataset	PSNR (dB)	Gaussians / setup	Takeaway
DL3DV	+2.1 over best pose‑free baseline	~6× fewer	New SOTA
RealEstate10K	+1.2 over best pose‑free baseline	~6× fewer	New SOTA
Mip‑NeRF360 (zero‑shot)	Competitive with or superior to baselines	Same model	Demonstrates strong generalization
ScanNet++ (zero‑shot)	Competitive	Same model	Shows robustness to indoor scans
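
For readers less familiar with the metric, PSNR is defined as 10·log10(MAX² / MSE), so a fixed gain in dB corresponds to a fixed ratio of mean squared errors. The sketch below computes PSNR for flat pixel lists and converts a dB gain into that ratio. The helper names are hypothetical, and the conversion assumes both methods are evaluated on the same images.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """MSE of the better method divided by MSE of the baseline for a dB gain."""
    return 10 ** (-gain_db / 10)


print(f"PSNR of a toy pair: {psnr([0.0, 1.0], [0.0, 0.9]):.2f} dB")
for gain in (2.1, 1.2):
    print(f"+{gain} dB -> MSE ratio {mse_ratio_for_gain(gain):.3f}")
```

Under that assumption, +2.1 dB corresponds to roughly 38% lower mean squared error, and +1.2 dB to roughly 24% lower.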

Key takeaways

Quality: On the reported benchmarks, quality does not degrade with ~6× fewer Gaussians; in many cases it improves because the representation concentrates Gaussians on geometrically important regions.
Inference flexibility: By varying the number of clusters, developers can trade off rendering speed vs. visual fidelity on the fly.
No pose requirement: Not needing camera poses simplifies data collection pipelines for AR/VR or robotics, where accurate calibration is hard to obtain.

Practical Implications

Faster, lighter 3D assets for AR/VR: Generate high‑quality Gaussian splats with roughly 6× fewer primitives than pixel‑aligned predictors, which shrinks the memory footprint and makes real‑time view synthesis on mobile GPUs more practical.
Simplified capture pipelines: ZipSplat works without known camera poses, allowing hobbyist photogrammetry apps to skip calibration steps and lower the barrier for user‑generated 3D content.
Scalable cloud rendering: Cloud services can allocate fewer Gaussians per scene when serving many concurrent users, reducing bandwidth and compute costs while preserving visual quality.
Dynamic level‑of‑detail (LOD): The clustering budget can be adjusted per frame or per device, making adaptive LOD strategies straightforward to implement in games or simulations.
Cross‑domain transfer: Zero‑shot performance on unseen datasets suggests a single pre‑trained ZipSplat model could be shipped with SDKs, handling a wide variety of indoor/outdoor environments out of the box.
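
One way the per‑device budget idea could be wired up is to derive the largest affordable cluster count from a memory limit. The sketch below is hypothetical: the 14 floats per Gaussian, the 4‑byte floats, the group size of 8, and the candidate budgets are illustrative assumptions rather than values from the paper.

```python
def max_scene_tokens(memory_budget_bytes, gaussians_per_token,
                     floats_per_gaussian=14, bytes_per_float=4):
    """Largest token count whose decoded Gaussians fit in the memory budget."""
    bytes_per_token = gaussians_per_token * floats_per_gaussian * bytes_per_float
    return memory_budget_bytes // bytes_per_token


def pick_budget(memory_budget_bytes, candidate_budgets, gaussians_per_token):
    """Choose the largest candidate budget that fits; fall back to the smallest."""
    limit = max_scene_tokens(memory_budget_bytes, gaussians_per_token)
    feasible = [n for n in candidate_budgets if n <= limit]
    return max(feasible) if feasible else min(candidate_budgets)


candidates = [1024, 4096, 16384, 65536]  # hypothetical LOD levels
for name, budget_bytes in [("phone", 1 * 1024 * 1024), ("desktop", 16 * 1024 * 1024)]:
    chosen = pick_budget(budget_bytes, candidates, gaussians_per_token=8)
    print(f"{name}: {chosen} scene tokens")
```

A per‑frame variant would call `pick_budget` with a latency‑derived limit instead of a memory one.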

Limitations & Future Work

Clustering overhead: Although lightweight, the K‑means step adds a non‑trivial CPU cost at inference, which may be a bottleneck for ultra‑low‑latency applications.
Fixed token dimensionality: The current backbone produces a single token resolution; exploring multi‑scale tokens could further improve detail capture on very large scenes.
Handling extreme view extrapolation: Like most splatting methods, ZipSplat may struggle with viewpoints far outside the frusta of the input views, where Gaussian density becomes sparse.
Future directions: The authors suggest integrating learned clustering (e.g., differentiable pooling) to eliminate the explicit K‑means step, and extending the framework to support dynamic scenes or temporal consistency for video‑based capture.
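
To give a flavor of what a learned, differentiable replacement for hard K‑means could look like, the sketch below uses softmax assignments over distances to cluster centers. This is a generic soft K‑means pooling step, not the authors' proposal; the temperature parameter, the toy data, and the function names are assumptions.

```python
import math


def soft_assign(token, centroids, temperature=1.0):
    """Softmax over negative squared distances: a soft cluster membership."""
    logits = [
        -sum((x - c) ** 2 for x, c in zip(token, centroid)) / temperature
        for centroid in centroids
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_pool(tokens, centroids, temperature=1.0):
    """Pool tokens into scene tokens as membership-weighted means."""
    weights = [soft_assign(t, centroids, temperature) for t in tokens]
    dim = len(tokens[0])
    pooled = []
    for k, centroid in enumerate(centroids):
        mass = sum(w[k] for w in weights)
        if mass > 0:
            pooled.append([
                sum(w[k] * t[d] for w, t in zip(weights, tokens)) / mass
                for d in range(dim)
            ])
        else:  # no token has measurable membership; keep the centroid
            pooled.append(list(centroid))
    return pooled


tokens = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids = [[0.0, 0.0], [10.0, 10.0]]
print(soft_pool(tokens, centroids, temperature=1.0))
```

Every operation here is smooth in the tokens and centroids, so gradients could flow through the pooling. As the temperature approaches zero, the result approaches a hard K‑means update.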

Authors

Alexander Veicht
Sunghwan Hong
Dániel Baráth
Marc Pollefeys

Paper Information

arXiv ID: 2606.05102v1
Categories: cs.CV
Published: June 3, 2026
