[Paper] ReCoSplat: Autoregressive Feed-Forward Gaussian Splatting Using Render-and-Compare
Source: arXiv - 2603.09968v1
Overview
ReCoSplat is a new feed‑forward model for online novel‑view synthesis that can ingest video streams with or without known camera poses or intrinsics. By combining an autoregressive Gaussian‑splatting backbone with a clever “render‑and‑compare” feedback loop, the system stays stable even when the pose estimates it relies on are noisy—a common problem for real‑world AR/VR pipelines.
Key Contributions
- Autoregressive Gaussian Splatting for unposed inputs – works with raw video frames, estimating camera pose on the fly.
- Render‑and‑Compare (ReCo) module – renders the current scene from the predicted viewpoint, compares it to the incoming frame, and uses the residual as a conditioning signal to correct pose drift during inference.
- Hybrid KV‑cache compression – a two‑stage memory‑saving scheme (early‑layer truncation + chunk‑level selective retention) that cuts the transformer‑style key‑value cache by >90 % for sequences longer than 100 frames.
- State‑of‑the‑art results on both in‑distribution (e.g., LLFF, Tanks‑and‑Temples) and out‑of‑distribution benchmarks, across four input configurations (posed/unposed, with/without intrinsics).
- Open‑source release of code and pretrained models, facilitating rapid adoption.
Methodology
- Gaussian Splatting Backbone – The scene is represented as a set of 3D Gaussians whose attributes (position, covariance, color, opacity) are predicted by a lightweight feed‑forward network. Unlike NeRF‑style volumetric rendering, splatting is fast and naturally supports incremental updates.
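To make the four predicted attributes concrete, here is a minimal sketch of the primitive and a toy prediction head. The linear projection and activations below are stand-ins of my own, not the paper's architecture, which the summary does not detail:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    """One splatting primitive: the four attributes the backbone predicts."""
    mean: np.ndarray   # (3,) world-space position
    cov: np.ndarray    # (3, 3) covariance (anisotropic spatial extent)
    color: np.ndarray  # (3,) RGB in [0, 1]
    opacity: float     # in (0, 1)

def predict_gaussians(features: np.ndarray) -> list[Gaussian3D]:
    """Toy stand-in for the feed-forward head: one Gaussian per feature row."""
    rng = np.random.default_rng(0)
    W_pos = rng.standard_normal((features.shape[1], 3)) * 0.01
    gaussians = []
    for f in features:
        pos = f @ W_pos                       # linear projection to a position
        cov = np.eye(3) * 0.05                # isotropic placeholder covariance
        color = 1.0 / (1.0 + np.exp(-f[:3]))  # sigmoid squashes into [0, 1]
        opacity = float(1.0 / (1.0 + np.exp(-f[3])))
        gaussians.append(Gaussian3D(pos, cov, color, opacity))
    return gaussians

feats = np.random.default_rng(1).standard_normal((4, 8))
print(len(predict_gaussians(feats)))  # 4
```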
- Autoregressive Assembly – For each new frame, the model predicts a fresh batch of Gaussians and appends them to the existing reconstruction. This “online” assembly scales linearly with the number of frames, avoiding the costly global optimization of canonical‑space methods.
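The linear-growth property of this online assembly can be shown with a toy accounting sketch; the per-frame budget of 4096 Gaussians is an assumed, illustrative number, not a figure from the paper:

```python
def assemble(num_frames: int, gaussians_per_frame: int = 4096) -> list[int]:
    """Each incoming frame contributes a fixed batch of new Gaussians that is
    appended to the scene and never globally re-optimized, so the total
    primitive count grows linearly in the number of frames."""
    scene_sizes = []
    scene = 0
    for _ in range(num_frames):
        scene += gaussians_per_frame  # append this frame's batch
        scene_sizes.append(scene)
    return scene_sizes

print(assemble(10)[-1])  # 40960
```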
- Pose Handling Dilemma – Training with ground‑truth poses yields stable gradients, but at test time the model must rely on its own pose predictions, creating a train‑test distribution shift.
- Render‑and‑Compare (ReCo) Loop:
  - Render the current Gaussian set from the predicted camera pose.
  - Compare the rendered image pixel‑wise to the incoming observation.
  - Feed the residual (difference image) back into the network as an additional conditioning signal, effectively telling the model “this is where my pose estimate went wrong.”
  - This feedback stabilizes training and bridges the train‑test pose gap.
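The render-compare-feedback cycle can be sketched schematically. Here `render_fn` is a toy image-shifting stand-in for the actual splatting renderer, and the scalar "pose" is purely illustrative; only the shape of the loop reflects the paper:

```python
import numpy as np

def reco_step(render_fn, pose, observation):
    """One render-and-compare iteration: render from the predicted pose,
    then return the pixel-wise residual that conditions the network."""
    rendered = render_fn(pose)        # H x W x 3 image from current estimate
    residual = observation - rendered # nonzero wherever the pose estimate errs
    return residual

# Toy renderer: a horizontal gradient image shifted by the 1-D "pose".
H, W = 8, 8
base = np.tile(np.linspace(0.0, 1.0, W), (H, 1))[..., None].repeat(3, axis=2)
render_fn = lambda pose: np.roll(base, shift=int(pose), axis=1)

obs = render_fn(0)                                  # ground-truth view
res = reco_step(render_fn, pose=2, observation=obs) # drifted pose estimate
print(float(np.abs(res).mean()) > 0.0)              # True: residual flags drift
```

With a perfect pose the residual vanishes, so the conditioning signal carries information only when correction is needed.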
- Hybrid KV‑Cache Compression – Because the autoregressive pipeline keeps a growing history of key‑value pairs (as in a transformer), memory grows unboundedly with sequence length. The authors truncate the cache in early layers (whose low‑level features become redundant) and, in later layers, selectively retain only the most informative chunks, achieving a >90 % reduction in cache size without hurting quality.
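A minimal sketch of the two-stage scheme. The chunk-scoring rule here (mean L2 norm) is a hypothetical stand-in for the paper's actual informativeness criterion, and the layer counts, chunk size, and retention budget are illustrative numbers:

```python
import numpy as np

def compress_cache(kv, truncate_layers, chunk, keep_per_layer):
    """Two-stage KV-cache pruning (schematic):
    1) drop the cache entirely for the first `truncate_layers` layers;
    2) for each remaining layer, split tokens into chunks and keep only the
       `keep_per_layer` highest-scoring chunks (scored by mean L2 norm here)."""
    out = {}
    for layer, tokens in kv.items():
        if layer < truncate_layers:
            continue                                    # stage 1: truncation
        n_chunks = len(tokens) // chunk
        chunks = tokens[: n_chunks * chunk].reshape(n_chunks, chunk, -1)
        scores = np.linalg.norm(chunks, axis=-1).mean(axis=1)
        keep = np.sort(np.argsort(scores)[-keep_per_layer:])
        out[layer] = chunks[keep].reshape(-1, tokens.shape[-1])  # stage 2
    return out

rng = np.random.default_rng(0)
kv = {layer: rng.standard_normal((1024, 64)) for layer in range(12)}
small = compress_cache(kv, truncate_layers=6, chunk=64, keep_per_layer=2)

orig = sum(v.size for v in kv.values())
kept = sum(v.size for v in small.values())
print(kept / orig)  # 0.0625 — i.e. >90 % of the cache removed
```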
Results & Findings
| Setting | PSNR | Relative Gain vs. Prior SOTA |
|---|---|---|
| Posed + Intrinsics (LLFF) | 31.8 dB | +0.9 dB |
| Unposed + No Intrinsics (Tanks‑and‑Temples) | 28.4 dB | +1.2 dB |
| Long‑sequence (100+ frames) | 30.1 dB | +0.7 dB |
| Out‑of‑distribution (synthetic‑to‑real) | 27.6 dB | +1.0 dB |
- The ReCo module reduces pose‑induced artifacts by ~30 % compared to a baseline that only uses predicted poses.
- Memory usage drops from ~2 GB to ~180 MB for a 120‑frame sequence, enabling real‑time inference on a single RTX‑3080.
- Qualitative examples show crisp edges and consistent geometry even when the input video contains rapid motion or low‑light conditions.
Practical Implications
- AR/VR streaming – Developers can now stream a live 3D reconstruction from a handheld device without pre‑calibrated cameras, enabling on‑device scene capture for shared mixed‑reality experiences.
- Robotics & SLAM – The ability to ingest unposed video and output a dense, renderable model in real time simplifies mapping pipelines for drones or autonomous vehicles operating in GPS‑denied environments.
- Content creation – Artists can capture a scene with a consumer phone and instantly obtain a high‑quality 3D asset for games or virtual production, bypassing time‑consuming photogrammetry pipelines.
- Edge deployment – The KV‑cache compression makes the approach feasible on edge GPUs or even high‑end mobile SoCs, opening the door to on‑device 3D reconstruction apps.
Limitations & Future Work
- Dynamic scenes – ReCoSplat assumes a static environment; moving objects currently cause ghosting artifacts. Extending the model to handle dynamic elements is an open challenge.
- Extreme pose errors – While ReCo mitigates moderate pose drift, very large initial pose estimation errors can still destabilize the reconstruction. Integrating more robust pose priors or multi‑view geometry checks could help.
- Scalability beyond 200 frames – Although KV‑cache compression is effective up to ~150 frames, ultra‑long sequences (e.g., full‑day captures) may still hit memory limits; hierarchical scene partitioning is a promising direction.
The authors plan to explore dynamic‑scene extensions, tighter integration with learned pose estimators, and hierarchical caching strategies in upcoming work.
Authors
- Freeman Cheng
- Botao Ye
- Xueting Li
- Junqi You
- Fangneng Zhan
- Ming‑Hsuan Yang
Paper Information
- arXiv ID: 2603.09968v1
- Categories: cs.CV
- Published: March 10, 2026