[Paper] ReCoSplat: Autoregressive Feed-Forward Gaussian Splatting Using Render-and-Compare

Published: March 10, 2026 at 01:58 PM EDT
4 min read
Source: arXiv


Overview

ReCoSplat is a new feed‑forward model for online novel‑view synthesis that can ingest video streams with or without known camera poses or intrinsics. By combining an autoregressive Gaussian‑splatting backbone with a clever “render‑and‑compare” feedback loop, the system stays stable even when the pose estimates it relies on are noisy—a common problem for real‑world AR/VR pipelines.

Key Contributions

  • Autoregressive Gaussian Splatting for unposed inputs – works with raw video frames, estimating camera pose on the fly.
  • Render‑and‑Compare (ReCo) module – renders the current scene from the predicted viewpoint, compares it to the incoming frame, and uses the residual as a conditioning signal to correct pose drift during inference.
  • Hybrid KV‑cache compression – a two‑stage memory‑saving scheme (early‑layer truncation + chunk‑level selective retention) that cuts the transformer‑style key‑value cache by >90 % for sequences longer than 100 frames.
  • State‑of‑the‑art results on both in‑distribution (e.g., LLFF, Tanks‑and‑Temples) and out‑of‑distribution benchmarks, across four input configurations (posed/unposed, with/without intrinsics).
  • Open‑source release of code and pretrained models, facilitating rapid adoption.

Methodology

  1. Gaussian Splatting Backbone – The scene is represented as a set of 3D Gaussians whose attributes (position, covariance, color, opacity) are predicted by a lightweight feed‑forward network. Unlike NeRF‑style volumetric rendering, splatting is fast and naturally supports incremental updates.
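As a concrete sketch of this representation, the snippet below models a batch of 3D Gaussians as plain arrays (means, per-axis scales, quaternion rotations, colors, opacities) and stubs the feed-forward prediction head with random attributes of the right shapes. The container layout and the function name `predict_gaussians` are our illustration, not the paper's actual architecture.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class GaussianSet:
    """Attributes of N anisotropic 3D Gaussians (one splat per row)."""
    means: np.ndarray      # (N, 3) positions
    scales: np.ndarray     # (N, 3) per-axis extents of the covariance
    rotations: np.ndarray  # (N, 4) unit quaternions orienting the covariance
    colors: np.ndarray     # (N, 3) RGB
    opacities: np.ndarray  # (N, 1) values in [0, 1]

def predict_gaussians(frame: np.ndarray, n_per_frame: int = 4096) -> GaussianSet:
    """Stand-in for the feed-forward head: maps an H x W x 3 frame to splats.

    Here the attributes are random placeholders; in the real model they are
    regressed from image features in a single forward pass.
    """
    rng = np.random.default_rng(0)
    quats = rng.normal(size=(n_per_frame, 4))
    quats /= np.linalg.norm(quats, axis=1, keepdims=True)  # normalize rotations
    return GaussianSet(
        means=rng.normal(size=(n_per_frame, 3)),
        scales=np.exp(rng.normal(size=(n_per_frame, 3)) * 0.1),
        rotations=quats,
        colors=rng.uniform(size=(n_per_frame, 3)),
        opacities=rng.uniform(size=(n_per_frame, 1)),
    )
```

Because each Gaussian is just a row of attributes, rendering and incremental updates reduce to array operations, which is what makes splatting fast compared with volumetric ray marching.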

  2. Autoregressive Assembly – For each new frame, the model predicts a fresh batch of Gaussians and appends them to the existing reconstruction. This “online” assembly scales linearly with the number of frames, avoiding the costly global optimization of canonical‑space methods.
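The online assembly step amounts to concatenating each frame's new splats onto the running scene, so total work grows linearly with sequence length. A minimal sketch, with attribute dictionaries and the helper name `append_frame` as our own placeholders:

```python
import numpy as np

def append_frame(scene: dict, new: dict) -> dict:
    """Append one frame's predicted Gaussians to the running reconstruction.

    `scene` and `new` map attribute names (e.g. 'means', 'colors') to arrays
    whose first axis indexes Gaussians. Cost is O(size of new batch) per
    frame, with no global re-optimization of earlier frames.
    """
    if not scene:
        return {k: v.copy() for k, v in new.items()}
    return {k: np.concatenate([scene[k], new[k]], axis=0) for k in scene}

# A stream of three frames, each contributing 100 Gaussians to the scene.
scene = {}
for _ in range(3):
    batch = {"means": np.zeros((100, 3)), "colors": np.zeros((100, 3))}
    scene = append_frame(scene, batch)
```

This contrasts with canonical-space methods, which revisit the whole reconstruction whenever a new view arrives.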

  3. Pose Handling Dilemma – Training with ground‑truth poses yields stable gradients, but at test time the model must rely on its own pose predictions, leading to a distribution shift.

  4. Render‑and‑Compare (ReCo) Loop

    • Render the current Gaussian set from the predicted camera pose.
    • Compare the rendered image pixel‑wise to the incoming observation.
    • Feed the residual (difference image) back into the network as an additional conditioning signal, effectively telling the model “this is where my pose estimate went wrong.”
    • This feedback stabilizes training and bridges the train‑test pose gap.
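The loop above can be sketched as a single correction step. In this toy version the "pose" is one scalar, rendering is a stub, and the correction network is a damped average of the residual; all function names are ours, and the real model consumes the residual image as a learned conditioning signal rather than a hand-written rule.

```python
import numpy as np

def reco_step(scene, pose, frame, render, correct):
    """One render-and-compare iteration (schematic).

    Render the current Gaussians from the predicted pose, compare the result
    pixel-wise to the observed frame, and map the residual to a pose update.
    """
    rendered = render(scene, pose)   # image synthesized from the splat set
    residual = frame - rendered      # signal: "here is where the pose is off"
    pose = pose + correct(residual)  # feedback counteracts pose drift
    return pose, residual

# Toy demonstration: a 1-D 'pose' whose error appears as a uniform offset.
render = lambda scene, pose: np.full((8, 8), pose)  # brightness tracks pose
correct = lambda res: 0.5 * res.mean()              # damped correction
frame = np.full((8, 8), 3.0)                        # observation implies pose 3
pose = 0.0
for _ in range(20):
    pose, _ = reco_step(None, pose, frame, render, correct)
```

Even this caricature shows the key property: repeatedly closing the loop drives the pose estimate toward the one that explains the observation, which is how the module bridges the train-test pose gap.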
  5. Hybrid KV‑Cache Compression – Because the autoregressive pipeline keeps a growing history of key‑value pairs (like a transformer), memory can explode. The authors truncate early layers (which capture low‑level features that become redundant) and selectively retain only the most informative chunks for later layers, achieving >90 % reduction in cache size without hurting quality.
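The two-stage policy can be sketched as follows. This is our schematic, not the paper's exact algorithm: early layers keep only a recent window of entries, while later layers are split into chunks and scored by a simple key-norm proxy for "informativeness", keeping only the top-scoring chunks in temporal order.

```python
import numpy as np

def compress_kv_cache(cache, early_layers, window, chunk, keep_chunks):
    """Two-stage KV-cache compression (schematic).

    Stage 1: early layers are truncated to a recent window, since their
             low-level features become redundant for old frames.
    Stage 2: later layers are split into fixed-size chunks; only the
             highest-scoring chunks are retained (key-norm as a stand-in
             for a learned importance score).
    """
    out = []
    for layer, kv in enumerate(cache):          # kv: (T, d) stacked key-values
        if layer < early_layers:
            out.append(kv[-window:])            # drop old entries outright
        else:
            n = len(kv) // chunk
            chunks = kv[: n * chunk].reshape(n, chunk, -1)
            scores = np.linalg.norm(chunks, axis=(1, 2))      # per-chunk score
            keep = np.sort(np.argsort(scores)[-keep_chunks:])  # temporal order
            out.append(chunks[keep].reshape(-1, kv.shape[-1]))
    return out
```

With a 120-entry cache, a window of 8, and 3 retained chunks of 10 entries each, this sketch keeps 8 of 120 entries in early layers and 30 of 120 in later ones, illustrating how a >90 % reduction is plausible without discarding the most informative history.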

Results & Findings

| Setting | Metric (PSNR) | Relative Gain vs. Prior SOTA |
| --- | --- | --- |
| Posed + Intrinsics (LLFF) | 31.8 dB | +0.9 dB |
| Unposed + No Intrinsics (Tanks‑and‑Temples) | 28.4 dB | +1.2 dB |
| Long‑sequence (100+ frames) | 30.1 dB | +0.7 dB |
| Out‑of‑distribution (synthetic‑to‑real) | 27.6 dB | +1.0 dB |
  • The ReCo module reduces pose‑induced artifacts by ~30 % compared to a baseline that only uses predicted poses.
  • Memory usage drops from ~2 GB to ~180 MB for a 120‑frame sequence, enabling real‑time inference on a single RTX‑3080.
  • Qualitative examples show crisp edges and consistent geometry even when the input video contains rapid motion or low‑light conditions.

Practical Implications

  • AR/VR streaming – Developers can now stream a live 3D reconstruction from a handheld device without pre‑calibrated cameras, enabling on‑device scene capture for shared mixed‑reality experiences.
  • Robotics & SLAM – The ability to ingest unposed video and output a dense, renderable model in real time simplifies mapping pipelines for drones or autonomous vehicles operating in GPS‑denied environments.
  • Content creation – Artists can capture a scene with a consumer phone and instantly obtain a high‑quality 3D asset for games or virtual production, bypassing time‑consuming photogrammetry pipelines.
  • Edge deployment – The KV‑cache compression makes the approach feasible on edge GPUs or even high‑end mobile SoCs, opening the door to on‑device 3D reconstruction apps.

Limitations & Future Work

  • Dynamic scenes – ReCoSplat assumes a static environment; moving objects currently cause ghosting artifacts. Extending the model to handle dynamic elements is an open challenge.
  • Extreme pose errors – While ReCo mitigates moderate pose drift, very large initial pose estimation errors can still destabilize the reconstruction. Integrating more robust pose priors or multi‑view geometry checks could help.
  • Scalability beyond 200 frames – Although KV‑cache compression is effective up to ~150 frames, ultra‑long sequences (e.g., full‑day captures) may still hit memory limits; hierarchical scene partitioning is a promising direction.

The authors plan to explore dynamic‑scene extensions, tighter integration with learned pose estimators, and hierarchical caching strategies in upcoming work.

Authors

  • Freeman Cheng
  • Botao Ye
  • Xueting Li
  • Junqi You
  • Fangneng Zhan
  • Ming‑Hsuan Yang

Paper Information

  • arXiv ID: 2603.09968v1
  • Categories: cs.CV
  • Published: March 10, 2026