[Paper] AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model

Published: April 21, 2026 at 01:59 PM EDT
5 min read
Source: arXiv

Overview

AnyRecon tackles the long‑standing problem of reconstructing high‑quality 3D geometry from a handful of casually captured video frames. By marrying a video diffusion model with a persistent scene‑wide memory, the authors enable arbitrary‑view, unordered, and sparse inputs while still preserving geometric consistency—something previous diffusion‑based pipelines struggled with.

Key Contributions

  • Any‑view conditioning: Supports any number of input frames (from 1 to dozens) without re‑training, thanks to a global scene memory that stores all captured views.
  • Persistent global scene memory: A “capture view cache” that keeps frame‑level correspondence even under large viewpoint changes, eliminating the need for temporal compression.
  • Geometry‑aware conditioning: Couples generation and reconstruction through an explicit 3D geometric memory and a retrieval mechanism that selects the most relevant cached views for each diffusion step.
  • Efficient diffusion: Combines 4‑step diffusion distillation with context‑window sparse attention, cutting the quadratic cost of vanilla video diffusion while keeping generation fidelity.
  • Scalable to large, irregular scenes: Demonstrated on long trajectories, wide baseline gaps, and heterogeneous capture orders, showing robustness where prior methods fail.

Methodology

  1. Capture‑View Cache: When a video (or a set of unordered frames) is fed into the system, each frame is encoded and stored in a global memory bank. This bank is persistent—it never gets compressed away, so the model can always reference the exact pixel‑level information of any captured view.
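The cache described above can be pictured as a plain key-value store that never evicts or compresses entries. The sketch below is illustrative only; the class and method names are assumptions, not the paper's API:

```python
import numpy as np

class CaptureViewCache:
    """Minimal sketch of a persistent capture-view cache. Each frame is
    encoded once and stored uncompressed, keyed by its frame index, so
    any captured view can be referenced exactly at any later step."""

    def __init__(self):
        self._store = {}  # frame_id -> (latent, camera_pose)

    def add(self, frame_id, latent, pose):
        # Entries are persistent: nothing is compressed away or evicted.
        self._store[frame_id] = (np.asarray(latent), np.asarray(pose))

    def get(self, frame_id):
        return self._store[frame_id]

    def __len__(self):
        return len(self._store)

cache = CaptureViewCache()
cache.add(0, np.zeros((8, 8, 4)), np.eye(4))
cache.add(1, np.ones((8, 8, 4)), np.eye(4))
```

Here the "latent" is a stand-in for whatever per-frame encoding the model produces; the point is that retrieval is exact rather than lossy.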

  2. Video Diffusion Backbone: A standard video diffusion model (U‑Net‑style denoiser) is used, but the authors replace the usual full‑self‑attention with context‑window sparse attention. This limits attention to a sliding window of frames, drastically reducing the O(N²) memory/computation while still allowing long‑range interactions via the cache.
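The context-window pattern can be sketched as a boolean attention mask: each frame attends only to frames within a fixed window of itself, so the number of allowed pairs grows as O(N·window) rather than O(N²). This is a generic sliding-window mask, not the paper's exact attention layout:

```python
import numpy as np

def window_attention_mask(n_frames: int, window: int) -> np.ndarray:
    """Boolean mask for context-window sparse attention: frame i may
    attend only to frames j with |i - j| <= window, replacing the dense
    O(N^2) full-attention pattern with a banded one."""
    idx = np.arange(n_frames)
    return np.abs(idx[:, None] - idx[None, :]) <= window

# For 6 frames and a window of 1, each frame sees at most 3 neighbours.
mask = window_attention_mask(6, 1)
```

Long-range information still flows through the capture-view cache, so the mask only needs to cover local temporal context.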

  3. Geometry‑Aware Retrieval: For each diffusion timestep, the model queries the 3D geometric memory (a voxel/point‑cloud representation built on‑the‑fly) to fetch the most geometrically relevant cached views. Those views are concatenated to the diffusion input, guiding the network toward geometrically plausible color and depth predictions.
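A minimal retrieval sketch, assuming camera-position distance as the relevance score (the paper's retrieval is geometry-aware and likely scores view overlap; plain Euclidean distance between camera centers is a stand-in for illustration):

```python
import numpy as np

def retrieve_relevant_views(query_pos, cached_positions, k=2):
    """Rank cached camera positions by distance to the query pose and
    return the indices of the k nearest views. A geometry-aware variant
    would score overlap against the on-the-fly 3D memory instead."""
    cached = np.asarray(cached_positions, dtype=float)
    dists = np.linalg.norm(cached - np.asarray(query_pos, dtype=float), axis=1)
    return np.argsort(dists)[:k]

cached_cams = [[0, 0, 0], [1, 0, 0], [5, 5, 0]]
idx = retrieve_relevant_views([0.9, 0.1, 0.0], cached_cams, k=2)
```

The selected views would then be concatenated to the diffusion input as conditioning, as described above.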

  4. 4‑Step Diffusion Distillation: Instead of running the full 1000‑step denoising chain, a lightweight student model is trained to mimic the output of a teacher after just four diffusion steps. This yields a ~10× speed‑up with negligible loss in visual quality.
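The step-count reduction can be illustrated by how the student's schedule subsamples the teacher's chain. Evenly spaced timesteps are an assumption here; the paper's distilled schedule may differ:

```python
import numpy as np

def distilled_timesteps(total_steps=1000, n_student_steps=4):
    """Sketch: the distilled student denoises at only a few timesteps
    spanning the teacher's full chain, instead of running all of them."""
    return np.linspace(total_steps - 1, 0, n_student_steps).round().astype(int)

steps = distilled_timesteps()  # four steps spanning the 1000-step chain
```

Collapsing ~1000 denoising evaluations to 4 is what yields the order-of-magnitude speed-up reported below.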

  5. Reconstruction Pipeline: After the diffusion model synthesizes novel views from arbitrary query poses, a standard multi‑view stereo (MVS) module extracts depth maps, which are fused into a final 3D mesh/point cloud. Because the generated views are already geometry‑consistent, the MVS stage converges faster and with fewer artifacts.
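The fusion stage rests on standard back-projection: each depth map is lifted to a 3D point cloud via the pinhole camera model, and clouds from all synthesized views are merged. A minimal sketch of the lifting step (generic MVS math, not the paper's specific fusion module):

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map into camera-space 3D points using
    pinhole intrinsics: x = (u - cx) * z / fx, y = (v - cy) * z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# A flat 4x4 depth map at 2 m with toy intrinsics.
pts = backproject_depth(np.full((4, 4), 2.0), fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

Because the diffusion-synthesized views are already cross-view consistent, the per-view clouds agree and the downstream fusion converges with fewer artifacts.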

Results & Findings

  • Robustness to Sparse & Irregular Inputs: With as few as 3–5 widely spaced frames, AnyRecon recovers dense geometry comparable to methods that require dozens of calibrated images.
  • Large Viewpoint Gaps: Experiments on synthetic and real‑world datasets (e.g., ScanNet, Tanks and Temples) show less than 5% Chamfer‑distance degradation when the baseline between input views grows from 30° to 120°.
  • Speed & Memory: The sparse‑attention + 4‑step distillation reduces GPU memory from ~12 GB (full diffusion) to ~4 GB and inference time from ~30 s to ≈3 s per scene on an RTX 3090.
  • Qualitative Gains: Rendered novel views exhibit fewer ghosting artifacts and better texture continuity across seams, confirming that the geometry‑aware cache successfully enforces cross‑view consistency.

Practical Implications

  • Rapid Prototyping for AR/VR: Developers can now generate usable 3D assets from a handful of phone videos, cutting down on costly photogrammetry pipelines.
  • Content Creation for Games & Metaverses: Studios can ingest unordered footage (e.g., from stunt rehearsals) and obtain clean geometry without manual camera calibration.
  • Robotics & Autonomous Navigation: Sparse on‑board cameras can feed AnyRecon to build up‑to‑date scene maps in real time, aiding SLAM systems that struggle with texture‑poor environments.
  • Scalable Cloud Services: The efficient diffusion backbone makes it feasible to expose AnyRecon as a SaaS offering—users upload a few clips, receive a 3D model within minutes, and pay only for compute time.

Limitations & Future Work

  • Reliance on Diffusion Quality: The final geometry is only as good as the synthesized views; extreme lighting changes or motion blur can still propagate errors.
  • Memory Footprint for Very Long Sequences: Although sparse attention mitigates quadratic growth, the global cache still scales linearly with the number of input frames, which may become a bottleneck for hour‑long videos.
  • Limited Dynamic Scene Handling: The current pipeline assumes static geometry; extending the method to moving objects or deformable scenes is left for future research.
  • Evaluation on Real‑World Large‑Scale Outdoor Scenes: The authors note that testing on city‑scale reconstructions (e.g., autonomous‑driving datasets) remains an open challenge.

AnyRecon demonstrates that diffusion models, when paired with clever memory and geometry tricks, can finally bridge the gap between generative view synthesis and practical 3D reconstruction—a promising step toward democratizing high‑fidelity 3D content creation.

Authors

  • Yutian Chen
  • Shi Guo
  • Renbiao Jin
  • Tianshuo Yang
  • Xin Cai
  • Yawen Luo
  • Mingxin Yang
  • Mulin Yu
  • Linning Xu
  • Tianfan Xue

Paper Information

  • arXiv ID: 2604.19747v1
  • Categories: cs.CV
  • Published: April 21, 2026