[Paper] AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model

Published: April 21, 2026 at 01:59 PM EDT
5 min read
Source: arXiv

Overview

AnyRecon tackles the long‑standing problem of reconstructing high‑quality 3D geometry from a handful of casually captured video frames. By marrying a video diffusion model with a persistent scene‑wide memory, the authors enable arbitrary‑view, unordered, and sparse inputs while still preserving geometric consistency—something previous diffusion‑based pipelines struggled with.

Key Contributions

  • Any‑view conditioning: Supports any number of input frames (from 1 to dozens) without re‑training, thanks to a global scene memory that stores all captured views.
  • Persistent global scene memory: A “capture view cache” that keeps frame‑level correspondence even under large viewpoint changes, eliminating the need for temporal compression.
  • Geometry‑aware conditioning: Couples generation and reconstruction through an explicit 3D geometric memory and a retrieval mechanism that selects the most relevant cached views for each diffusion step.
  • Efficient diffusion: Combines 4‑step diffusion distillation with context‑window sparse attention, cutting the quadratic cost of vanilla video diffusion while keeping generation fidelity.
  • Scalable to large, irregular scenes: Demonstrated on long trajectories, wide baseline gaps, and heterogeneous capture orders, showing robustness where prior methods fail.

Methodology

  1. Capture‑View Cache: When a video (or a set of unordered frames) is fed into the system, each frame is encoded and stored in a global memory bank. This bank is persistent—it never gets compressed away, so the model can always reference the exact pixel‑level information of any captured view.
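The cache described above can be pictured as a plain key-value store that never evicts or compresses entries. The sketch below is illustrative only; the class and method names are assumptions, not the paper's API:

```python
import numpy as np

class CaptureViewCache:
    """Minimal sketch of a persistent capture-view cache. Each frame is
    encoded once and stored uncompressed, keyed by its frame index, so
    any captured view can be referenced exactly at any later step."""

    def __init__(self):
        self._store = {}  # frame_id -> (latent, camera_pose)

    def add(self, frame_id, latent, pose):
        # Entries are persistent: nothing is compressed away or evicted.
        self._store[frame_id] = (np.asarray(latent), np.asarray(pose))

    def get(self, frame_id):
        return self._store[frame_id]

    def __len__(self):
        return len(self._store)

cache = CaptureViewCache()
cache.add(0, np.zeros((8, 8, 4)), np.eye(4))
cache.add(1, np.ones((8, 8, 4)), np.eye(4))
```

Here the "latent" is a stand-in for whatever per-frame encoding the model produces; the point is that retrieval is exact rather than lossy.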

  2. Video Diffusion Backbone: A standard video diffusion model (U‑Net‑style denoiser) is used, but the authors replace the usual full‑self‑attention with context‑window sparse attention. This limits attention to a sliding window of frames, drastically reducing the O(N²) memory/computation while still allowing long‑range interactions via the cache.
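The context-window pattern can be sketched as a boolean attention mask: each frame attends only to frames within a fixed window of itself, so the number of allowed pairs grows as O(N·window) rather than O(N²). This is a generic sliding-window mask, not the paper's exact attention layout:

```python
import numpy as np

def window_attention_mask(n_frames: int, window: int) -> np.ndarray:
    """Boolean mask for context-window sparse attention: frame i may
    attend only to frames j with |i - j| <= window, replacing the dense
    O(N^2) full-attention pattern with a banded one."""
    idx = np.arange(n_frames)
    return np.abs(idx[:, None] - idx[None, :]) <= window

# For 6 frames and a window of 1, each frame sees at most 3 neighbours.
mask = window_attention_mask(6, 1)
```

Long-range information still flows through the capture-view cache, so the mask only needs to cover local temporal context.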

  3. Geometry‑Aware Retrieval: For each diffusion timestep, the model queries the 3D geometric memory (a voxel/point‑cloud representation built on‑the‑fly) to fetch the most geometrically relevant cached views. Those views are concatenated to the diffusion input, guiding the network toward geometrically plausible color and depth predictions.
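A minimal retrieval sketch, assuming camera-position distance as the relevance score (the paper's retrieval is geometry-aware and likely scores view overlap; plain Euclidean distance between camera centers is a stand-in for illustration):

```python
import numpy as np

def retrieve_relevant_views(query_pos, cached_positions, k=2):
    """Rank cached camera positions by distance to the query pose and
    return the indices of the k nearest views. A geometry-aware variant
    would score overlap against the on-the-fly 3D memory instead."""
    cached = np.asarray(cached_positions, dtype=float)
    dists = np.linalg.norm(cached - np.asarray(query_pos, dtype=float), axis=1)
    return np.argsort(dists)[:k]

cached_cams = [[0, 0, 0], [1, 0, 0], [5, 5, 0]]
idx = retrieve_relevant_views([0.9, 0.1, 0.0], cached_cams, k=2)
```

The selected views would then be concatenated to the diffusion input as conditioning, as described above.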

  4. 4‑Step Diffusion Distillation: Instead of running the full 1000‑step denoising chain, a lightweight student model is trained to mimic the output of a teacher after just four diffusion steps. This yields a ~10× speed‑up with negligible loss in visual quality.
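The step-count reduction can be illustrated by how the student's schedule subsamples the teacher's chain. Evenly spaced timesteps are an assumption here; the paper's distilled schedule may differ:

```python
import numpy as np

def distilled_timesteps(total_steps=1000, n_student_steps=4):
    """Sketch: the distilled student denoises at only a few timesteps
    spanning the teacher's full chain, instead of running all of them."""
    return np.linspace(total_steps - 1, 0, n_student_steps).round().astype(int)

steps = distilled_timesteps()  # four steps spanning the 1000-step chain
```

Collapsing ~1000 denoising evaluations to 4 is what yields the order-of-magnitude speed-up reported below.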

  5. Reconstruction Pipeline: After the diffusion model synthesizes novel views from arbitrary query poses, a standard multi‑view stereo (MVS) module extracts depth maps, which are fused into a final 3D mesh/point cloud. Because the generated views are already geometry‑consistent, the MVS stage converges faster and with fewer artifacts.
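The fusion stage rests on standard back-projection: each depth map is lifted to a 3D point cloud via the pinhole camera model, and clouds from all synthesized views are merged. A minimal sketch of the lifting step (generic MVS math, not the paper's specific fusion module):

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map into camera-space 3D points using
    pinhole intrinsics: x = (u - cx) * z / fx, y = (v - cy) * z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# A flat 4x4 depth map at 2 m with toy intrinsics.
pts = backproject_depth(np.full((4, 4), 2.0), fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

Because the diffusion-synthesized views are already cross-view consistent, the per-view clouds agree and the downstream fusion converges with fewer artifacts.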

Results & Findings

  • Robustness to Sparse & Irregular Inputs: With as few as 3–5 widely spaced frames, AnyRecon recovers dense geometry comparable to methods that require dozens of calibrated images.
  • Large Viewpoint Gaps: Experiments on synthetic and real‑world datasets (e.g., ScanNet, Tanks and Temples) show less than 5% Chamfer‑distance degradation when the baseline between input views grows from 30° to 120°.
  • Speed & Memory: The sparse‑attention + 4‑step distillation reduces GPU memory from ~12 GB (full diffusion) to ~4 GB and inference time from ~30 s to ≈3 s per scene on an RTX 3090.
  • Qualitative Gains: Rendered novel views exhibit fewer ghosting artifacts and better texture continuity across seams, confirming that the geometry‑aware cache successfully enforces cross‑view consistency.

Practical Implications

  • Rapid Prototyping for AR/VR: Developers can now generate usable 3D assets from a handful of phone videos, cutting down on costly photogrammetry pipelines.
  • Content Creation for Games & Metaverses: Studios can ingest unordered footage (e.g., from stunt rehearsals) and obtain clean geometry without manual camera calibration.
  • Robotics & Autonomous Navigation: Sparse on‑board cameras can feed AnyRecon to build up‑to‑date scene maps in real time, aiding SLAM systems that struggle with texture‑poor environments.
  • Scalable Cloud Services: The efficient diffusion backbone makes it feasible to expose AnyRecon as a SaaS offering—users upload a few clips, receive a 3D model within minutes, and pay only for compute time.

Limitations & Future Work

  • Reliance on Diffusion Quality: The final geometry is only as good as the synthesized views; extreme lighting changes or motion blur can still propagate errors.
  • Memory Footprint for Very Long Sequences: Although sparse attention mitigates quadratic growth, the global cache still scales linearly with the number of input frames, which may become a bottleneck for hour‑long videos.
  • Limited Dynamic Scene Handling: The current pipeline assumes static geometry; extending the method to moving objects or deformable scenes is left for future research.
  • Evaluation on Real‑World Large‑Scale Outdoor Scenes: The authors note that testing on city‑scale reconstructions (e.g., autonomous‑driving datasets) remains an open challenge.

AnyRecon demonstrates that diffusion models, when paired with clever memory and geometry tricks, can finally bridge the gap between generative view synthesis and practical 3D reconstruction—a promising step toward democratizing high‑fidelity 3D content creation.

Authors

  • Yutian Chen
  • Shi Guo
  • Renbiao Jin
  • Tianshuo Yang
  • Xin Cai
  • Yawen Luo
  • Mingxin Yang
  • Mulin Yu
  • Linning Xu
  • Tianfan Xue

Paper Information

  • arXiv ID: 2604.19747v1
  • Categories: cs.CV
  • Published: April 21, 2026