[Paper] Any4D: Unified Feed-Forward Metric 4D Reconstruction
Source: arXiv - 2512.10935v1
Overview
The paper introduces Any4D, a transformer‑based architecture that can reconstruct dense, metric‑scale 4‑D (3‑D geometry + motion) scenes directly from multiple video frames. Unlike most prior work that either estimates pairwise scene flow or tracks sparse 3‑D points, Any4D predicts per‑pixel depth and motion for any number of views and can ingest a mix of sensors (RGB‑D, IMU, radar). The result is a fast, accurate, and flexible pipeline that bridges the gap between research‑grade 4‑D reconstruction and real‑world robotics or AR/VR applications.
Key Contributions
- Unified multi‑view transformer that outputs dense per‑pixel depth and scene flow for N frames in a single forward pass.
- Modular egocentric/allocentric representation: depth & intrinsics are kept in each camera’s local frame, while extrinsics & flow are expressed in a global world frame, enabling seamless fusion of heterogeneous sensor data (see the sketch after this list).
- Multi‑modal support: the same network can consume RGB, RGB‑D, IMU odometry, or radar Doppler measurements without architectural changes.
- Significant performance gains: 2–3× lower reconstruction error and up to 15× faster inference compared with state‑of‑the‑art 4‑D methods.
- Scalable design: works with arbitrary numbers of input frames, making it suitable for both short‑range AR scenarios and long‑duration autonomous‑driving sequences.
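To make the modular representation concrete, the minimal sketch below shows one way the egocentric and allocentric quantities could be organized; the container name, field names, tensor shapes, and the pairwise flow layout are illustrative assumptions, not the paper's API.

```python
# Minimal sketch (assumed, not the paper's API) of the egocentric/allocentric split.
from dataclasses import dataclass
import torch


@dataclass
class Any4DEstimate:  # hypothetical container holding predictions for N views
    # Egocentric quantities: expressed in each camera's local frame
    depth: torch.Tensor         # (N, H, W)   metric depth per view
    intrinsics: torch.Tensor    # (N, 3, 3)   per-view pinhole matrix K
    # Allocentric quantities: expressed in the shared world frame
    cam_to_world: torch.Tensor  # (N, 4, 4)   per-view extrinsics (SE(3) poses)
    scene_flow: torch.Tensor    # (N, N, H, W, 3)  per-pixel 3D motion from
                                # source view i to target view j, world coords
```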
Methodology
1. Input Encoding
- Each view provides a per‑pixel depth map (or raw RGB if depth is unavailable) together with camera intrinsics.
- Optional sensor streams (IMU poses, radar Doppler) are projected into the same per‑view token space.
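A hedged sketch of such per‑view tokenization is shown below, assuming a simple patch‑embedding design; the module name, channel choices, and the way intrinsics, IMU, and radar measurements are embedded are assumptions for illustration, not the paper's implementation.

```python
# Illustrative per-view tokenization: RGB/depth patches become visual tokens,
# and optional IMU or radar measurements are embedded into the same token space.
import torch
import torch.nn as nn


class ViewTokenizer(nn.Module):  # hypothetical module, shapes are assumptions
    def __init__(self, patch=16, dim=256):
        super().__init__()
        # 4 input channels: RGB + depth (depth channel zeroed when unavailable)
        self.patch_embed = nn.Conv2d(4, dim, kernel_size=patch, stride=patch)
        self.intrin_embed = nn.Linear(4, dim)   # fx, fy, cx, cy
        self.imu_embed = nn.Linear(7, dim)      # e.g. translation + quaternion
        self.radar_embed = nn.Linear(4, dim)    # e.g. x, y, z, Doppler velocity

    def forward(self, rgb, depth=None, intrinsics=None, imu=None, radar=None):
        # rgb: (B, 3, H, W); depth: (B, 1, H, W) or None
        if depth is None:
            depth = torch.zeros_like(rgb[:, :1])
        tokens = self.patch_embed(torch.cat([rgb, depth], dim=1))  # (B, D, H/p, W/p)
        tokens = tokens.flatten(2).transpose(1, 2)                 # (B, L, D)
        extra = []
        if intrinsics is not None:                                 # (B, 3, 3)
            K = torch.stack([intrinsics[:, 0, 0], intrinsics[:, 1, 1],
                             intrinsics[:, 0, 2], intrinsics[:, 1, 2]], dim=-1)
            extra.append(self.intrin_embed(K).unsqueeze(1))        # (B, 1, D)
        if imu is not None:                                        # (B, 7)
            extra.append(self.imu_embed(imu).unsqueeze(1))         # (B, 1, D)
        if radar is not None:                                      # (B, M, 4)
            extra.append(self.radar_embed(radar))                  # (B, M, D)
        return torch.cat([tokens] + extra, dim=1)                  # (B, L', D)
```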
2. Egocentric → Allocentric Fusion
- Tokens are first processed in egocentric space (local camera coordinates) to preserve high‑frequency geometric detail.
- A lightweight pose‑aware transformer then lifts these tokens into a shared allocentric (world) space, where global motion (scene flow) is reasoned about.
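The egocentric‑to‑allocentric lift can be illustrated with standard pinhole geometry; the helper below is a sketch under that assumption and is not taken from the paper's code.

```python
# Sketch of the egocentric -> allocentric lift: back-project each view's depth
# with its intrinsics into camera-frame points, then map them into the shared
# world frame with the camera-to-world pose.
import torch


def lift_to_world(depth, K, cam_to_world):
    """depth: (H, W); K: (3, 3); cam_to_world: (4, 4). Returns (H, W, 3)."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                          torch.arange(W, dtype=depth.dtype), indexing="ij")
    # Egocentric: back-project pixels to 3D points in the camera frame.
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    pts_cam = torch.stack([x, y, depth], dim=-1)      # (H, W, 3)
    # Allocentric: rigidly transform into the shared world frame.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t                          # (H, W, 3)
```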
3. Feed‑Forward Prediction
- The transformer outputs dense depth for each view and scene‑flow vectors that map every pixel from its source frame to every target frame.
- Because the model is fully feed‑forward, there is no iterative optimization at test time—just a single forward pass through the network.
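To show what dense per‑pixel scene flow buys at the output, the sketch below advects the world‑frame points of a source view by their predicted flow and projects them into a target camera; the function name and conventions (world‑frame flow vectors, pinhole target camera) are assumptions, not the paper's definitions.

```python
# Sketch: use predicted scene flow to map every pixel of source view i into
# target view j. Assumes world-frame points and flow (see lift_to_world above).
import torch


def project_to_view(pts_world_i, flow_i_to_j, K_j, world_to_cam_j):
    """pts_world_i, flow_i_to_j: (H, W, 3); K_j: (3, 3); world_to_cam_j: (4, 4).
    Returns pixel coordinates (H, W, 2) of each source pixel in target view j."""
    pts_j_world = pts_world_i + flow_i_to_j           # advect by scene flow
    R, t = world_to_cam_j[:3, :3], world_to_cam_j[:3, 3]
    pts_cam = pts_j_world @ R.T + t                   # world -> camera j frame
    z = pts_cam[..., 2:3].clamp(min=1e-6)             # guard against divide-by-zero
    uv = (pts_cam @ K_j.T)[..., :2] / z               # pinhole projection
    return uv
```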
4. Training Objective
- Supervision combines photometric consistency, depth regression, and flow smoothness losses.
- When ground‑truth metric poses are available, an additional pose‑alignment loss enforces global scale consistency.
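A minimal sketch of such a composite objective is given below; the specific loss terms, robust norms, and weights are illustrative assumptions and may differ from the paper's exact formulation.

```python
# Hedged sketch of a composite objective: photometric consistency, depth
# regression, flow smoothness, and an optional pose-alignment term.
import torch
import torch.nn.functional as F


def smoothness(flow):
    # First-order spatial smoothness on a (H, W, 3) flow field.
    dx = (flow[:, 1:] - flow[:, :-1]).abs().mean()
    dy = (flow[1:, :] - flow[:-1, :]).abs().mean()
    return dx + dy


def total_loss(pred_depth, gt_depth, warped_rgb, target_rgb, flow,
               pred_pose=None, gt_pose=None,
               w_photo=1.0, w_depth=1.0, w_smooth=0.1, w_pose=1.0):
    loss = w_photo * F.l1_loss(warped_rgb, target_rgb)       # photometric consistency
    loss = loss + w_depth * F.l1_loss(pred_depth, gt_depth)  # depth regression
    loss = loss + w_smooth * smoothness(flow)                # flow smoothness
    if gt_pose is not None:                                  # pose alignment, when
        loss = loss + w_pose * F.l1_loss(pred_pose, gt_pose) # metric poses exist
    return loss
```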
The overall pipeline can be visualized as a stack of per‑view encoders → a shared transformer → per‑view decoders, all operating on a unified token representation that mixes visual and inertial/radar cues.
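The sketch below wires that stack together as a compact, self‑contained module; the layer types, sizes, and single flow field per view (rather than flow to every target frame) are simplifying assumptions for illustration only.

```python
# Compact, self-contained sketch of the stack: shared per-view encoder ->
# joint transformer over all views' tokens -> per-view dense heads.
import torch
import torch.nn as nn


class Any4DSketch(nn.Module):  # illustrative stand-in, not the paper's model
    def __init__(self, dim=256, patch=16, layers=6, heads=8):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(block, num_layers=layers)
        self.depth_head = nn.Linear(dim, patch * patch)      # per-patch depth
        self.flow_head = nn.Linear(dim, patch * patch * 3)   # per-patch 3D flow
        self.patch = patch

    def forward(self, frames):                    # frames: (B, N, 3, H, W)
        B, N, _, H, W = frames.shape
        tok = self.encoder(frames.flatten(0, 1))  # (B*N, D, H/p, W/p)
        tok = tok.flatten(2).transpose(1, 2)      # (B*N, L, D)
        L = tok.shape[1]
        tok = tok.reshape(B, N * L, -1)           # concatenate all views' tokens
        tok = self.fusion(tok)                    # single feed-forward pass
        tok = tok.reshape(B, N, L, -1)
        p = self.patch
        depth = self.depth_head(tok).reshape(B, N, H // p, W // p, p, p)
        depth = depth.permute(0, 1, 2, 4, 3, 5).reshape(B, N, H, W)
        flow = self.flow_head(tok).reshape(B, N, H // p, W // p, p, p, 3)
        flow = flow.permute(0, 1, 2, 4, 3, 5, 6).reshape(B, N, H, W, 3)
        return depth, flow


# Example: depth, flow = Any4DSketch()(torch.randn(1, 4, 3, 64, 64))
# -> depth (1, 4, 64, 64), flow (1, 4, 64, 64, 3)
```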
Results & Findings
| Dataset / Modality | Reconstruction RMSE (reduction vs. prior art) | Speedup vs. Prior Art |
|---|---|---|
| Synthetic RGB‑D (4‑view) | 0.12 m (↓ 2.5×) | 12× faster |
| Real‑world driving (RGB + IMU) | 0.18 m (↓ 3×) | 15× faster |
| Radar‑augmented night sequences | 0.22 m (↓ 2×) | 10× faster |
- Accuracy: Any4D consistently reduces depth and flow errors by 2–3× across diverse sensor setups.
- Efficiency: The feed‑forward design eliminates costly iterative refinement, delivering real‑time performance (≈30 fps on a single RTX 3090 for 4‑view inputs).
- Robustness: Adding auxiliary modalities (e.g., radar) further improves reconstruction in low‑light or texture‑poor scenes, confirming the benefit of the modular representation.
Practical Implications
- Robotics & Autonomous Vehicles: Engineers can obtain metric‑scale 3‑D maps and motion fields on‑the‑fly, enabling better obstacle avoidance, path planning, and SLAM without heavy post‑processing.
- AR/VR Content Creation: Real‑time dense reconstruction from a handheld device (RGB‑D or even just RGB + IMU) makes it feasible to generate immersive environments on‑device, reducing reliance on cloud processing.
- Multi‑Sensor Fusion Platforms: The same model can be deployed across robots that have different sensor suites, simplifying software stacks and reducing the need for custom pipelines per hardware configuration.
- Edge Deployment: Because inference is a single forward pass, Any4D can be optimized for edge AI accelerators, opening the door to low‑power, on‑board 4‑D perception.
Limitations & Future Work
- Scale Ambiguity without Metric Sensors: Pure RGB setups still rely on learned scale priors; absolute metric accuracy improves markedly when depth or IMU data are present.
- Memory Footprint: Processing many high‑resolution frames simultaneously can exceed GPU memory limits; the authors suggest hierarchical token sampling as a mitigation.
- Dynamic Objects: While scene flow captures motion, highly non‑rigid deformations (e.g., cloth) remain challenging and may require specialized motion models.
- Future Directions: Extending the framework to handle streaming video (online updating), incorporating learned uncertainty estimates, and exploring tighter integration with downstream tasks like object detection or control.
Authors
- Jay Karhade
- Nikhil Keetha
- Yuchen Zhang
- Tanisha Gupta
- Akash Sharma
- Sebastian Scherer
- Deva Ramanan
Paper Information
- arXiv ID: 2512.10935v1
- Categories: cs.CV, cs.AI, cs.LG, cs.RO
- Published: December 11, 2025
- PDF: https://arxiv.org/pdf/2512.10935v1