[Paper] Any4D: Unified Feed-Forward Metric 4D Reconstruction
Source: arXiv - 2512.10935v1
Overview
The paper introduces Any4D, a transformer‑based architecture that can reconstruct dense, metric‑scale 4‑D (3‑D geometry + motion) scenes directly from multiple video frames. Unlike most prior work that either estimates pairwise scene flow or tracks sparse 3‑D points, Any4D predicts per‑pixel depth and motion for any number of views and can ingest a mix of sensors (RGB‑D, IMU, radar). The result is a fast, accurate, and flexible pipeline that bridges the gap between research‑grade 4‑D reconstruction and real‑world robotics or AR/VR applications.
Key Contributions
- Unified multi‑view transformer that outputs dense per‑pixel depth and scene flow for N frames in a single forward pass.
- Modular egocentric/allocentric representation: depth & intrinsics are kept in each camera’s local frame, while extrinsics & flow are expressed in a global world frame, enabling seamless fusion of heterogeneous sensor data (see the sketch after this list).
- Multi‑modal support: the same network can consume RGB, RGB‑D, IMU odometry, or radar Doppler measurements without architectural changes.
- Significant performance gains: 2–3× lower reconstruction error and up to 15× faster inference compared with state‑of‑the‑art 4‑D methods.
- Scalable design: works with arbitrary numbers of input frames, making it suitable for both short‑range AR scenarios and long‑duration autonomous‑driving sequences.
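To make the modular representation concrete, the minimal sketch below shows one way the egocentric and allocentric quantities could be organized; the container name, field names, tensor shapes, and the pairwise flow layout are illustrative assumptions, not the paper's API.

```python
# Minimal sketch (assumed, not the paper's API) of the egocentric/allocentric split.
from dataclasses import dataclass
import torch


@dataclass
class Any4DEstimate:  # hypothetical container holding predictions for N views
    # Egocentric quantities: expressed in each camera's local frame
    depth: torch.Tensor         # (N, H, W)   metric depth per view
    intrinsics: torch.Tensor    # (N, 3, 3)   per-view pinhole matrix K
    # Allocentric quantities: expressed in the shared world frame
    cam_to_world: torch.Tensor  # (N, 4, 4)   per-view extrinsics (SE(3) poses)
    scene_flow: torch.Tensor    # (N, N, H, W, 3)  per-pixel 3D motion from
                                # source view i to target view j, world coords
```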
Methodology
1. Input Encoding
- Each view provides a per‑pixel depth map (or raw RGB if depth is unavailable) together with camera intrinsics.
- Optional sensor streams (IMU poses, radar Doppler) are projected into the same per‑view token space.
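A hedged sketch of such per‑view tokenization is shown below, assuming a simple patch‑embedding design; the module name, channel choices, and the way intrinsics, IMU, and radar measurements are embedded are assumptions for illustration, not the paper's implementation.

```python
# Illustrative per-view tokenization: RGB/depth patches become visual tokens,
# and optional IMU or radar measurements are embedded into the same token space.
import torch
import torch.nn as nn


class ViewTokenizer(nn.Module):  # hypothetical module, shapes are assumptions
    def __init__(self, patch=16, dim=256):
        super().__init__()
        # 4 input channels: RGB + depth (depth channel zeroed when unavailable)
        self.patch_embed = nn.Conv2d(4, dim, kernel_size=patch, stride=patch)
        self.intrin_embed = nn.Linear(4, dim)   # fx, fy, cx, cy
        self.imu_embed = nn.Linear(7, dim)      # e.g. translation + quaternion
        self.radar_embed = nn.Linear(4, dim)    # e.g. x, y, z, Doppler velocity

    def forward(self, rgb, depth=None, intrinsics=None, imu=None, radar=None):
        # rgb: (B, 3, H, W); depth: (B, 1, H, W) or None
        if depth is None:
            depth = torch.zeros_like(rgb[:, :1])
        tokens = self.patch_embed(torch.cat([rgb, depth], dim=1))  # (B, D, H/p, W/p)
        tokens = tokens.flatten(2).transpose(1, 2)                 # (B, L, D)
        extra = []
        if intrinsics is not None:                                 # (B, 3, 3)
            K = torch.stack([intrinsics[:, 0, 0], intrinsics[:, 1, 1],
                             intrinsics[:, 0, 2], intrinsics[:, 1, 2]], dim=-1)
            extra.append(self.intrin_embed(K).unsqueeze(1))        # (B, 1, D)
        if imu is not None:                                        # (B, 7)
            extra.append(self.imu_embed(imu).unsqueeze(1))         # (B, 1, D)
        if radar is not None:                                      # (B, M, 4)
            extra.append(self.radar_embed(radar))                  # (B, M, D)
        return torch.cat([tokens] + extra, dim=1)                  # (B, L', D)
```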
2. Egocentric → Allocentric Fusion
- Tokens are first processed in egocentric space (local camera coordinates) to preserve high‑frequency geometric detail.
- A lightweight pose‑aware transformer then lifts these tokens into a shared allocentric (world) space, where global motion (scene flow) is reasoned about.
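The egocentric‑to‑allocentric lift can be illustrated with standard pinhole geometry; the helper below is a sketch under that assumption and is not taken from the paper's code.

```python
# Sketch of the egocentric -> allocentric lift: back-project each view's depth
# with its intrinsics into camera-frame points, then map them into the shared
# world frame with the camera-to-world pose.
import torch


def lift_to_world(depth, K, cam_to_world):
    """depth: (H, W); K: (3, 3); cam_to_world: (4, 4). Returns (H, W, 3)."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                          torch.arange(W, dtype=depth.dtype), indexing="ij")
    # Egocentric: back-project pixels to 3D points in the camera frame.
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    pts_cam = torch.stack([x, y, depth], dim=-1)      # (H, W, 3)
    # Allocentric: rigidly transform into the shared world frame.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t                          # (H, W, 3)
```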
3. Feed‑Forward Prediction
- The transformer outputs dense depth for each view and scene‑flow vectors that map every pixel from its source frame to every target frame.
- Because the model is fully feed‑forward, there is no iterative optimization at test time—just a single forward pass through the network.
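To show what dense per‑pixel scene flow buys at the output, the sketch below advects the world‑frame points of a source view by their predicted flow and projects them into a target camera; the function name and conventions (world‑frame flow vectors, pinhole target camera) are assumptions, not the paper's definitions.

```python
# Sketch: use predicted scene flow to map every pixel of source view i into
# target view j. Assumes world-frame points and flow (see lift_to_world above).
import torch


def project_to_view(pts_world_i, flow_i_to_j, K_j, world_to_cam_j):
    """pts_world_i, flow_i_to_j: (H, W, 3); K_j: (3, 3); world_to_cam_j: (4, 4).
    Returns pixel coordinates (H, W, 2) of each source pixel in target view j."""
    pts_j_world = pts_world_i + flow_i_to_j           # advect by scene flow
    R, t = world_to_cam_j[:3, :3], world_to_cam_j[:3, 3]
    pts_cam = pts_j_world @ R.T + t                   # world -> camera j frame
    z = pts_cam[..., 2:3].clamp(min=1e-6)             # guard against divide-by-zero
    uv = (pts_cam @ K_j.T)[..., :2] / z               # pinhole projection
    return uv
```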
4. Training Objective
- Supervision combines photometric consistency, depth regression, and flow smoothness losses.
- When ground‑truth metric poses are available, an additional pose‑alignment loss enforces global scale consistency.
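A minimal sketch of such a composite objective is given below; the specific loss terms, robust norms, and weights are illustrative assumptions and may differ from the paper's exact formulation.

```python
# Hedged sketch of a composite objective: photometric consistency, depth
# regression, flow smoothness, and an optional pose-alignment term.
import torch
import torch.nn.functional as F


def smoothness(flow):
    # First-order spatial smoothness on a (H, W, 3) flow field.
    dx = (flow[:, 1:] - flow[:, :-1]).abs().mean()
    dy = (flow[1:, :] - flow[:-1, :]).abs().mean()
    return dx + dy


def total_loss(pred_depth, gt_depth, warped_rgb, target_rgb, flow,
               pred_pose=None, gt_pose=None,
               w_photo=1.0, w_depth=1.0, w_smooth=0.1, w_pose=1.0):
    loss = w_photo * F.l1_loss(warped_rgb, target_rgb)       # photometric consistency
    loss = loss + w_depth * F.l1_loss(pred_depth, gt_depth)  # depth regression
    loss = loss + w_smooth * smoothness(flow)                # flow smoothness
    if gt_pose is not None:                                  # pose alignment, when
        loss = loss + w_pose * F.l1_loss(pred_pose, gt_pose) # metric poses exist
    return loss
```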
The overall pipeline can be visualized as a stack of per‑view encoders → a shared transformer → per‑view decoders, all operating on a unified token representation that mixes visual and inertial/radar cues.
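The sketch below wires that stack together as a compact, self‑contained module; the layer types, sizes, and single flow field per view (rather than flow to every target frame) are simplifying assumptions for illustration only.

```python
# Compact, self-contained sketch of the stack: shared per-view encoder ->
# joint transformer over all views' tokens -> per-view dense heads.
import torch
import torch.nn as nn


class Any4DSketch(nn.Module):  # illustrative stand-in, not the paper's model
    def __init__(self, dim=256, patch=16, layers=6, heads=8):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(block, num_layers=layers)
        self.depth_head = nn.Linear(dim, patch * patch)      # per-patch depth
        self.flow_head = nn.Linear(dim, patch * patch * 3)   # per-patch 3D flow
        self.patch = patch

    def forward(self, frames):                    # frames: (B, N, 3, H, W)
        B, N, _, H, W = frames.shape
        tok = self.encoder(frames.flatten(0, 1))  # (B*N, D, H/p, W/p)
        tok = tok.flatten(2).transpose(1, 2)      # (B*N, L, D)
        L = tok.shape[1]
        tok = tok.reshape(B, N * L, -1)           # concatenate all views' tokens
        tok = self.fusion(tok)                    # single feed-forward pass
        tok = tok.reshape(B, N, L, -1)
        p = self.patch
        depth = self.depth_head(tok).reshape(B, N, H // p, W // p, p, p)
        depth = depth.permute(0, 1, 2, 4, 3, 5).reshape(B, N, H, W)
        flow = self.flow_head(tok).reshape(B, N, H // p, W // p, p, p, 3)
        flow = flow.permute(0, 1, 2, 4, 3, 5, 6).reshape(B, N, H, W, 3)
        return depth, flow


# Example: depth, flow = Any4DSketch()(torch.randn(1, 4, 3, 64, 64))
# -> depth (1, 4, 64, 64), flow (1, 4, 64, 64, 3)
```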
Results & Findings
| Dataset / Modality | Reconstruction RMSE (reduction vs. prior art) | Speedup vs. Prior Art |
|---|---|---|
| Synthetic RGB‑D (4‑view) | 0.12 m (↓ 2.5×) | 12× faster |
| Real‑world driving (RGB + IMU) | 0.18 m (↓ 3×) | 15× faster |
| Radar‑augmented night sequences | 0.22 m (↓ 2×) | 10× faster |
- Accuracy: Any4D consistently reduces depth and flow errors by 2–3× across diverse sensor setups.
- Efficiency: The feed‑forward design eliminates costly iterative refinement, delivering real‑time performance (≈30 fps on a single RTX 3090 for 4‑view inputs).
- Robustness: Adding auxiliary modalities (e.g., radar) further improves reconstruction in low‑light or texture‑poor scenes, confirming the benefit of the modular representation.
Practical Implications
- Robotics & Autonomous Vehicles: Engineers can obtain metric‑scale 3‑D maps and motion fields on‑the‑fly, enabling better obstacle avoidance, path planning, and SLAM without heavy post‑processing.
- AR/VR Content Creation: Real‑time dense reconstruction from a handheld device (RGB‑D or even just RGB + IMU) makes it feasible to generate immersive environments on‑device, reducing reliance on cloud processing.
- Multi‑Sensor Fusion Platforms: The same model can be deployed across robots that have different sensor suites, simplifying software stacks and reducing the need for custom pipelines per hardware configuration.
- Edge Deployment: Because inference is a single forward pass, Any4D can be optimized for edge AI accelerators, opening the door to low‑power, on‑board 4‑D perception.
Limitations & Future Work
- Scale Ambiguity without Metric Sensors: Pure RGB setups still rely on learned scale priors; absolute metric accuracy improves markedly when depth or IMU data are present.
- Memory Footprint: Processing many high‑resolution frames simultaneously can exceed GPU memory limits; the authors suggest hierarchical token sampling as a mitigation.
- Dynamic Objects: While scene flow captures motion, highly non‑rigid deformations (e.g., cloth) remain challenging and may require specialized motion models.
- Future Directions: Extending the framework to handle streaming video (online updating), incorporating learned uncertainty estimates, and exploring tighter integration with downstream tasks like object detection or control.
Authors
- Jay Karhade
- Nikhil Keetha
- Yuchen Zhang
- Tanisha Gupta
- Akash Sharma
- Sebastian Scherer
- Deva Ramanan
Paper Information
- arXiv ID: 2512.10935v1
- Categories: cs.CV, cs.AI, cs.LG, cs.RO
- Published: December 11, 2025
- PDF: https://arxiv.org/pdf/2512.10935v1