[Paper] Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
Source: arXiv - 2512.08924v1
Overview
The paper presents D4RT, a feed‑forward transformer that can reconstruct the full 3‑D geometry, motion, and camera pose of a dynamic scene from a single video clip. By replacing the usual dense, per‑frame decoding pipelines with a lightweight querying interface, D4RT achieves state‑of‑the‑art results on a variety of 4‑D (space + time) reconstruction benchmarks while being dramatically faster and easier to train.
Key Contributions
- Unified transformer backbone that simultaneously predicts depth, dense spatio‑temporal correspondences, and full camera intrinsics/extrinsics from raw video.
- Query‑based decoding: instead of decoding an entire feature map for every frame, the model answers arbitrary 3‑D‑plus‑time queries, cutting compute by orders of magnitude.
- Task‑agnostic interface: the same decoder can be used to retrieve depth, motion vectors, or camera parameters without separate heads.
- Scalable training: the feed‑forward design eliminates recurrent or iterative refinement steps, enabling training on commodity GPUs with large batches of video clips.
- State‑of‑the‑art performance on multiple 4‑D reconstruction tasks (dynamic scene flow, multi‑view depth, and camera pose estimation) with up to 3× faster inference than prior methods.
Methodology
- Backbone encoding – A video is split into overlapping spatio‑temporal patches, which are linearly embedded and fed into a standard Vision Transformer (ViT). Positional encodings capture both spatial location and time index.
- Unified latent space – The transformer produces a single set of latent tokens that encode geometry, motion, and camera information jointly. No separate branches are needed.
- Query mechanism – To retrieve a specific 3‑D point at time t, the user supplies a query vector containing its (x, y, z, t) coordinates. The query cross‑attends to the latent tokens, producing a compact representation that a tiny MLP decoder maps to the requested output (a minimal sketch of this pattern follows after this list).
- Outputs – The decoder can be asked for:
  - Depth at any pixel (by querying the corresponding ray).
  - Correspondence / flow between two timestamps (by querying the same spatial location at two times).
  - Camera parameters (by using a special “camera query” that aggregates global information).
- Training losses – The model is supervised with a combination of photometric reconstruction loss, depth supervision (when available), and pose consistency loss. Because the queries are differentiable, gradients flow back through the entire transformer end‑to‑end.
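To make the querying interface concrete, below is a minimal PyTorch‑style sketch of the encode‑then‑query pattern described above. It is an illustration of this summary, not the authors' implementation: the module names (SpaceTimePatchEncoder, QueryDecoder, d4rt_style_loss), all layer sizes, the head dimensionalities, and the exact (x, y, z, t) query parameterisation are assumptions; the photometric reconstruction term is omitted because it would require an explicit warping step, and the special “camera query” is collapsed into a per‑query head for brevity.

```python
# Hypothetical sketch of D4RT-style query-based decoding -- NOT the authors' code.
# Assumes a PyTorch implementation; all names, sizes, and the (x, y, z, t) query
# parameterisation are illustrative guesses based on the summary above.
import torch
import torch.nn as nn


class SpaceTimePatchEncoder(nn.Module):
    """Embeds flattened spatio-temporal patches and runs a plain ViT-style encoder."""

    def __init__(self, patch_dim, d_model=512, n_layers=8, n_heads=8):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, d_model)   # linear patch embedding
        self.pos_embed = nn.Linear(3, d_model)             # encodes (x, y, t) patch positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, patches, patch_coords):
        # patches:      (B, N, patch_dim) flattened spatio-temporal patches
        # patch_coords: (B, N, 3)         normalised (x, y, t) of each patch centre
        tokens = self.patch_embed(patches) + self.pos_embed(patch_coords)
        return self.encoder(tokens)                        # (B, N, d_model) latent tokens


class QueryDecoder(nn.Module):
    """Answers point-wise (x, y, z, t) queries by cross-attending to the latent tokens."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.query_embed = nn.Sequential(
            nn.Linear(4, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Tiny task-specific heads reading the same query representation.
        self.depth_head = nn.Linear(d_model, 1)    # depth along the queried ray
        self.flow_head = nn.Linear(d_model, 3)     # displacement between two timestamps
        self.camera_head = nn.Linear(d_model, 12)  # e.g. flattened intrinsics + 6-DoF pose

    def forward(self, latents, queries):
        # latents: (B, N, d_model) from the encoder; queries: (B, Q, 4) = (x, y, z, t)
        q = self.query_embed(queries)
        attended, _ = self.cross_attn(q, latents, latents)  # decode only the Q queried points
        return {
            "depth": self.depth_head(attended),
            "flow": self.flow_head(attended),
            "camera": self.camera_head(attended),
        }


def d4rt_style_loss(pred, target, w_depth=1.0, w_pose=0.1):
    """Toy version of the supervised terms named above (depth + pose consistency).
    The photometric reconstruction term is omitted here; it would require warping
    frames with the predicted geometry."""
    loss = w_depth * (pred["depth"] - target["depth"]).abs().mean()
    loss = loss + w_pose * (pred["camera"] - target["camera"]).pow(2).mean()
    return loss


if __name__ == "__main__":
    # Toy shapes only: 256 patches of 8x16x16 RGB voxels, 64 point queries.
    enc = SpaceTimePatchEncoder(patch_dim=3 * 8 * 16 * 16)
    dec = QueryDecoder()
    latents = enc(torch.randn(1, 256, 3 * 8 * 16 * 16), torch.rand(1, 256, 3))
    outputs = dec(latents, torch.rand(1, 64, 4))
    print({k: tuple(v.shape) for k, v in outputs.items()})
```

In this sketch, decoding cost scales with the number of queried points rather than with a full per‑frame feature map, which is where the summary's claimed compute savings come from; the dictionary of outputs mirrors the task‑agnostic interface, since depth, flow, and camera parameters are all read out of the same attended query representation.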
Results & Findings
- Quantitative gains: On the Dynamic Scene Flow (DSF) benchmark, D4RT reduces endpoint error by 12 % relative to the previous best method. On the KITTI‑360 multi‑view depth task, it reduces absolute depth error by 0.08 m.
- Speed: Inference on a 10‑second video (30 fps) runs in ~0.6 s on an NVIDIA RTX 3090, compared to >2 s for the closest competitor.
- Memory footprint: The query‑based decoder keeps GPU memory under 8 GB for 4K‑resolution video, enabling training on a single GPU.
- Generalization: A single D4RT model trained on a mixed dataset (indoor + outdoor) works out‑of‑the‑box on unseen scenes without fine‑tuning, demonstrating robustness to domain shift.
Practical Implications
- AR/VR content creation – Developers can capture a handheld video of a moving scene and instantly obtain a full 4‑D mesh for immersive experiences, without costly multi‑camera rigs.
- Robotics and autonomous navigation – Real‑time depth + motion + pose estimation from a single onboard camera simplifies SLAM pipelines and improves obstacle prediction in dynamic environments.
- Film VFX – The query interface lets artists extract precise 3‑D points at any frame, facilitating rotoscoping, object removal, or virtual camera insertion with far less manual labor.
- Cloud‑scale video analytics – Because D4RT is lightweight, large video archives can be processed in batch to extract scene dynamics for indexing, search, or safety monitoring.
Limitations & Future Work
- Sparse supervision – The model still relies on some ground‑truth depth or pose data during training; fully self‑supervised learning remains an open challenge.
- Extreme motion blur – Very fast motions degrade the photometric loss, leading to occasional depth/flow artifacts.
- Long‑term temporal consistency – Queries are independent; ensuring smoothness over many seconds may require an additional temporal regularizer.
- Future directions suggested by the authors include integrating learned optical flow priors, extending the query language to support semantic attributes (e.g., “where is the car at t=5 s?”), and scaling to multi‑camera rigs for even richer reconstructions.
Authors
- Chuhan Zhang
- Guillaume Le Moing
- Skanda Koppula
- Ignacio Rocco
- Liliane Momeni
- Junyu Xie
- Shuyang Sun
- Rahul Sukthankar
- Joëlle K. Barral
- Raia Hadsell
- Zoubin Ghahramani
- Andrew Zisserman
- Junlin Zhang
- Mehdi S. M. Sajjadi
Paper Information
- arXiv ID: 2512.08924v1
- Categories: cs.CV
- Published: December 9, 2025
- PDF: https://arxiv.org/pdf/2512.08924v1