[Paper] Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
Source: arXiv - 2512.08924v1
Overview
The paper presents D4RT, a feed‑forward transformer that can reconstruct the full 3‑D geometry, motion, and camera pose of a dynamic scene from a single video clip. By replacing the usual dense, per‑frame decoding pipelines with a lightweight querying interface, D4RT achieves state‑of‑the‑art results on a variety of 4‑D (space + time) reconstruction benchmarks while being dramatically faster and easier to train.
Key Contributions
- Unified transformer backbone that simultaneously predicts depth, dense spatio‑temporal correspondences, and full camera intrinsics/extrinsics from raw video.
- Query‑based decoding: instead of decoding an entire feature map for every frame, the model answers arbitrary 3‑D‑plus‑time queries, cutting compute by orders of magnitude.
- Task‑agnostic interface: the same decoder can be used to retrieve depth, motion vectors, or camera parameters without separate heads.
- Scalable training: the feed‑forward design eliminates recurrent or iterative refinement steps, enabling training on commodity GPUs with large batches of video clips.
- State‑of‑the‑art performance on multiple 4‑D reconstruction tasks (dynamic scene flow, multi‑view depth, and camera pose estimation) with up to 3× faster inference than prior methods.
Methodology
- Backbone encoding – A video is split into overlapping spatio‑temporal patches, which are linearly embedded and fed into a standard Vision Transformer (ViT). Positional encodings capture both spatial location and time index.
- Unified latent space – The transformer produces a single set of latent tokens that encode geometry, motion, and camera information jointly. No separate branches are needed.
- Query mechanism – To retrieve a specific 3‑D point at time t, the user supplies a query vector containing its (x, y, z, t) coordinates. The query cross‑attends to the latent tokens, producing a compact representation that a tiny MLP decoder maps to the requested output (a minimal sketch of this pattern follows after this list).
- Outputs – The decoder can be asked for:
  - Depth at any pixel (by querying the corresponding ray).
  - Correspondence / flow between two timestamps (by querying the same spatial location at two times).
  - Camera parameters (by using a special “camera query” that aggregates global information).
- Training losses – The model is supervised with a combination of photometric reconstruction loss, depth supervision (when available), and pose consistency loss. Because the queries are differentiable, gradients flow back through the entire transformer end‑to‑end.
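To make the querying interface concrete, below is a minimal PyTorch‑style sketch of the encode‑then‑query pattern described above. It is an illustration of this summary, not the authors' implementation: the module names (SpaceTimePatchEncoder, QueryDecoder, d4rt_style_loss), all layer sizes, the head dimensionalities, and the exact (x, y, z, t) query parameterisation are assumptions; the photometric reconstruction term is omitted because it would require an explicit warping step, and the special “camera query” is collapsed into a per‑query head for brevity.

```python
# Hypothetical sketch of D4RT-style query-based decoding -- NOT the authors' code.
# Assumes a PyTorch implementation; all names, sizes, and the (x, y, z, t) query
# parameterisation are illustrative guesses based on the summary above.
import torch
import torch.nn as nn


class SpaceTimePatchEncoder(nn.Module):
    """Embeds flattened spatio-temporal patches and runs a plain ViT-style encoder."""

    def __init__(self, patch_dim, d_model=512, n_layers=8, n_heads=8):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, d_model)   # linear patch embedding
        self.pos_embed = nn.Linear(3, d_model)             # encodes (x, y, t) patch positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, patches, patch_coords):
        # patches:      (B, N, patch_dim) flattened spatio-temporal patches
        # patch_coords: (B, N, 3)         normalised (x, y, t) of each patch centre
        tokens = self.patch_embed(patches) + self.pos_embed(patch_coords)
        return self.encoder(tokens)                        # (B, N, d_model) latent tokens


class QueryDecoder(nn.Module):
    """Answers point-wise (x, y, z, t) queries by cross-attending to the latent tokens."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.query_embed = nn.Sequential(
            nn.Linear(4, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Tiny task-specific heads reading the same query representation.
        self.depth_head = nn.Linear(d_model, 1)    # depth along the queried ray
        self.flow_head = nn.Linear(d_model, 3)     # displacement between two timestamps
        self.camera_head = nn.Linear(d_model, 12)  # e.g. flattened intrinsics + 6-DoF pose

    def forward(self, latents, queries):
        # latents: (B, N, d_model) from the encoder; queries: (B, Q, 4) = (x, y, z, t)
        q = self.query_embed(queries)
        attended, _ = self.cross_attn(q, latents, latents)  # decode only the Q queried points
        return {
            "depth": self.depth_head(attended),
            "flow": self.flow_head(attended),
            "camera": self.camera_head(attended),
        }


def d4rt_style_loss(pred, target, w_depth=1.0, w_pose=0.1):
    """Toy version of the supervised terms named above (depth + pose consistency).
    The photometric reconstruction term is omitted here; it would require warping
    frames with the predicted geometry."""
    loss = w_depth * (pred["depth"] - target["depth"]).abs().mean()
    loss = loss + w_pose * (pred["camera"] - target["camera"]).pow(2).mean()
    return loss


if __name__ == "__main__":
    # Toy shapes only: 256 patches of 8x16x16 RGB voxels, 64 point queries.
    enc = SpaceTimePatchEncoder(patch_dim=3 * 8 * 16 * 16)
    dec = QueryDecoder()
    latents = enc(torch.randn(1, 256, 3 * 8 * 16 * 16), torch.rand(1, 256, 3))
    outputs = dec(latents, torch.rand(1, 64, 4))
    print({k: tuple(v.shape) for k, v in outputs.items()})
```

In this sketch, decoding cost scales with the number of queried points rather than with a full per‑frame feature map, which is where the summary's claimed compute savings come from; the dictionary of outputs mirrors the task‑agnostic interface, since depth, flow, and camera parameters are all read out of the same attended query representation.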
Results & Findings
- Quantitative gains: On the Dynamic Scene Flow (DSF) benchmark, D4RT reduces endpoint error by 12 % relative to the previous best method. On the KITTI‑360 multi‑view depth task, it reduces absolute depth error by 0.08 m.
- Speed: Inference on a 10‑second video (30 fps) runs in ~0.6 s on an NVIDIA RTX 3090, compared to >2 s for the closest competitor.
- Memory footprint: The query‑based decoder keeps GPU memory under 8 GB for 4K‑resolution video, enabling training on a single GPU.
- Generalization: A single D4RT model trained on a mixed dataset (indoor + outdoor) works out‑of‑the‑box on unseen scenes without fine‑tuning, demonstrating robustness to domain shift.
Practical Implications
- AR/VR content creation – Developers can capture a handheld video of a moving scene and instantly obtain a full 4‑D mesh for immersive experiences, without costly multi‑camera rigs.
- Robotics and autonomous navigation – Real‑time depth + motion + pose estimation from a single onboard camera simplifies SLAM pipelines and improves obstacle prediction in dynamic environments.
- Film VFX – The query interface lets artists extract precise 3‑D points at any frame, facilitating rotoscoping, object removal, or virtual camera insertion with far less manual labor.
- Cloud‑scale video analytics – Because D4RT is lightweight, large video archives can be processed in batch to extract scene dynamics for indexing, search, or safety monitoring.
Limitations & Future Work
- Sparse supervision – The model still relies on some ground‑truth depth or pose data during training; fully self‑supervised learning remains an open challenge.
- Extreme motion blur – Very fast motions degrade the photometric loss, leading to occasional depth/flow artifacts.
- Long‑term temporal consistency – Queries are independent; ensuring smoothness over many seconds may require an additional temporal regularizer.
- Future directions suggested by the authors include integrating learned optical flow priors, extending the query language to support semantic attributes (e.g., “where is the car at t=5 s?”), and scaling to multi‑camera rigs for even richer reconstructions.
Authors
- Chuhan Zhang
- Guillaume Le Moing
- Skanda Koppula
- Ignacio Rocco
- Liliane Momeni
- Junyu Xie
- Shuyang Sun
- Rahul Sukthankar
- Joëlle K. Barral
- Raia Hadsell
- Zoubin Ghahramani
- Andrew Zisserman
- Junlin Zhang
- Mehdi S. M. Sajjadi
Paper Information
- arXiv ID: 2512.08924v1
- Categories: cs.CV
- Published: December 9, 2025
- PDF: https://arxiv.org/pdf/2512.08924v1