[Paper] Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

Published: December 9, 2025, 01:57 PM EST
4 min read

Source: arXiv - 2512.08924v1

Overview

The paper presents D4RT, a feed‑forward transformer that can reconstruct the full 3‑D geometry, motion, and camera pose of a dynamic scene from a single video clip. By replacing the usual dense, per‑frame decoding pipelines with a lightweight querying interface, D4RT achieves state‑of‑the‑art results on a variety of 4‑D (space + time) reconstruction benchmarks while being dramatically faster and easier to train.

Key Contributions

  • Unified transformer backbone that simultaneously predicts depth, dense spatio‑temporal correspondences, and full camera intrinsics/extrinsics from raw video.
  • Query‑based decoding: instead of decoding an entire feature map for every frame, the model answers arbitrary 3‑D‑plus‑time queries, cutting compute by orders of magnitude.
  • Task‑agnostic interface: the same decoder retrieves depth, motion vectors, or camera parameters without separate heads (see the query‑packing sketch after this list).
  • Scalable training: the feed‑forward design eliminates recurrent or iterative refinement steps, enabling training on commodity GPUs with batches of many video clips.
  • State‑of‑the‑art performance on multiple 4‑D reconstruction tasks (dynamic scene flow, multi‑view depth, and camera pose estimation) with up to 3× faster inference than prior methods.
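To make the task‑agnostic query interface more concrete, here is a minimal sketch, assuming a generic decoder(latent_tokens, queries) callable; the helper names, sentinel value, and query layouts below are illustrative assumptions, not the paper's actual API.

```python
# Illustrative query packing for a single shared decoder. The layouts below are
# assumptions for exposition, not the paper's exact query format.

CAMERA_TOKEN = -1.0  # assumed sentinel marking a global "camera" query


def depth_queries(pixels, t):
    """Depth at arbitrary pixels: one query per pixel ray (u, v) at time t."""
    return [(u, v, t) for (u, v) in pixels]


def flow_queries(pixels, t_src, t_dst):
    """Correspondence / flow: the same spatial location queried at two timestamps."""
    return [(u, v, t_src, t_dst) for (u, v) in pixels]


def camera_query(t):
    """Camera intrinsics/extrinsics: a single special query aggregating global info."""
    return [(CAMERA_TOKEN, t)]


# All three query sets would be answered by the same decoder, e.g.
#   outputs = decoder(latent_tokens, depth_queries(pixels, t=0.5))
# so no task-specific heads or dense per-frame feature maps are required.
```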

Methodology

  1. Backbone encoding – A video is split into overlapping spatio‑temporal patches, which are linearly embedded and fed into a standard Vision Transformer (ViT). Positional encodings capture both spatial location and time index.
  2. Unified latent space – The transformer produces a single set of latent tokens that encode geometry, motion, and camera information jointly. No separate branches are needed.
  3. Query mechanism – To retrieve a specific 3‑D point at time t, the user supplies a query vector containing the (x, y, z, t) coordinates. The query cross‑attends to the latent tokens, producing a compact representation that is then passed through a tiny MLP decoder (a minimal sketch of this step follows the list).
  4. Outputs – The decoder can be asked for:
    • Depth at any pixel (by querying the corresponding ray).
    • Correspondence / flow between two timestamps (by querying the same spatial location at two times).
    • Camera parameters (by using a special “camera query” that aggregates global information).
  5. Training losses – The model is supervised with a combination of photometric reconstruction loss, depth supervision (when available), and pose consistency loss. Because the queries are differentiable, gradients flow back through the entire transformer end‑to‑end.
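The decoding step above can be pictured as a single cross‑attention layer plus a tiny MLP head. The PyTorch sketch below illustrates that idea under assumed layer sizes and an assumed (x, y, z, t) query parameterization; it is not the authors' implementation.

```python
# Minimal sketch of query-based decoding: learned video latents are attended
# by a small query embedding, and a tiny MLP maps the result to the output.
# Module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class QueryDecoder(nn.Module):
    def __init__(self, latent_dim=768, query_in=4, hidden=256, out_dim=1):
        super().__init__()
        # Embed an (x, y, z, t) query into the latent width.
        self.query_embed = nn.Sequential(
            nn.Linear(query_in, latent_dim), nn.GELU(), nn.Linear(latent_dim, latent_dim)
        )
        # One cross-attention layer: queries attend to the video latents.
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads=8, batch_first=True)
        # Tiny MLP head producing the requested quantity (e.g. depth).
        self.head = nn.Sequential(nn.Linear(latent_dim, hidden), nn.GELU(), nn.Linear(hidden, out_dim))

    def forward(self, latents, queries):
        # latents: (B, N_tokens, latent_dim) from the video transformer encoder
        # queries: (B, N_queries, 4) holding (x, y, z, t) per query
        q = self.query_embed(queries)
        attended, _ = self.cross_attn(q, latents, latents)
        return self.head(attended)  # (B, N_queries, out_dim)


# Usage: decode 1,000 query points without building a dense per-frame map.
latents = torch.randn(2, 4096, 768)       # stand-in for encoder output
queries = torch.rand(2, 1000, 4)          # normalized (x, y, z, t)
depth = QueryDecoder()(latents, queries)  # (2, 1000, 1)
```

Because everything in this sketch is differentiable, the photometric, depth, and pose‑consistency losses described in step 5 can be applied directly to the queried outputs and backpropagated through the encoder end to end.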

Results & Findings

  • Quantitative gains: On the Dynamic Scene Flow (DSF) benchmark, D4RT reduces endpoint error by 12% relative to the previous best method; on the KITTI‑360 multi‑view depth task, it reduces absolute depth error by 0.08 m.
  • Speed: Inference on a 10‑second video (30 fps) runs in ~0.6 s on an NVIDIA RTX 3090, compared to >2 s for the closest competitor.
  • Memory footprint: The query‑based decoder keeps GPU memory under 8 GB for 4K video, enabling training on a single GPU.
  • Generalization: A single D4RT model trained on a mixed dataset (indoor + outdoor) works out‑of‑the‑box on unseen scenes without fine‑tuning, demonstrating robustness to domain shift.

Practical Implications

  • AR/VR content creation – Developers can capture a handheld video of a moving scene and instantly obtain a full 4‑D mesh for immersive experiences, without costly multi‑camera rigs.
  • Robotics and autonomous navigation – Real‑time depth + motion + pose estimation from a single onboard camera simplifies SLAM pipelines and improves obstacle prediction in dynamic environments.
  • Film VFX – The query interface lets artists extract precise 3‑D points at any frame, facilitating rotoscoping, object removal, or virtual camera insertion with far less manual labor.
  • Cloud‑scale video analytics – Because D4RT is lightweight, large video archives can be processed in batch to extract scene dynamics for indexing, search, or safety monitoring.

Limitations & Future Work

  • Sparse supervision – The model still relies on some ground‑truth depth or pose data during training; fully self‑supervised learning remains an open challenge.
  • Extreme motion blur – Very fast motions degrade the photometric loss, leading to occasional depth/flow artifacts.
  • Long‑term temporal consistency – Queries are independent; ensuring smoothness over many seconds may require an additional temporal regularizer.
  • Future directions suggested by the authors include integrating learned optical flow priors, extending the query language to support semantic attributes (e.g., “where is the car at t=5 s?”), and scaling to multi‑camera rigs for even richer reconstructions.

Authors

  • Chuhan Zhang
  • Guillaume Le Moing
  • Skanda Koppula
  • Ignacio Rocco
  • Liliane Momeni
  • Junyu Xie
  • Shuyang Sun
  • Rahul Sukthankar
  • Joëlle K Barral
  • Raia Hadsell
  • Zoubin Ghahramani
  • Andrew Zisserman
  • Junlin Zhang
  • Mehdi SM Sajjadi

Paper Information

  • arXiv ID: 2512.08924v1
  • Categories: cs.CV
  • Published: December 9, 2025
  • PDF: https://arxiv.org/pdf/2512.08924v1