[Paper] 4RC: 4D Reconstruction via Conditional Querying Anytime and Anywhere

Published: February 10, 2026 at 01:57 PM EST
5 min read
Source: arXiv - 2602.10094v1

Overview

The paper introduces 4RC, a single‑pass, feed‑forward system that can reconstruct a full 4‑dimensional (3‑D space + time) model of a scene from a regular monocular video. By encoding the whole video once and then allowing arbitrary “queries” for geometry or motion at any frame and any timestamp, 4RC bridges the gap between static 3‑D reconstruction and dynamic scene understanding, opening the door to real‑time, on‑demand 4‑D perception.

Key Contributions

  • Unified 4‑D representation – learns dense scene geometry together with per‑pixel motion dynamics in a single latent space.
  • Encode‑once, query‑anywhere, anytime paradigm – a transformer encoder processes the whole video once; a conditional decoder can retrieve 3‑D shape or motion for any frame without re‑encoding.
  • Minimally factorized attribute modeling – splits per‑view 4‑D data into a base geometry (time‑invariant) and a relative motion component, simplifying learning and improving accuracy.
  • Broad applicability – the same model handles multiple downstream tasks (dense reconstruction, scene flow, trajectory extraction) without task‑specific redesign.
  • State‑of‑the‑art performance – outperforms existing and concurrent methods on several benchmark datasets across a variety of 4‑D reconstruction tasks.

Methodology

  1. Video Encoding – The entire input video is fed into a spatio‑temporal transformer. Positional encodings capture both spatial layout and temporal order, producing a compact latent tensor that summarizes the whole scene’s appearance, shape, and motion.
  2. Conditional Decoding – To retrieve information for a specific query (e.g., “what is the 3‑D surface at frame t = 12?”), a lightweight decoder receives two inputs:
    • the global latent tensor, and
    • a condition vector describing the target frame and timestamp.
      The decoder then outputs either a depth map (geometry) or a per‑pixel 3‑D flow vector (motion).
  3. Factorized 4‑D Attribute Representation – Each per‑view attribute is expressed as
     \[ \text{Attribute}(x, y, t) = \underbrace{G(x, y)}_{\text{base geometry}} + \underbrace{M(x, y, t)}_{\text{relative motion}} \]
     where \(G\) is static across time and \(M\) captures the deviation at each timestamp. This decomposition reduces redundancy and stabilizes training.
  4. Training Losses – The authors combine supervised depth/flow losses (when ground truth is available) with self‑supervised photometric consistency across frames, encouraging the latent space to respect both geometry and dynamics.
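The factorized decomposition in step 3 can be sketched in a few lines of plain Python. The function and variable names (`make_factorized_attribute`, the toy depth values) are illustrative assumptions, not the paper's implementation; the point is only the additive split between a time-invariant base and a per-timestamp deviation:

```python
# Sketch of the minimally factorized 4-D attribute model:
# Attribute(x, y, t) = G(x, y) + M(x, y, t), where G is the static base
# geometry and M the per-timestamp deviation. Names/values are toy examples.

def make_factorized_attribute(base, motion):
    """base: dict (x, y) -> value; motion: dict (x, y, t) -> deviation."""
    def attribute(x, y, t):
        # Time-invariant base plus relative motion (0.0 if no deviation stored).
        return base[(x, y)] + motion.get((x, y, t), 0.0)
    return attribute

# Toy example: one pixel whose depth drifts over time.
G = {(0, 0): 2.0}                      # static depth of 2.0 m
M = {(0, 0, 1): 0.1, (0, 0, 2): 0.25}  # deviations at t = 1 and t = 2
depth = make_factorized_attribute(G, M)
print(depth(0, 0, 0))  # 2.0  (no motion entry at t = 0)
print(depth(0, 0, 2))  # 2.25
```

Because `G` is shared across all timestamps, only the (typically small) deviations `M` vary with time, which is what reduces redundancy during training.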

The whole pipeline runs in a single forward pass for encoding, after which any number of queries can be answered instantly, making it suitable for interactive or real‑time applications.
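The encode-once, query-anywhere pattern amounts to caching one heavy encoding pass and serving cheap conditional decodes against it. The sketch below is a structural analogy only; the class, method names, and stand-in "latent" are hypothetical, not the paper's API:

```python
# Sketch of the encode-once, query-anywhere pattern: the expensive encoder
# runs a single time; every later query reuses the cached latent.
# Class and attribute names are illustrative assumptions.

class Scene4D:
    def __init__(self, video_frames):
        self.latent = self._encode(video_frames)   # one-off heavy pass

    def _encode(self, frames):
        # Stand-in for the spatio-temporal transformer encoder.
        return {"summary": sum(frames), "n_frames": len(frames)}

    def query(self, t, attribute="geometry"):
        # Lightweight conditional decode: cached latent + (t, attribute) condition.
        if attribute == "geometry":
            return ("depth_map", t, self.latent["summary"])
        return ("flow_field", t, self.latent["summary"])

scene = Scene4D([1, 2, 3])       # encode once
geo = scene.query(12)            # geometry at frame t = 12, no re-encoding
flow = scene.query(5, "motion")  # motion at t = 5, also from the same latent
```

The design choice this illustrates: amortizing one O(video) encoding over many O(1) queries is what makes the interactive use cases below plausible.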

Results & Findings

  • Quantitative gains – On standard 4‑D benchmarks (e.g., Dynamic Scene Flow, KITTI‑360), 4RC reduces depth error by ~15 % and improves motion (scene flow) endpoint error by ~12 % compared to the strongest baselines.
  • Temporal flexibility – The model can query frames that were never explicitly observed (interpolation) and still produce accurate geometry, demonstrating strong temporal generalization.
  • Multi‑task competence – A single trained 4RC model achieves competitive results on dense reconstruction, sparse trajectory extraction, and two‑view scene flow without any task‑specific fine‑tuning.
  • Speed – After the initial encoding (≈0.2 s for a 2‑second video on an RTX 3080), each query takes <10 ms, enabling near‑real‑time interactive exploration of the reconstructed scene.

Practical Implications

  • AR/VR content creation – Developers can capture a short handheld video and instantly obtain a full 4‑D model for immersive experiences, eliminating costly multi‑camera rigs.
  • Robotics & autonomous navigation – Robots can ingest a brief dash‑cam clip, encode the environment once, and then query precise 3‑D geometry or motion predictions at any future time step for planning or collision avoidance.
  • Film VFX and post‑production – Artists can retrieve depth and motion for any frame on demand, simplifying rotoscoping, compositing, and dynamic relighting.
  • Remote inspection & digital twins – Industries (construction, infrastructure) can generate updatable 4‑D digital twins from simple video feeds, allowing stakeholders to query the state of a structure at any past or predicted future moment.
  • Scalable cloud services – Because the heavy encoding is a one‑off operation, cloud providers can store the compact latent representation and serve countless low‑latency queries to end‑users, reducing bandwidth and compute costs.

Limitations & Future Work

  • Dependence on video quality – Extremely fast motion, severe motion blur, or low‑light conditions still degrade the latent representation, limiting reconstruction fidelity.
  • Memory footprint for long videos – Encoding very long sequences (minutes) inflates the latent tensor; the authors suggest hierarchical or sliding‑window encodings as a possible remedy.
  • Generalization to unseen domains – While the model transfers reasonably across datasets, domain shifts (e.g., underwater scenes) may require additional fine‑tuning.
  • Future directions – The authors plan to explore adaptive latent compression, integrate semantic labeling into the 4‑D space, and extend the framework to multi‑camera or multi‑modal inputs (e.g., LiDAR + video).
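The sliding-window remedy suggested for long videos could look roughly like the chunking below; the window length and stride are arbitrary illustrative values, and the real system would still need to fuse or align the per-window latents:

```python
# Sketch of sliding-window encoding for long sequences: each window is
# encoded independently, bounding the latent size per chunk.
# Window length and stride are arbitrary choices for illustration.

def sliding_windows(frames, window=8, stride=4):
    """Yield overlapping frame windows covering the sequence."""
    for start in range(0, max(len(frames) - window, 0) + 1, stride):
        yield frames[start:start + window]

frames = list(range(20))             # 20-frame toy "video"
windows = list(sliding_windows(frames))
print(len(windows))                  # 4 overlapping windows
print(windows[0])                    # [0, 1, 2, 3, 4, 5, 6, 7]
```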

Authors

  • Yihang Luo
  • Shangchen Zhou
  • Yushi Lan
  • Xingang Pan
  • Chen Change Loy

Paper Information

  • arXiv ID: 2602.10094v1
  • Categories: cs.CV
  • Published: February 10, 2026