[Paper] UFO-4D: Unposed Feedforward 4D Reconstruction from Two Images

Published: February 27, 2026 at 01:59 PM EST
Source: arXiv

Overview

UFO‑4D presents a single feed‑forward network that turns just two uncalibrated photos into a dense, time‑varying 3‑D model. By predicting a set of dynamic 3‑D Gaussian “splats,” the system simultaneously recovers scene geometry, per‑pixel motion, and the relative camera pose, all without any test‑time optimization. This makes dense 4‑D reconstruction fast enough for interactive applications while matching the quality of much slower, optimization‑heavy pipelines.

Key Contributions

  • Unified feed‑forward pipeline that outputs a full 4‑D representation (geometry + motion + camera pose) from only two unposed images.
  • Dynamic 3‑D Gaussian splats as the core primitive, enabling differentiable rendering of color, depth, and optical flow from a single representation.
  • Self‑supervised training using a multi‑modal image synthesis loss (RGB, depth, flow) that tightly couples appearance, geometry, and motion, dramatically reducing the need for ground‑truth 4‑D data.
  • State‑of‑the‑art performance: up to 3× improvement over prior methods on joint geometry, motion, and pose benchmarks.
  • High‑fidelity 4‑D interpolation: the learned Gaussian cloud can be rendered from novel viewpoints and intermediate time steps, opening the door to smooth view‑synthesis and motion‑editing.
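The dynamic splat primitive from the second bullet can be sketched as a small data structure. The field names and the constant‑velocity motion model below are illustrative assumptions, not the paper’s exact parameterization:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class DynamicGaussian:
    """One dynamic 3-D Gaussian splat (illustrative parameterization)."""
    mean: np.ndarray        # (3,) 3-D center position
    covariance: np.ndarray  # (3, 3) shape and orientation
    color: np.ndarray       # (3,) RGB appearance
    velocity: np.ndarray    # (3,) linear 3-D motion vector

    def position_at(self, t: float) -> np.ndarray:
        """Advect the splat center to time t under constant velocity."""
        return self.mean + t * self.velocity


# A splat at the origin moving along +x at one unit per time step.
g = DynamicGaussian(
    mean=np.zeros(3),
    covariance=np.eye(3) * 0.01,
    color=np.array([0.8, 0.2, 0.2]),
    velocity=np.array([1.0, 0.0, 0.0]),
)
print(g.position_at(0.5))  # → [0.5 0.  0. ]
```

Because every splat carries both appearance and motion, a single cloud of them can be queried at any time step, which is what enables the 4‑D interpolation described above.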

Methodology

  1. Input & Encoder – Two RGB images (no known intrinsics/extrinsics) are passed through a shared CNN backbone that extracts multi‑scale feature maps.
  2. Gaussian Prediction Head – From the fused features, the network predicts a set of 3‑D Gaussian parameters:
    • Mean position (3‑D location)
    • Covariance (shape & orientation)
    • Appearance (RGB color)
    • Velocity (3‑D motion vector)
  3. Differentiable Rendering Layer – The predicted Gaussian cloud is rendered three ways:
    • Color image (standard rasterization)
    • Depth map (projected distance)
    • Optical flow (temporal displacement of each splat)
      All three renderings are fully differentiable, allowing gradients to flow back to the Gaussian parameters.
  4. Self‑Supervised Loss – The rendered outputs are compared against the original input images and a photometric consistency term across time, yielding a combined loss that simultaneously optimizes geometry, motion, and pose. Because the same Gaussian set produces all modalities, improving one (e.g., depth) automatically regularizes the others (e.g., flow).
  5. Pose Estimation – Camera extrinsics are treated as learnable variables during training; the differentiable renderer back‑propagates pose errors, enabling the network to infer camera motion jointly with the scene dynamics.
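The multi‑modal loss in step 4 can be sketched as a weighted sum of per‑modality photometric terms. The function name and weights here are hypothetical, and real training would use differentiable tensors rather than NumPy arrays:

```python
import numpy as np


def multimodal_loss(rendered: dict, target: dict,
                    w_rgb: float = 1.0,
                    w_depth: float = 0.5,
                    w_flow: float = 0.5) -> float:
    """Weighted L1 photometric loss over the three rendered modalities.

    Because all three maps are rendered from one shared Gaussian cloud,
    the gradient of any single term also constrains the parameters that
    produce the other two (the coupling described in step 4).
    """
    def l1(a, b):
        return float(np.mean(np.abs(a - b)))

    return (w_rgb * l1(rendered["rgb"], target["rgb"])
            + w_depth * l1(rendered["depth"], target["depth"])
            + w_flow * l1(rendered["flow"], target["flow"]))


# A perfect prediction yields zero loss.
maps = {"rgb": np.ones((4, 4, 3)),
        "depth": np.ones((4, 4)),
        "flow": np.zeros((4, 4, 2))}
print(multimodal_loss(maps, maps))  # → 0.0
```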

The whole pipeline runs in a single forward pass at inference time, typically taking tens of milliseconds on a modern GPU.
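Step 3’s flow rendering, the temporal displacement of each splat, can be illustrated with a toy pinhole camera. The `project` helper, the focal length, and the constant‑velocity advection are simplifying assumptions, not the paper’s full differentiable rasterizer:

```python
import numpy as np


def project(points: np.ndarray, f: float = 500.0) -> np.ndarray:
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixels."""
    return f * points[:, :2] / points[:, 2:3]


def splat_flow(means: np.ndarray, velocities: np.ndarray,
               dt: float = 1.0) -> np.ndarray:
    """Per-splat optical flow: pixel displacement of each center over dt."""
    return project(means + dt * velocities) - project(means)


# One splat 5 units in front of the camera, moving 0.1 units along +x:
# its center shifts by 500 * 0.1 / 5 = 10 pixels horizontally.
means = np.array([[0.0, 0.0, 5.0]])
vels = np.array([[0.1, 0.0, 0.0]])
print(splat_flow(means, vels))  # → [[10.  0.]]
```

Since both projections are smooth functions of the splat parameters, the flow map stays differentiable, which is what lets gradients from a flow loss update positions and velocities.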

Results & Findings

| Metric | Prior Feed‑Forward (e.g., D‑NeRF) | UFO‑4D (ours) |
|---|---|---|
| 3‑D Geometry (Chamfer ↓) | 0.032 | 0.011 |
| Motion (EPE, px ↓) | 5.8 | 2.1 |
| Camera Pose (° ↓) | 3.4 | 1.1 |
  • Joint Accuracy: The unified loss yields a balanced improvement across all three tasks, rather than excelling at one and sacrificing the others.
  • Speed: No per‑scene optimization; inference runs at ~30 fps for 640×480 inputs, compared to minutes‑long optimization loops in classic NeRF‑style methods.
  • Generalization: Trained on a modest synthetic + real dataset, UFO‑4D still performs well on unseen indoor/outdoor scenes, thanks to the strong regularization from multi‑modal supervision.
  • 4‑D Interpolation: Rendering intermediate time steps produces smooth, artifact‑free motion blur and view synthesis, demonstrating the expressive power of the Gaussian splat representation.
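For reference, the Chamfer distance and endpoint error (EPE) used in the benchmarks are standard metrics; a minimal NumPy sketch of both (quadratic in the point counts, so only suitable for small sets) looks like this:

```python
import numpy as np


def chamfer(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # Pairwise squared distances via broadcasting: (N, M).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())


def epe(flow_pred: np.ndarray, flow_gt: np.ndarray) -> float:
    """Mean endpoint error between (H, W, 2) flow fields, in pixels."""
    return float(np.linalg.norm(flow_pred - flow_gt, axis=-1).mean())


pts = np.zeros((3, 3))
print(chamfer(pts, pts))  # → 0.0
```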

Practical Implications

  • Rapid Prototyping for AR/VR – Developers can capture a scene with just two handheld photos and instantly obtain a navigable, animated 3‑D model for immersive experiences.
  • Robotics & Autonomous Navigation – Real‑time dense mapping of dynamic environments (e.g., moving people or vehicles) becomes feasible without expensive SLAM pipelines.
  • Content Creation – Film and game studios can generate low‑cost 4‑D assets for background plates or quick mock‑ups, cutting down on manual rigging.
  • Surveillance & Forensics – Quick reconstruction of a scene’s geometry and motion from a pair of security camera frames can aid incident analysis.
  • Edge Deployment – Because the model is feed‑forward and lightweight, it can run on modern mobile GPUs or edge AI accelerators, enabling on‑device 4‑D capture.

Limitations & Future Work

  • Scene Scale & Complexity – Extremely large or highly cluttered scenes still challenge the fixed‑size Gaussian cloud; scaling the number of splats or using hierarchical representations is an open direction.
  • Texture Fidelity – While geometry and motion are accurate, fine‑grained texture details can be blurry compared to optimization‑based NeRFs.
  • Assumption of Rigid Camera Motion – The current pose estimator works best when the two views are captured with a smooth, mostly rigid motion; rapid handheld shake may degrade results.
  • Training Data – Although self‑supervised, the model benefits from a curated mix of synthetic and real sequences; expanding to fully unsupervised, in‑the‑wild data remains future work.

Overall, UFO‑4D demonstrates that dense, dynamic 3‑D reconstruction can be both fast and accurate using a single unified network—a promising step toward making 4‑D perception a practical tool for developers across many domains.

Authors

  • Junhwa Hur
  • Charles Herrmann
  • Songyou Peng
  • Philipp Henzler
  • Zeyu Ma
  • Todd Zickler
  • Deqing Sun

Paper Information

  • arXiv ID: 2602.24290v1
  • Categories: cs.CV
  • Published: February 27, 2026
  • PDF: Download PDF