[Paper] Flow3r: Factored Flow Prediction for Scalable Visual Geometry Learning

Published: February 23, 2026 at 01:59 PM EST
5 min read

Source: arXiv - 2602.20157v1

Overview

The paper introduces Flow3r, a new framework that teaches computers to infer 3D shape and camera motion from ordinary, unlabeled video. By leveraging dense 2‑D pixel correspondences (optical flow) as a supervisory signal, Flow3r sidesteps the need for costly ground‑truth depth or pose annotations, making large‑scale learning feasible even for dynamic, real‑world scenes.

Key Contributions

  • Factored flow prediction: a novel design that splits flow estimation into a geometry latent (from the source frame) and a pose latent (from the target frame), forcing the network to learn both scene structure and camera motion jointly.
  • Scalable unsupervised training: demonstrates that dense optical flow—readily obtainable from off‑the‑shelf estimators—can replace expensive 3‑D supervision, enabling training on ~800 K unlabeled videos.
  • Unified handling of static and dynamic scenes: the factorization naturally extends to moving objects, allowing the same model to reconstruct both rigid backgrounds and non‑rigid foregrounds.
  • State‑of‑the‑art performance: achieves top results on eight benchmarks (including KITTI, ScanNet, and in‑the‑wild YouTube videos), with the biggest improvements on dynamic, in‑the‑wild data where labeled resources are scarce.
  • Plug‑and‑play compatibility: the factored flow module can be dropped into existing visual‑geometry pipelines (e.g., NeRF‑based or depth‑prediction networks) to boost their accuracy without redesigning the whole system.
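The plug‑and‑play idea above can be sketched in miniature: an existing pipeline's frame encoder supplies the geometry latent, and only a small pose branch and decoder are bolted on. The class name, layer sizes, and linear "layers" below are illustrative assumptions for exposition, not the paper's architecture:

```python
import numpy as np

H, W, D = 8, 8, 16  # tiny frame and latent sizes (assumptions)

def existing_depth_encoder(frame):
    """Stand-in for the feature encoder of an existing depth network."""
    return frame.ravel()[:D]  # any per-frame feature vector will do here

class FactoredFlowHead:
    """Hypothetical drop-in head: reuses the host pipeline's encoder for
    the source-frame geometry latent and adds only a pose branch + decoder."""
    def __init__(self, d_geo, d_pose, h, w, seed=0):
        rng = np.random.default_rng(seed)
        self.W_pose = rng.normal(0, 0.01, (d_pose, h * w))       # pose branch
        self.W_dec = rng.normal(0, 0.01, (2 * h * w, d_geo + d_pose))  # decoder
        self.h, self.w = h, w

    def __call__(self, geo_latent, target_frame):
        z_pose = self.W_pose @ target_frame.ravel()   # motion from target frame
        z = np.concatenate([geo_latent, z_pose])      # combine the two latents
        return (self.W_dec @ z).reshape(2, self.h, self.w)  # dense 2-D flow

head = FactoredFlowHead(d_geo=D, d_pose=4, h=H, w=W)
src = np.linspace(0, 1, H * W).reshape(H, W)
tgt = src + 0.1
flow = head(existing_depth_encoder(src), tgt)
print(flow.shape)  # (2, 8, 8)
```

Because the host network's encoder is reused unchanged, only the pose branch and decoder need training, which is what makes the upgrade path cheap.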

Methodology

  1. Input & Pre‑processing

    • Monocular video clips are fed frame‑by‑frame.
    • A conventional optical‑flow estimator (e.g., RAFT) provides dense 2‑D correspondences between consecutive frames; these flows act as soft supervision.
  2. Latent Factorization

    • The network encodes the source image into a geometry latent that captures scene depth, surface normals, and any static structure.
    • The target image is encoded into a pose latent that represents the relative camera motion (and, optionally, object motion).
  3. Flow Prediction Head

    • The two latents are combined in a lightweight decoder that predicts the optical flow from source to target.
    • The loss is simply the L1 distance between the predicted flow and the pre‑computed flow, encouraging the geometry latent to be consistent with the observed motion.
  4. Training Loop

    • The model is trained end‑to‑end on millions of frame pairs, alternating between geometry‑focused updates (e.g., depth regression) and pose‑focused updates (e.g., camera‑pose regression).
    • No ground‑truth depth, pose, or segmentation labels are required; the flow loss drives both components.
  5. Extension to Dynamics

    • For moving objects, an additional motion latent can be attached to the geometry latent of the foreground, allowing the flow decoder to explain non‑rigid motion without breaking the factorization principle.
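The five steps above can be condensed into a toy numpy sketch. All shapes, latent sizes, and the linear "encoders" are illustrative assumptions rather than the paper's architecture; the point is the factorization (geometry latent from the source frame, pose latent from the target frame) and the L1 loss against a precomputed flow field:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 8, 8              # tiny frames for illustration
D_GEO, D_POSE = 16, 4    # latent sizes (assumed)

# Toy linear "encoders": frame (H*W,) -> latent
W_geo = rng.normal(0, 0.01, (D_GEO, H * W))
W_pose = rng.normal(0, 0.01, (D_POSE, H * W))
# Decoder: [geometry ; pose] -> dense 2-D flow (2*H*W,)
W_dec = rng.normal(0, 0.01, (2 * H * W, D_GEO + D_POSE))

def predict_flow(src, tgt):
    """Factored prediction: structure from the source, motion from the target."""
    z_geo = W_geo @ src.ravel()          # geometry latent (source frame)
    z_pose = W_pose @ tgt.ravel()        # pose latent (target frame)
    z = np.concatenate([z_geo, z_pose])  # combine latents in the decoder input
    return (W_dec @ z).reshape(2, H, W)  # dense flow field

def flow_l1_loss(pred, target):
    """L1 distance to the precomputed flow (e.g. from an off-the-shelf estimator)."""
    return np.abs(pred - target).mean()

src, tgt = rng.normal(size=(H, W)), rng.normal(size=(H, W))
flow_gt = rng.normal(size=(2, H, W))  # stands in for an off-the-shelf flow estimate
pred = predict_flow(src, tgt)
loss = flow_l1_loss(pred, flow_gt)
print(pred.shape, float(loss))
```

Note how the supervision never touches depth or pose directly: gradients of the flow loss flow back through both latents, which is what forces the geometry and pose representations to become consistent with the observed motion.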

Results & Findings

  Benchmark                       Metric       Flow3r (this work)   Prior SOTA
  KITTI Depth                     Abs Rel ↓    0.082                0.098
  ScanNet Pose                    ATE (m) ↓    0.041                0.057
  YouTube‑Dynamic (in‑the‑wild)   F‑score ↑    0.71                 0.58
  DynamicObjects‑3D               IoU ↑        0.63                 0.51

  (↓ = lower is better, ↑ = higher is better)
  • Factored flow beats alternatives: Ablation studies show that a monolithic flow predictor (single latent) lags behind the factored version by ~8–12 % across all datasets.
  • Data scaling works: Performance improves roughly logarithmically with the amount of unlabeled video, confirming that the method benefits from massive web‑scale data.
  • Dynamic scenes: On videos with moving people or vehicles, Flow3r outperforms the next best method by a larger margin (up to 20 % relative gain), highlighting the strength of the pose‑geometry split.

Practical Implications

  • Reduced annotation cost: Companies can now train 3‑D reconstruction models on existing video libraries (e.g., dash‑cam footage, user‑generated content) without manual depth or pose labeling.
  • Better AR/VR pipelines: Real‑time scene‑understanding for AR headsets can be bootstrapped from monocular video streams, enabling on‑device mapping in dynamic indoor/outdoor environments.
  • Robotics & autonomous driving: Robots can learn to infer depth and ego‑motion from their own camera logs, continuously improving perception without costly lidar sweeps.
  • Content creation tools: 3‑D artists can generate coarse geometry from any video clip, accelerating asset creation for games or visual effects.
  • Plug‑in upgrade path: Existing NeRF or depth‑prediction frameworks can adopt the factored flow head to gain accuracy with minimal engineering effort.

Limitations & Future Work

  • Reliance on flow quality: The approach inherits errors from the upstream optical‑flow estimator; extremely fast motion or low‑texture regions can still produce noisy supervision.
  • Implicit handling of occlusions: While the factorization helps, occluded pixels are treated as outliers rather than explicitly modeled, limiting performance on highly occluded scenes.
  • Scalability of pose latent: For very long video sequences, the pose latent may need temporal smoothing or recurrent structures to avoid drift.
  • Future directions suggested by the authors include: integrating self‑supervised flow refinement, extending the factorization to multi‑object motion graphs, and exploring joint training of the flow estimator and geometry network for end‑to‑end optimality.
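The temporal smoothing mentioned for long sequences could be as simple as an exponential moving average over per‑frame pose latents. This is a generic sketch of that idea, not the authors' proposal; the smoothing factor is an assumed hyperparameter:

```python
import numpy as np

def smooth_pose_latents(latents, alpha=0.8):
    """Exponential moving average over a sequence of pose latents.
    alpha closer to 1 -> heavier smoothing (assumed hyperparameter)."""
    out, state = [], latents[0]
    for z in latents:
        state = alpha * state + (1 - alpha) * z  # blend new latent into state
        out.append(state)
    return np.stack(out)

seq = np.stack([np.full(4, float(t)) for t in range(5)])  # steadily drifting latents
sm = smooth_pose_latents(seq)
print(sm.shape)  # (5, 4)
```

A recurrent module would serve the same purpose with learned dynamics; the EMA simply illustrates why smoothing damps per‑frame jitter at the cost of lagging behind genuine motion.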

Authors

  • Zhongxiao Cong
  • Qitao Zhao
  • Minsik Jeon
  • Shubham Tulsiani

Paper Information

  • arXiv ID: 2602.20157v1
  • Categories: cs.CV
  • Published: February 23, 2026