[Paper] ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos
Source: arXiv - 2601.05237v1
Overview
ObjectForesight tackles a surprisingly human‑like skill: predicting how objects will move just by watching a short first‑person video. Instead of learning dynamics in raw pixels or abstract latent vectors, the authors build an explicit 3D, object‑centric model that forecasts the full 6‑DoF (position + orientation) trajectory of rigid items. By scaling up with automatically generated 3D annotations, they demonstrate that a system can learn physically plausible motion directly from visual observation—opening doors for more anticipatory AR/VR, robotics, and simulation tools.
Key Contributions
- Object‑centric 3D dynamics model that predicts future 6‑DoF poses of rigid objects from brief egocentric clips.
- Large‑scale pseudo‑labeled dataset: ≈2 M short video clips with automatically reconstructed meshes, segmentations, and 3D trajectories, created by chaining state‑of‑the‑art perception modules.
- Geometrically grounded predictions: the model respects object shape, size, and affordances, yielding temporally coherent motion that aligns with real‑world physics.
- Strong generalization: evaluated on unseen objects and scenes, ObjectForesight outperforms pixel‑based baselines in accuracy, consistency, and robustness.
- Open‑source code & demo (objectforesight.github.io) to foster reproducibility and downstream research.
Methodology
1. Data Pipeline
- Start with egocentric video clips (≈2 s).
- Apply off‑the‑shelf segmentation (e.g., Mask R‑CNN), mesh reconstruction (Neural Radiance Fields or ShapeNet‑style methods), and 6‑DoF pose estimation to obtain a pseudo‑ground‑truth 3D scene representation for each frame.
- This automated pipeline yields millions of training examples without manual annotation.
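A minimal sketch of how such a pseudo‑labeling pipeline could be wired together is shown below. The perception modules are passed in as plain callables, and the interfaces here (the `ObjectLabel` container, the array shapes, and the `pseudo_label_clip` function) are illustrative assumptions rather than the paper's actual code.

```python
# Hypothetical pseudo-labeling pipeline: chain segmentation, mesh
# reconstruction, and pose estimation over every frame of a short clip.
from dataclasses import dataclass
from typing import Callable, List, Tuple
import numpy as np

@dataclass
class ObjectLabel:
    mask: np.ndarray           # (H, W) binary segmentation mask
    mesh_vertices: np.ndarray  # (V, 3) reconstructed mesh vertices
    mesh_faces: np.ndarray     # (F, 3) triangle indices into the vertices
    pose: np.ndarray           # (4, 4) object-to-camera 6-DoF pose

def pseudo_label_clip(
    frames: List[np.ndarray],
    segment: Callable[[np.ndarray], List[np.ndarray]],  # e.g. a Mask R-CNN wrapper
    reconstruct: Callable[[np.ndarray, np.ndarray], Tuple[np.ndarray, np.ndarray]],
    estimate_pose: Callable[[np.ndarray, np.ndarray], np.ndarray],
) -> List[List[ObjectLabel]]:
    """Run the chained perception modules on every frame of a ~2 s clip."""
    labels_per_frame = []
    for frame in frames:
        frame_labels = []
        for mask in segment(frame):
            vertices, faces = reconstruct(frame, mask)
            pose = estimate_pose(frame, mask)
            frame_labels.append(ObjectLabel(mask, vertices, faces, pose))
        labels_per_frame.append(frame_labels)
    return labels_per_frame
```

Keeping the modules as swappable callables reflects the paper's point that the pipeline is assembled from off‑the‑shelf components and requires no manual annotation.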
2. Object‑centric Representation
- Each detected object is encoded as a compact 3D descriptor: mesh geometry + current pose.
- The scene is represented as a set of such objects plus a coarse camera pose, preserving spatial relationships.
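As a rough illustration, the representation could be organized as below; the field names, the compact shape descriptor, and the quaternion convention are assumptions, not the paper's exact data format.

```python
# Hypothetical object-centric scene representation: a set of per-object
# descriptors plus a coarse camera pose.
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class ObjectState:
    shape_code: np.ndarray   # (D,) compact descriptor of the object's mesh geometry
    position: np.ndarray     # (3,) translation in the world (or camera) frame
    orientation: np.ndarray  # (4,) unit quaternion (w, x, y, z)

@dataclass
class SceneState:
    objects: List[ObjectState] = field(default_factory=list)
    camera_pose: np.ndarray = field(default_factory=lambda: np.eye(4))  # coarse 4x4 camera pose
```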
3. Dynamics Network
- A transformer‑style sequence model ingests the past 3D object states (positions, orientations, velocities) and learns to predict the next Δ‑pose for each object.
- The network is trained with a combination of pose regression loss, geometric consistency loss (ensuring predicted meshes stay collision‑free), and a temporal smoothness term.
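One way to realize such a model and loss in PyTorch is sketched below; the 13‑dimensional per‑object state (position, quaternion, velocities), the layer sizes, the loss weights, and the precomputed `penetration_depth` term are all assumptions for illustration, not the paper's architecture.

```python
# Hypothetical transformer-style dynamics model and combined training loss.
import torch
import torch.nn as nn

class ObjectDynamicsModel(nn.Module):
    def __init__(self, state_dim: int = 13, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        # state_dim = 3 (position) + 4 (quaternion) + 6 (linear/angular velocity)
        self.embed = nn.Linear(state_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.delta_head = nn.Linear(d_model, 7)  # Δ-position (3) + Δ-rotation quaternion (4)

    def forward(self, past_states: torch.Tensor) -> torch.Tensor:
        # past_states: (batch, time, state_dim) for one object track
        h = self.encoder(self.embed(past_states))
        return self.delta_head(h[:, -1])  # predict the next Δ-pose from the last token

def dynamics_loss(pred_delta, gt_delta, penetration_depth, pred_sequence,
                  w_geom: float = 1.0, w_smooth: float = 0.1) -> torch.Tensor:
    """Pose regression + geometric consistency + temporal smoothness (weights assumed)."""
    pose_loss = nn.functional.l1_loss(pred_delta, gt_delta)
    # Penalize predicted interpenetration between object meshes (depths computed upstream).
    geom_loss = penetration_depth.clamp(min=0).mean()
    # Discourage jitter between consecutive predicted poses over the forecast horizon.
    smooth_loss = (pred_sequence[:, 1:] - pred_sequence[:, :-1]).abs().mean()
    return pose_loss + w_geom * geom_loss + w_smooth * smooth_loss
```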
4. Prediction & Rendering
- At inference, given a new egocentric clip, the model outputs a sequence of future 6‑DoF poses.
- These can be rendered back into the video frame or fed to downstream modules (e.g., robot planners).
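The rollout could look roughly like the sketch below, which builds on the hypothetical model above: each predicted Δ‑rotation is composed with the last quaternion and velocities are carried forward unchanged, a simplification of whatever scheme the paper actually uses.

```python
# Hypothetical autoregressive rollout of future 6-DoF poses.
import torch

def quat_multiply(q1: torch.Tensor, q2: torch.Tensor) -> torch.Tensor:
    """Hamilton product of (..., 4) quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = q1.unbind(-1)
    w2, x2, y2, z2 = q2.unbind(-1)
    return torch.stack([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ], dim=-1)

@torch.no_grad()
def rollout(model, past_states: torch.Tensor, horizon: int) -> torch.Tensor:
    """past_states: (1, T, 13) = position(3) + quaternion(4) + velocities(6).
    Returns (horizon, 7) future poses as position + quaternion."""
    states = past_states.clone()
    future = []
    for _ in range(horizon):
        delta = model(states)                                   # (1, 7) Δ-pose
        last = states[:, -1]
        next_pos = last[:, :3] + delta[:, :3]
        next_quat = quat_multiply(delta[:, 3:7], last[:, 3:7])
        next_quat = next_quat / next_quat.norm(dim=-1, keepdim=True)
        # Carry the previous velocities forward unchanged (simplifying assumption).
        next_state = torch.cat([next_pos, next_quat, last[:, 7:]], dim=-1)
        states = torch.cat([states, next_state.unsqueeze(1)], dim=1)
        future.append(torch.cat([next_pos, next_quat], dim=-1))
    return torch.cat(future, dim=0)
```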
Results & Findings
| Metric | Baseline (pixel‑CNN) | ObjectForesight |
|---|---|---|
| Pose MAE (cm) | 7.4 | 3.1 |
| Orientation MAE (deg) | 22.5 | 9.8 |
| Geometric Consistency (IoU) | 0.61 | 0.84 |
| Zero‑shot generalization (unseen objects) | 0.48 | 0.73 |
- Accuracy: The model reduces pose error by >50 % compared to strong pixel‑based dynamics baselines.
- Physical plausibility: Predicted trajectories respect object size and avoid interpenetration, thanks to the geometry‑aware loss.
- Scalability: Training on the 2 M‑clip corpus converges in ~48 h on 8 × A100 GPUs, showing the pipeline is practical for industry‑scale data.
- Ablation: Removing the mesh encoder or the consistency loss degrades performance dramatically, confirming the importance of explicit 3D reasoning.
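For reference, the two headline metrics in the table can be computed roughly as follows; the paper's exact definitions (averaging scheme, rotation parameterization) may differ from this sketch.

```python
# One common way to compute pose MAE (cm) and orientation MAE (deg).
import numpy as np

def pose_mae_cm(pred_pos_m: np.ndarray, gt_pos_m: np.ndarray) -> float:
    """Mean Euclidean position error, converted from meters to centimeters."""
    return float(np.linalg.norm(pred_pos_m - gt_pos_m, axis=-1).mean() * 100.0)

def orientation_mae_deg(pred_R: np.ndarray, gt_R: np.ndarray) -> float:
    """Mean geodesic angle between predicted and ground-truth (..., 3, 3) rotations."""
    rel = np.einsum('...ij,...kj->...ik', pred_R, gt_R)          # pred @ gt^T
    trace = rel[..., 0, 0] + rel[..., 1, 1] + rel[..., 2, 2]
    cos_theta = np.clip((trace - 1.0) / 2.0, -1.0, 1.0)          # rotation angle from trace
    return float(np.degrees(np.arccos(cos_theta)).mean())
```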
Practical Implications
- Robotics & Manipulation: A robot equipped with an egocentric camera can anticipate how a tool or object will move before it interacts, enabling safer, more fluid hand‑over or collaborative tasks.
- AR/VR Interaction: Predictive object motion can drive realistic physics simulations in head‑mounted displays, reducing latency by pre‑computing plausible future states.
- Video Understanding & Editing: Content creators could auto‑generate “what‑if” scenarios (e.g., a ball rolling further) without manual key‑framing.
- Autonomous Driving: Though focused on egocentric hand‑held videos, the object‑centric paradigm can be adapted to predict pedestrian‑car interactions from dashcam footage.
- Simulation‑to‑Reality Transfer: Because predictions are grounded in actual 3D geometry, synthetic training environments can be more easily aligned with real‑world data.
Limitations & Future Work
- Rigid‑body assumption: The current model only handles non‑deformable objects; extending to articulated or soft bodies (e.g., cloth, human hands) remains open.
- Reliance on upstream perception: Errors in segmentation or pose estimation propagate to the dynamics model; improving robustness to noisy inputs is a priority.
- Short‑term horizon: Predictions are reliable up to ~2 seconds; longer horizons may require hierarchical planning or physics simulators.
- Domain bias: Training data is heavily egocentric (first‑person) and indoor; future work will explore outdoor scenes and multi‑camera setups.
ObjectForesight demonstrates that with the right blend of perception pipelines and object‑level dynamics, machines can start to “imagine” the near future of the world they see—an exciting step toward more anticipatory AI systems.
Authors
- Rustin Soraki
- Homanga Bharadhwaj
- Ali Farhadi
- Roozbeh Mottaghi
Paper Information
- arXiv ID: 2601.05237v1
- Categories: cs.CV
- Published: January 8, 2026