[Paper] WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos

Published: February 25, 2026
Source: arXiv - 2602.22209v1

Overview

The paper WHOLE: World‑Grounded Hand‑Object Lifted from Egocentric Videos tackles a long‑standing problem in computer vision: recovering accurate 3‑D hand and object motion from first‑person (egocentric) video. By learning a joint generative model of hand‑object dynamics, the authors reconstruct both entities in a consistent world‑space coordinate system, even when the object leaves the field of view or becomes heavily occluded.

Key Contributions

  • Joint generative prior over hand‑object motion that captures realistic interaction dynamics, rather than treating hands and objects independently.
  • World‑space reconstruction from egocentric video, enabling a unified 6‑DoF pose for both hand and object relative to a global frame.
  • Observation‑guided sampling at test time: the pretrained prior is steered by video cues to produce trajectories that match the observed frames.
  • State‑of‑the‑art performance on benchmark datasets for hand motion, 6‑D object pose, and hand‑object relational accuracy.
  • Open‑source release of code, pretrained models, and a demo website, facilitating reproducibility and downstream research.

Methodology

  1. Data Representation – Each training example consists of an egocentric video clip, a known 3‑D mesh template of the manipulated object, and ground‑truth hand and object poses (obtained from motion‑capture rigs).
  2. Generative Prior Network – A conditional variational auto‑encoder (CVAE) learns to sample plausible hand‑object trajectories given a short motion context. The latent space encodes physical constraints (e.g., contact, collision avoidance) learned from real interactions.
  3. Observation Encoder – A lightweight CNN‑RNN pipeline extracts visual features (hand masks, object silhouettes, optical flow) from the video and produces a conditioning vector for the prior.
  4. Guided Sampling at Inference – Starting from the prior’s mean trajectory, the system iteratively refines the latent code using gradient‑based optimization so that the rendered hand‑object poses align with the observed video frames (e.g., matching 2‑D keypoints, silhouette overlap).
  5. World‑Space Alignment – Because the prior operates in a canonical world frame, the final output directly yields 6‑D object poses and MANO hand parameters in a global coordinate system, eliminating the need for post‑hoc registration.
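The guided-sampling step (4) can be illustrated with a toy sketch: start from the prior's mean latent code and descend on a reprojection loss so the decoded trajectory matches observed 2‑D keypoints, with a small penalty keeping the code near the prior. Everything here is a stand‑in, not the authors' architecture: the "decoder" is a random linear map instead of a CVAE, projection is a bare pinhole model, and gradients come from finite differences rather than autodiff.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: latent code size, hand joints, frames per clip.
LATENT_DIM, N_JOINTS, T = 8, 21, 16

# Stand-in for the pretrained prior's decoder: latent code -> 3-D hand
# trajectory of shape (T, N_JOINTS, 3). A fixed random linear map here.
W = rng.normal(size=(LATENT_DIM, T * N_JOINTS * 3)) * 0.1

def decode(z):
    return (z @ W).reshape(T, N_JOINTS, 3)

def project(points_3d):
    # Pinhole-style projection onto the image plane (camera offset on z
    # keeps the denominator away from zero); a stand-in for rendering.
    return points_3d[..., :2] / (points_3d[..., 2:3] + 3.0)

# Pretend these 2-D keypoints were detected in the video frames.
z_true = rng.normal(size=LATENT_DIM)
observed_2d = project(decode(z_true))

def loss(z):
    # Reprojection error plus a weak prior term pulling z toward the mean.
    reproj = project(decode(z))
    return np.mean((reproj - observed_2d) ** 2) + 1e-4 * np.sum(z ** 2)

# Observation-guided refinement: gradient descent from the prior mean.
z = np.zeros(LATENT_DIM)
lr, eps = 50.0, 1e-4  # step size tuned for this toy problem's scale
for step in range(200):
    # Central finite-difference gradient (autodiff in a real system).
    g = np.zeros_like(z)
    for i in range(LATENT_DIM):
        dz = np.zeros_like(z)
        dz[i] = eps
        g[i] = (loss(z + dz) - loss(z - dz)) / (2 * eps)
    z -= lr * g

print(f"loss at prior mean: {loss(np.zeros(LATENT_DIM)):.6f}")
print(f"loss after refinement: {loss(z):.6f}")
```

The same pattern extends naturally to the paper's richer cues (silhouette overlap, optical flow) by adding further terms to the loss; the prior term is what keeps the refined trajectory physically plausible rather than merely reprojection-consistent.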

Results & Findings

  • Hand Motion – WHOLE reduces mean per‑joint error by ~15 % compared to the best hand‑only baselines on the EPIC‑KITCHENS egocentric benchmark.
  • Object Pose – 6‑D object pose error drops from ≈12 cm / 15° (previous methods) to ≈7 cm / 9°, even when the object is fully occluded for up to 30 % of the clip.
  • Interaction Consistency – The joint reconstruction yields a 30 % improvement in hand‑object contact accuracy, meaning the predicted grasps align much better with the true contact points.
  • Ablation Studies – Removing the generative prior or the observation‑guided refinement each causes a steep performance decline, confirming that both components are essential.
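The summary does not define how contact accuracy is scored. A common formulation (an assumption here, not necessarily the paper's exact metric) labels each hand vertex as "in contact" when it lies within a distance threshold of the object surface, then measures how often predicted labels agree with ground truth:

```python
import numpy as np

def contact_labels(hand_verts, obj_verts, thresh=0.01):
    """Label each hand vertex as in-contact if it lies within `thresh`
    metres of the nearest object vertex."""
    # Pairwise distances between hand and object vertices: (H, O).
    d = np.linalg.norm(hand_verts[:, None, :] - obj_verts[None, :, :], axis=-1)
    return d.min(axis=1) < thresh

def contact_accuracy(pred_hand, gt_hand, obj_verts):
    """Fraction of hand vertices whose predicted contact label matches
    the label derived from the ground-truth hand pose."""
    pred = contact_labels(pred_hand, obj_verts)
    gt = contact_labels(gt_hand, obj_verts)
    return float(np.mean(pred == gt))

# Tiny synthetic check: a flat "object" patch and a hand hovering above it.
rng = np.random.default_rng(1)
obj = rng.uniform(0, 0.1, size=(200, 3))
obj[:, 2] = 0.0                                   # object on the z=0 plane
gt = rng.uniform(0, 0.1, size=(50, 3))
gt[:, 2] = rng.uniform(0, 0.02, size=50)          # some vertices near contact
pred = gt + rng.normal(scale=0.002, size=gt.shape)  # small prediction noise
print(f"contact accuracy: {contact_accuracy(pred, gt, obj):.3f}")
```

Under this kind of metric, a 30 % improvement means the predicted grasp touches the object where the real hand does, not merely that the joints are close in Euclidean terms.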

Practical Implications

  • AR/VR Interaction – Real‑time hand‑object tracking from head‑mounted cameras becomes feasible, enabling more immersive manipulation experiences without external sensors.
  • Robotics Imitation Learning – Robots can learn from human demonstration videos captured with cheap egocentric devices, as WHOLE provides reliable 3‑D trajectories for both the manipulator and the target object.
  • Activity Recognition & Analytics – Accurate world‑space reconstructions improve downstream tasks such as cooking assistance, assembly instructions, or workplace safety monitoring.
  • Content Creation – Game developers and VFX artists can automatically extract motion capture‑grade hand‑object data from first‑person footage, reducing the need for expensive studio rigs.

Limitations & Future Work

  • Template Dependency – WHOLE requires a known 3‑D mesh of the object; handling novel, unseen objects remains an open challenge.
  • Computational Cost – The guided sampling loop adds latency (≈200 ms per clip), still too high for low‑latency AR applications.
  • Generalization to Diverse Domains – The model is trained on kitchen‑type interactions; extending to outdoor or industrial settings may need additional data and domain‑specific priors.
  • Future Directions – The authors suggest integrating a learned object‑shape estimator to relax the template requirement, optimizing the inference pipeline for real‑time performance, and exploring multi‑person egocentric scenarios.

Authors

  • Yufei Ye
  • Jiaman Li
  • Ryan Rong
  • C. Karen Liu

Paper Information

  • arXiv ID: 2602.22209v1
  • Categories: cs.CV
  • Published: February 25, 2026