[Paper] WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos

Published: February 26, 2026, 03:59 AM GMT+9

Source: arXiv - 2602.22209v1

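Overview

The paper WHOLE: World‑Grounded Hand‑Object Lifted from Egocentric Videos tackles a long-standing problem in computer vision: recovering accurate 3‑D hand and object motion from first-person (egocentric) video streams. By learning a joint generative model of hand‑object dynamics, the authors can reconstruct both entities in a consistent world coordinate frame, even when the object leaves the field of view or is heavily occluded.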

Key Contributions

Joint generative prior over hand‑object motion that captures realistic interaction dynamics, rather than treating hands and objects independently.
World‑space reconstruction from egocentric video, enabling a unified 6‑DoF pose for both hand and object relative to a global frame.
Observation‑guided sampling at test time: the pretrained prior is steered by video cues to produce trajectories that match the observed frames.
State‑of‑the‑art performance on benchmark datasets for hand motion, 6‑D object pose, and hand‑object relational accuracy.
Open‑source release of code, pretrained models, and a demo website, facilitating reproducibility and downstream research.
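To make the "world‑grounded" idea concrete, the sketch below maps a point expressed in the moving camera frame into a fixed world frame with a rigid camera-to-world transform. It is an illustration under assumptions, not the authors' code: the function name `camera_to_world`, the row-major rotation layout, and the example camera pose are all hypothetical, and this summary does not say how WHOLE obtains camera poses.

```python
def camera_to_world(point_cam, rotation_cw, translation_cw):
    """Apply p_world = R_cw @ p_cam + t_cw using plain Python lists."""
    return [
        sum(rotation_cw[i][j] * point_cam[j] for j in range(3)) + translation_cw[i]
        for i in range(3)
    ]

# Hypothetical camera pose: rotated 90 degrees about the vertical (y) axis
# and positioned 1 m along the world x axis.
R_cw = [[0.0, 0.0, 1.0],
        [0.0, 1.0, 0.0],
        [-1.0, 0.0, 0.0]]
t_cw = [1.0, 0.0, 0.0]

# A fingertip 0.5 m in front of the camera (along the camera z axis).
fingertip_cam = [0.0, 0.0, 0.5]
fingertip_world = camera_to_world(fingertip_cam, R_cw, t_cw)
print(fingertip_world)
```

Because the camera moves with the wearer's head, the same camera-frame point maps to different world positions over time. Expressing hands and objects in one world frame is what keeps their trajectories comparable across frames.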

Methodology

Data Representation – Each training example consists of an egocentric video clip, a known 3‑D mesh template of the manipulated object, and ground‑truth hand and object poses (obtained from motion‑capture rigs).
Generative Prior Network – A conditional variational auto‑encoder (CVAE) learns to sample plausible hand‑object trajectories given a short motion context. The latent space encodes physical constraints (e.g., contact, collision avoidance) learned from real interactions.
Observation Encoder – A lightweight CNN‑RNN pipeline extracts visual features (hand masks, object silhouettes, optical flow) from the video and produces a conditioning vector for the prior.
Guided Sampling at Inference – Starting from the prior’s mean trajectory, the system iteratively refines the latent code using gradient‑based optimization so that the rendered hand‑object poses align with the observed video frames (e.g., matching 2‑D keypoints, silhouette overlap).
World‑Space Alignment – Because the prior operates in a canonical world frame, the final output directly yields 6‑D object poses and MANO hand parameters in a global coordinate system, eliminating the need for post‑hoc registration.
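The guided-sampling step can be sketched as a small optimization loop. The example below is a toy under loud assumptions, not the paper's implementation: the "decoder" simply shifts a keypoint template by a 3‑D latent offset (standing in for the learned prior), gradients come from finite differences rather than automatic differentiation, and the backtracking step-size rule and the `prior_weight` term are illustrative choices. All names (`decode`, `project`, `refine_latent`) are hypothetical.

```python
import math

def decode(z, template):
    """Toy stand-in for the prior's decoder: shift template keypoints by z."""
    return [(x + z[0], y + z[1], d + z[2]) for (x, y, d) in template]

def project(points, focal=1.0):
    """Pinhole projection of 3-D points (x, y, depth) to 2-D keypoints."""
    return [(focal * x / d, focal * y / d) for (x, y, d) in points]

def loss(z, template, observed, prior_weight=1e-3):
    """2-D keypoint error plus a penalty for drifting from the prior mean (z = 0)."""
    points = decode(z, template)
    if any(d <= 1e-6 for (_, _, d) in points):
        return math.inf  # points at or behind the camera are invalid
    data = sum((px - ox) ** 2 + (py - oy) ** 2
               for (px, py), (ox, oy) in zip(project(points), observed))
    return data + prior_weight * sum(v * v for v in z)

def numerical_grad(f, z, eps=1e-6):
    """Central finite-difference gradient of f at z."""
    grad = []
    for i in range(len(z)):
        zp, zm = list(z), list(z)
        zp[i] += eps
        zm[i] -= eps
        grad.append((f(zp) - f(zm)) / (2 * eps))
    return grad

def refine_latent(template, observed, steps=200, step_size=1.0):
    """Start at the prior mean and descend the loss with backtracking steps."""
    z = [0.0, 0.0, 0.0]
    f = lambda v: loss(v, template, observed)
    current = f(z)
    for _ in range(steps):
        g = numerical_grad(f, z)
        lr = step_size
        while lr > 1e-8:  # halve the step until the loss actually decreases
            cand = [zi - lr * gi for zi, gi in zip(z, g)]
            val = f(cand)
            if val < current:
                z, current = cand, val
                break
            lr *= 0.5
    return z, current

# Synthetic observation generated from a hidden "true" latent offset.
template = [(-0.1, 0.0, 2.0), (0.1, 0.0, 2.0), (0.0, 0.1, 2.2)]
observed = project(decode([0.2, -0.1, 0.3], template))

initial = loss([0.0, 0.0, 0.0], template, observed)
z_refined, final = refine_latent(template, observed)
print(initial, final, z_refined)
```

The accept-only-if-better rule means the loss never increases between iterations, which keeps the toy stable. A real system would replace the template shift with the learned decoder and add the silhouette and contact terms described above.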

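Results and Findings

Hand Motion – WHOLE reduces mean per-joint error by ~15 % relative to the best hand-only baseline on the EPIC‑KITCHENS egocentric benchmark.
Object Pose – 6‑D object pose error drops from ≈12 cm / 15° (prior methods) to ≈7 cm / 9°, even when the object is fully occluded for up to 30 % of the clip.
Interaction Consistency – Joint reconstruction improves hand‑object contact accuracy by 30 %, meaning predicted grasps align much more closely with the true contact points.
Ablation Studies – Removing either the generative prior or the observation‑guided refinement causes a substantial drop in performance, confirming that both components are essential.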
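The figures above pair a translation error with a rotation error and also cite a per-joint error. This summary does not spell out the exact metric definitions, so the sketch below assumes common conventions: Euclidean distance for translation, the geodesic angle between rotation matrices for orientation, and mean per-joint position error (MPJPE) for hands. The function names are hypothetical.

```python
import math

def translation_error_cm(t_pred, t_gt):
    """Euclidean distance between two positions in meters, returned in cm."""
    return 100.0 * math.dist(t_pred, t_gt)

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    # trace(R_pred^T @ R_gt) equals the sum of element-wise products.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against rounding before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def mpjpe(joints_pred, joints_gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    total = sum(math.dist(p, g) for p, g in zip(joints_pred, joints_gt))
    return total / len(joints_gt)

# Example: identity vs. a 90-degree rotation about the z axis.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
rot_z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
print(rotation_error_deg(identity, rot_z_90))
print(translation_error_cm([0.0, 0.0, 0.0], [0.03, 0.04, 0.0]))
```

Reporting translation and rotation separately avoids having to pick a weighting between centimeters and degrees.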

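Practical Implications

AR/VR Interaction – Opens the door to real-time hand‑object tracking from head‑mounted cameras, offering more immersive manipulation experiences without external sensors.
Robotics Imitation Learning – Robots can learn from human demonstrations recorded with low-cost egocentric devices, with WHOLE supplying reliable 3‑D trajectories for both the demonstrator's hands and the manipulated object.
Activity Recognition & Analytics – Accurate world‑space reconstruction improves downstream tasks such as cooking assistance, assembly guidance, and workplace safety monitoring.
Content Creation – Game developers and VFX artists can automatically extract motion‑capture‑grade hand‑object data from first-person footage, reducing the need for expensive studio equipment.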

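Limitations and Future Work

Template Dependence – WHOLE requires a known 3‑D mesh of the object; handling novel, unseen objects remains an open problem.
Computational Cost – The guided sampling loop adds latency (≈200 ms per clip), which is still too high for low-latency AR applications.
Generalization Across Domains – The model was trained on kitchen-style interactions; extending it to outdoor or industrial settings may require additional data and domain‑specific priors.
Future Directions – The authors propose integrating a learned object‑shape estimator to relax the template requirement, optimizing the inference pipeline for real-time performance, and exploring multi-person egocentric scenarios.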

Authors

Yufei Ye
Jiaman Li
Ryan Rong
C. Karen Liu

Paper Information

arXiv ID: 2602.22209v1
Category: cs.CV
Published: February 25, 2026
