[Paper] Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

Published: February 20, 2026, 1:45 PM EST
4 min read
Source: arXiv – 2602.18422v1

Overview

The paper presents Generated Reality, a new approach for creating interactive, egocentric video worlds that react to a user's head pose and detailed hand movements. By coupling 3D motion tracking with a diffusion-based video generator, the authors enable realistic hand–object interactions in XR experiences, something current text- or keyboard-controlled video models cannot do.

Key Contributions

  • Human‑centric conditioning: Introduces a novel way to feed 3‑D head pose and joint‑level hand pose data into a diffusion transformer, allowing fine‑grained control of the generated scene.
  • Bidirectional teacher model: Trains a powerful, non‑causal video diffusion model that can look both forward and backward in time to learn high‑quality dynamics.
  • Causal distillation pipeline: Distills the bidirectional teacher into a fast, causal (real‑time) model suitable for interactive XR applications.
  • Empirical validation: Conducts user studies showing that participants complete tasks faster and feel more in control compared with baseline video generators.
  • Open‑source implementation: Releases code, pretrained weights, and a demo pipeline for the community to build upon.

Methodology

  1. Data Representation

    • Head pose: 6‑DoF (position + orientation) captured from an XR headset.
    • Hand pose: 21‑joint skeletal data from hand‑tracking cameras or gloves.
    • Both are encoded as continuous vectors and concatenated with a temporal token for each video frame.
  2. Conditioning Strategy

    • The authors evaluate several diffusion transformer conditioning schemes (cross‑attention, concatenation, FiLM) and find that cross‑attention with learned pose embeddings yields the most stable control.
  3. Teacher Model (Bidirectional Diffusion)

    • A video diffusion transformer is trained on a large egocentric video dataset (e.g., EPIC‑Kitchens) where each frame is paired with the corresponding pose data.
    • The model predicts noise both forward and backward in time, giving it a global view of motion dynamics.
  4. Distillation to a Causal Student

    • Using knowledge distillation, the teacher’s predictions are transferred to a causal diffusion model that only sees past frames, enabling real‑time generation as new pose inputs arrive.
  5. Interactive Loop

    • At runtime, the XR system streams live head/hand poses to the causal model, which instantly generates the next video frame, creating a seamless loop of perception‑action‑generation.

Results & Findings

| Metric | Generated Reality | Text-only Baseline | Keyboard-control Baseline |
| --- | --- | --- | --- |
| Task completion time (s) | 4.2 ± 0.8 | 6.7 ± 1.1 | 5.9 ± 0.9 |
| Subjective control rating (1–5) | 4.3 | 2.7 | 3.1 |
| Visual fidelity (SSIM) | 0.78 | 0.71 | 0.73 |
| Latency per frame | 28 ms | 22 ms | 24 ms |
  • Participants could pick up, rotate, and place virtual objects using only their hands, achieving ~30 % faster task performance than baselines.
  • Survey responses indicated a significantly higher sense of agency, confirming that the fine‑grained pose conditioning feels natural.
  • Visual quality remained high despite the causal constraint, thanks to the teacher‑student distillation.
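The "~30% faster" figure can be sanity-checked against the mean task-completion times in the table above (a quick back-of-the-envelope calculation, not from the paper):

```python
# Relative speedup of Generated Reality over each baseline,
# computed from the mean task-completion times reported in the table.
gr, text, keyboard = 4.2, 6.7, 5.9

speedup_vs_text = (text - gr) / text            # fraction faster than text-only
speedup_vs_keyboard = (keyboard - gr) / keyboard  # fraction faster than keyboard

print(f"{speedup_vs_text:.1%} vs text, {speedup_vs_keyboard:.1%} vs keyboard")
```

This gives roughly 37% versus the text-only baseline and 29% versus the keyboard baseline, consistent with the paper's "~30%" summary.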

Practical Implications

  • XR Development: Game engines and AR platforms can plug in the causal model to generate responsive environments without hand‑crafting every asset, dramatically cutting content creation time.
  • Remote Collaboration: Telepresence systems can render a shared virtual workspace that mirrors each participant’s hand motions, enabling realistic object manipulation over bandwidth‑limited links.
  • Training Simulations: Industries such as manufacturing or surgery can build immersive simulators where trainees receive immediate visual feedback tied to their exact hand posture, improving skill transfer.
  • Assistive Tech: For users with limited mobility, the model could translate subtle head or hand gestures into rich visual cues, expanding accessibility in mixed‑reality interfaces.

Limitations & Future Work

  • Dataset Bias: The model is trained on kitchen‑style egocentric videos, so performance may degrade in domains with drastically different object geometries (e.g., outdoor scenes).
  • Hardware Requirements: Real‑time inference still needs a modern GPU; scaling to mobile XR headsets will require further model compression.
  • Long‑Term Consistency: While short interactions are stable, maintaining coherent object states over extended sequences remains challenging.
  • Future Directions: The authors plan to explore multi‑modal conditioning (audio, haptics), larger and more diverse video corpora, and lightweight architectures for on‑device deployment.

Authors

  • Linxi Xie
  • Lisong C. Sun
  • Ashley Neall
  • Tong Wu
  • Shengqu Cai
  • Gordon Wetzstein

Paper Information

  • arXiv ID: 2602.18422v1
  • Categories: cs.CV
  • Published: February 20, 2026