[Paper] Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control
Source: arXiv - 2602.18422v1
Overview
The paper presents Generated Reality, an approach for creating interactive, egocentric video worlds that respond to a user's head orientation and fine-grained hand movements. By coupling 3D motion tracking with a diffusion-based video generator, the authors enable realistic hand-object interactions in XR experiences, a capability that current text- or keyboard-controlled video models lack.
Key Contributions
- Human-centric conditioning: Introduces a novel way to feed 3D head pose and joint-level hand pose data into a diffusion transformer, allowing fine-grained control of the generated scene.
- Bidirectional teacher model: Trains a powerful, non‑causal video diffusion model that can look both forward and backward in time to learn high‑quality dynamics.
- Causal distillation pipeline: Distills the bidirectional teacher into a fast, causal (real‑time) model suitable for interactive XR applications.
- Empirical validation: Conducts user studies showing that participants complete tasks faster and feel more in control compared with baseline video generators.
- Open‑source implementation: Releases code, pretrained weights, and a demo pipeline for the community to build upon.
Methodology
Data Representation
- Head pose: 6‑DoF (position + orientation) captured from an XR headset.
- Hand pose: 21‑joint skeletal data from hand‑tracking cameras or gloves.
- Both are encoded as continuous vectors and concatenated with a temporal token for each video frame.
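The per-frame conditioning described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function name, the single-hand assumption, and the sinusoidal temporal token are all assumptions made here for brevity.

```python
import math

def encode_frame_condition(head_pose, hand_joints, frame_idx, num_frames):
    """Flatten per-frame pose data into one conditioning vector.

    head_pose:   6 floats (x, y, z, roll, pitch, yaw) from the headset
    hand_joints: 21 (x, y, z) joint positions from hand tracking
    frame_idx:   temporal position, encoded as a simple sinusoidal token
    """
    assert len(head_pose) == 6
    assert len(hand_joints) == 21
    flat_hand = [c for joint in hand_joints for c in joint]  # 63 values
    t = frame_idx / max(num_frames - 1, 1)
    temporal_token = [math.sin(math.pi * t), math.cos(math.pi * t)]
    # 6 (head) + 63 (hand) + 2 (temporal) = 71-dim vector for this frame
    return list(head_pose) + flat_hand + temporal_token
```

In a real system this vector would be projected by a learned embedding layer before entering the transformer; here it only shows how the continuous pose streams and the temporal token are concatenated per frame.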
Conditioning Strategy
- The authors evaluate several diffusion transformer conditioning schemes (cross‑attention, concatenation, FiLM) and find that cross‑attention with learned pose embeddings yields the most stable control.
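To make the winning scheme concrete, here is a toy single-head cross-attention step in pure Python: video tokens act as queries, pose embeddings as keys/values. The identity projections, residual add, and equal token dimensions are simplifying assumptions, not details from the paper.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(video_tokens, pose_tokens):
    """Each video token attends over the pose tokens (single head,
    identity Q/K/V projections for brevity, residual connection)."""
    d = len(pose_tokens[0])
    out = []
    for q in video_tokens:
        # scaled dot-product scores against every pose token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in pose_tokens]
        weights = softmax(scores)
        # weighted mixture of pose tokens (values)
        mixed = [sum(w * v[j] for w, v in zip(weights, pose_tokens))
                 for j in range(d)]
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out
```

In the actual model the pose embeddings would be learned, and this block would sit inside each transformer layer; the sketch only shows why cross-attention gives every video token direct access to the pose signal.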
Teacher Model (Bidirectional Diffusion)
- A video diffusion transformer is trained on a large egocentric video dataset (e.g., EPIC‑Kitchens) where each frame is paired with the corresponding pose data.
- The model predicts noise both forward and backward in time, giving it a global view of motion dynamics.
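One teacher training step can be sketched under a standard noise-prediction (epsilon-matching) diffusion objective, which is an assumption here; the paper's exact loss may differ. `noise_clip`, the passed-in `model`, and `alpha_bar_t` (the cumulative noise-schedule product at timestep t) are illustrative names. The key point is that the teacher receives the whole noisy clip at once, so its prediction for any frame may use both past and future frames.

```python
import math

def noise_clip(frames, eps, alpha_bar_t):
    """Forward process q(x_t | x_0): sqrt(ab)*x0 + sqrt(1-ab)*eps, frame-wise."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [[a * x + b * e for x, e in zip(f, ef)]
            for f, ef in zip(frames, eps)]

def teacher_loss(frames, eps, model, alpha_bar_t):
    """MSE between the model's noise estimate and the true noise eps."""
    noisy = noise_clip(frames, eps, alpha_bar_t)
    pred = model(noisy, alpha_bar_t)  # bidirectional: full clip as input
    n = sum(len(f) for f in frames)
    return sum((p - e) ** 2
               for pf, ef in zip(pred, eps)
               for p, e in zip(pf, ef)) / n
```

A perfect model would recover `eps` exactly and drive this loss to zero; training minimizes it over random clips, noise draws, and timesteps.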
Distillation to a Causal Student
- Using knowledge distillation, the teacher’s predictions are transferred to a causal diffusion model that only sees past frames, enabling real‑time generation as new pose inputs arrive.
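The causality constraint can be expressed as one of several possible distillation losses; the sketch below is an assumption about the general shape, not the paper's exact objective. The student's prediction for frame i is computed from the prefix `noisy_clip[:i+1]` only, and is regressed onto the bidirectional teacher's per-frame estimate.

```python
def distill_loss(noisy_clip, teacher, student, alpha_bar_t):
    """Causal student: its prediction for frame i may only use frames
    [0..i]. Train it to match the bidirectional teacher's estimates."""
    target = teacher(noisy_clip, alpha_bar_t)          # sees all frames
    pred = [student(noisy_clip[: i + 1], alpha_bar_t)  # past-only context
            for i in range(len(noisy_clip))]
    n = sum(len(f) for f in noisy_clip)
    return sum((p - q) ** 2
               for pf, tf in zip(pred, target)
               for p, q in zip(pf, tf)) / n
```

Because the targets come from a model with a global view of the motion, the student inherits long-range dynamics it could not easily learn from past frames alone.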
Interactive Loop
- At runtime, the XR system streams live head/hand poses to the causal model, which instantly generates the next video frame, creating a seamless loop of perception‑action‑generation.
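The perception-action-generation loop reduces to a short driver, sketched here with hypothetical names (`pose_stream`, `generate_frame`, `context_len` are assumptions): each incoming head/hand pose conditions the causal model on a sliding window of recently generated frames.

```python
def interactive_loop(pose_stream, generate_frame, context_len=16):
    """Stream live poses into the causal model, one frame per pose.

    pose_stream:    iterable of per-frame pose conditioning vectors
    generate_frame: causal model call (recent frames, pose) -> next frame
    context_len:    how many past frames the model may attend to
    """
    history = []
    for pose in pose_stream:
        frame = generate_frame(history[-context_len:], pose)
        history.append(frame)
        yield frame  # hand off to the XR compositor as soon as it is ready
```

The sliding window bounds per-frame compute, which is what makes the reported ~28 ms per-frame latency plausible for sustained interaction.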
Results & Findings
| Metric | Generated Reality | Text‑only Baseline | Keyboard‑control Baseline |
|---|---|---|---|
| Task completion time (seconds) | 4.2 ± 0.8 | 6.7 ± 1.1 | 5.9 ± 0.9 |
| Subjective control rating (1‑5) | 4.3 | 2.7 | 3.1 |
| Visual fidelity (SSIM) | 0.78 | 0.71 | 0.73 |
| Latency per frame | 28 ms | 22 ms | 24 ms |
- Participants could pick up, rotate, and place virtual objects using only their hands, completing tasks ~29% faster than the keyboard-control baseline and ~37% faster than the text-only baseline.
- Survey responses indicated a significantly higher sense of agency, confirming that the fine‑grained pose conditioning feels natural.
- Visual quality remained high despite the causal constraint, thanks to the teacher‑student distillation.
Practical Implications
- XR Development: Game engines and AR platforms can plug in the causal model to generate responsive environments without hand‑crafting every asset, dramatically cutting content creation time.
- Remote Collaboration: Telepresence systems can render a shared virtual workspace that mirrors each participant’s hand motions, enabling realistic object manipulation over bandwidth‑limited links.
- Training Simulations: Industries such as manufacturing or surgery can build immersive simulators where trainees receive immediate visual feedback tied to their exact hand posture, improving skill transfer.
- Assistive Tech: For users with limited mobility, the model could translate subtle head or hand gestures into rich visual cues, expanding accessibility in mixed‑reality interfaces.
Limitations & Future Work
- Dataset Bias: The model is trained on kitchen‑style egocentric videos, so performance may degrade in domains with drastically different object geometries (e.g., outdoor scenes).
- Hardware Requirements: Real‑time inference still needs a modern GPU; scaling to mobile XR headsets will require further model compression.
- Long‑Term Consistency: While short interactions are stable, maintaining coherent object states over extended sequences remains challenging.
- Future Directions: The authors plan to explore multi‑modal conditioning (audio, haptics), larger and more diverse video corpora, and lightweight architectures for on‑device deployment.
Authors
- Linxi Xie
- Lisong C. Sun
- Ashley Neall
- Tong Wu
- Shengqu Cai
- Gordon Wetzstein
Paper Information
- arXiv ID: 2602.18422v1
- Categories: cs.CV
- Published: February 20, 2026