[Paper] Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control
Source: arXiv - 2602.18422v1
Overview
The paper presents Generated Reality, an approach for creating interactive, egocentric video worlds that respond to a user's head orientation and fine-grained hand movements. By coupling 3D motion tracking with a diffusion-based video generator, the authors enable realistic hand-object interactions in XR experiences, a capability that current text- or keyboard-controlled video models lack.
Key Contributions
- Human-centric conditioning: Introduces a novel way to feed 3D head pose and joint-level hand pose data into a diffusion transformer, allowing fine-grained control of the generated scene.
- Bidirectional teacher model: Trains a powerful, non‑causal video diffusion model that can look both forward and backward in time to learn high‑quality dynamics.
- Causal distillation pipeline: Distills the bidirectional teacher into a fast, causal (real‑time) model suitable for interactive XR applications.
- Empirical validation: Conducts user studies showing that participants complete tasks faster and feel more in control compared with baseline video generators.
- Open‑source implementation: Releases code, pretrained weights, and a demo pipeline for the community to build upon.
Methodology
Data Representation
- Head pose: 6‑DoF (position + orientation) captured from an XR headset.
- Hand pose: 21‑joint skeletal data from hand‑tracking cameras or gloves.
- Both are encoded as continuous vectors and concatenated with a temporal token for each video frame.
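The per-frame conditioning described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function name, the single-hand assumption, and the sinusoidal temporal token are all assumptions made here for brevity.

```python
import math

def encode_frame_condition(head_pose, hand_joints, frame_idx, num_frames):
    """Flatten per-frame pose data into one conditioning vector.

    head_pose:   6 floats (x, y, z, roll, pitch, yaw) from the headset
    hand_joints: 21 (x, y, z) joint positions from hand tracking
    frame_idx:   temporal position, encoded as a simple sinusoidal token
    """
    assert len(head_pose) == 6
    assert len(hand_joints) == 21
    flat_hand = [c for joint in hand_joints for c in joint]  # 63 values
    t = frame_idx / max(num_frames - 1, 1)
    temporal_token = [math.sin(math.pi * t), math.cos(math.pi * t)]
    # 6 (head) + 63 (hand) + 2 (temporal) = 71-dim vector for this frame
    return list(head_pose) + flat_hand + temporal_token
```

In a real system this vector would be projected by a learned embedding layer before entering the transformer; here it only shows how the continuous pose streams and the temporal token are concatenated per frame.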
Conditioning Strategy
- The authors evaluate several diffusion transformer conditioning schemes (cross‑attention, concatenation, FiLM) and find that cross‑attention with learned pose embeddings yields the most stable control.
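To make the winning scheme concrete, here is a toy single-head cross-attention step in pure Python: video tokens act as queries, pose embeddings as keys/values. The identity projections, residual add, and equal token dimensions are simplifying assumptions, not details from the paper.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(video_tokens, pose_tokens):
    """Each video token attends over the pose tokens (single head,
    identity Q/K/V projections for brevity, residual connection)."""
    d = len(pose_tokens[0])
    out = []
    for q in video_tokens:
        # scaled dot-product scores against every pose token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in pose_tokens]
        weights = softmax(scores)
        # weighted mixture of pose tokens (values)
        mixed = [sum(w * v[j] for w, v in zip(weights, pose_tokens))
                 for j in range(d)]
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out
```

In the actual model the pose embeddings would be learned, and this block would sit inside each transformer layer; the sketch only shows why cross-attention gives every video token direct access to the pose signal.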
Teacher Model (Bidirectional Diffusion)
- A video diffusion transformer is trained on a large egocentric video dataset (e.g., EPIC‑Kitchens) where each frame is paired with the corresponding pose data.
- The model predicts noise both forward and backward in time, giving it a global view of motion dynamics.
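One teacher training step can be sketched under a standard noise-prediction (epsilon-matching) diffusion objective, which is an assumption here; the paper's exact loss may differ. `noise_clip`, the passed-in `model`, and `alpha_bar_t` (the cumulative noise-schedule product at timestep t) are illustrative names. The key point is that the teacher receives the whole noisy clip at once, so its prediction for any frame may use both past and future frames.

```python
import math

def noise_clip(frames, eps, alpha_bar_t):
    """Forward process q(x_t | x_0): sqrt(ab)*x0 + sqrt(1-ab)*eps, frame-wise."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [[a * x + b * e for x, e in zip(f, ef)]
            for f, ef in zip(frames, eps)]

def teacher_loss(frames, eps, model, alpha_bar_t):
    """MSE between the model's noise estimate and the true noise eps."""
    noisy = noise_clip(frames, eps, alpha_bar_t)
    pred = model(noisy, alpha_bar_t)  # bidirectional: full clip as input
    n = sum(len(f) for f in frames)
    return sum((p - e) ** 2
               for pf, ef in zip(pred, eps)
               for p, e in zip(pf, ef)) / n
```

A perfect model would recover `eps` exactly and drive this loss to zero; training minimizes it over random clips, noise draws, and timesteps.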
Distillation to a Causal Student
- Using knowledge distillation, the teacher’s predictions are transferred to a causal diffusion model that only sees past frames, enabling real‑time generation as new pose inputs arrive.
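The causality constraint can be expressed as one of several possible distillation losses; the sketch below is an assumption about the general shape, not the paper's exact objective. The student's prediction for frame i is computed from the prefix `noisy_clip[:i+1]` only, and is regressed onto the bidirectional teacher's per-frame estimate.

```python
def distill_loss(noisy_clip, teacher, student, alpha_bar_t):
    """Causal student: its prediction for frame i may only use frames
    [0..i]. Train it to match the bidirectional teacher's estimates."""
    target = teacher(noisy_clip, alpha_bar_t)          # sees all frames
    pred = [student(noisy_clip[: i + 1], alpha_bar_t)  # past-only context
            for i in range(len(noisy_clip))]
    n = sum(len(f) for f in noisy_clip)
    return sum((p - q) ** 2
               for pf, tf in zip(pred, target)
               for p, q in zip(pf, tf)) / n
```

Because the targets come from a model with a global view of the motion, the student inherits long-range dynamics it could not easily learn from past frames alone.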
Interactive Loop
- At runtime, the XR system streams live head/hand poses to the causal model, which instantly generates the next video frame, creating a seamless loop of perception‑action‑generation.
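The perception-action-generation loop reduces to a short driver, sketched here with hypothetical names (`pose_stream`, `generate_frame`, `context_len` are assumptions): each incoming head/hand pose conditions the causal model on a sliding window of recently generated frames.

```python
def interactive_loop(pose_stream, generate_frame, context_len=16):
    """Stream live poses into the causal model, one frame per pose.

    pose_stream:    iterable of per-frame pose conditioning vectors
    generate_frame: causal model call (recent frames, pose) -> next frame
    context_len:    how many past frames the model may attend to
    """
    history = []
    for pose in pose_stream:
        frame = generate_frame(history[-context_len:], pose)
        history.append(frame)
        yield frame  # hand off to the XR compositor as soon as it is ready
```

The sliding window bounds per-frame compute, which is what makes the reported ~28 ms per-frame latency plausible for sustained interaction.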
Results & Findings
| Metric | Generated Reality | Text‑only Baseline | Keyboard‑control Baseline |
|---|---|---|---|
| Task completion time (seconds) | 4.2 ± 0.8 | 6.7 ± 1.1 | 5.9 ± 0.9 |
| Subjective control rating (1‑5) | 4.3 | 2.7 | 3.1 |
| Visual fidelity (SSIM) | 0.78 | 0.71 | 0.73 |
| Latency per frame | 28 ms | 22 ms | 24 ms |
- Participants could pick up, rotate, and place virtual objects using only their hands, completing tasks ~29% faster than the keyboard-control baseline and ~37% faster than the text-only baseline.
- Survey responses indicated a significantly higher sense of agency, confirming that the fine‑grained pose conditioning feels natural.
- Visual quality remained high despite the causal constraint, thanks to the teacher‑student distillation.
Practical Implications
- XR Development: Game engines and AR platforms can plug in the causal model to generate responsive environments without hand‑crafting every asset, dramatically cutting content creation time.
- Remote Collaboration: Telepresence systems can render a shared virtual workspace that mirrors each participant’s hand motions, enabling realistic object manipulation over bandwidth‑limited links.
- Training Simulations: Industries such as manufacturing or surgery can build immersive simulators where trainees receive immediate visual feedback tied to their exact hand posture, improving skill transfer.
- Assistive Tech: For users with limited mobility, the model could translate subtle head or hand gestures into rich visual cues, expanding accessibility in mixed‑reality interfaces.
Limitations & Future Work
- Dataset Bias: The model is trained on kitchen‑style egocentric videos, so performance may degrade in domains with drastically different object geometries (e.g., outdoor scenes).
- Hardware Requirements: Real‑time inference still needs a modern GPU; scaling to mobile XR headsets will require further model compression.
- Long‑Term Consistency: While short interactions are stable, maintaining coherent object states over extended sequences remains challenging.
- Future Directions: The authors plan to explore multi‑modal conditioning (audio, haptics), larger and more diverse video corpora, and lightweight architectures for on‑device deployment.
Authors
- Linxi Xie
- Lisong C. Sun
- Ashley Neall
- Tong Wu
- Shengqu Cai
- Gordon Wetzstein
Paper Information
- arXiv ID: 2602.18422v1
- Categories: cs.CV
- Published: February 20, 2026