[Paper] Dexterous World Models

Published: December 19, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.17907v1

Overview

The paper “Dexterous World Models” presents a video‑diffusion system that turns a static 3‑D reconstruction of a room into a dynamic, interactive scene driven by egocentric hand motions. Given a rendered view of the scene and a sequence of hand‑mesh frames, the model produces temporally coherent videos of realistic human‑object interactions (grasping, opening, moving objects) while keeping the camera viewpoint and scene geometry consistent. This bridges the gap between high‑fidelity digital twins and embodied interactivity, opening new possibilities for simulation, training, and content creation.

Key Contributions

  • Scene‑action‑conditioned diffusion model (DWM) that generates videos of dexterous hand interactions within a static 3‑D environment.
  • Dual conditioning strategy: (1) static scene renderings following a prescribed camera trajectory for spatial consistency, and (2) egocentric hand‑mesh renderings that encode geometry and motion cues for action‑driven dynamics.
  • Hybrid interaction dataset combining synthetic egocentric videos (perfect alignment of hand, camera, and objects) with real‑world fixed‑camera recordings (rich, realistic object physics).
  • Demonstration of physically plausible interactions (grasp, pull, open, push) that respect both hand kinematics and scene constraints, something prior digital‑twin pipelines lacked.
  • First video‑diffusion framework that can be used as an “embodied simulation engine” for generating interactive digital twins from egocentric action inputs.

Methodology

  1. Input Representation

    • Static scene rendering: a rasterized view of the 3‑D environment captured along a user‑specified camera path.
    • Egocentric hand mesh sequence: per‑frame hand geometry rendered from the wearer’s perspective, providing both shape and motion information.
  2. Diffusion Video Generator

    • Built on a latent video diffusion backbone (similar to Imagen Video / Stable Video Diffusion).
    • The diffusion process is conditioned on the concatenated scene and hand embeddings at each timestep, keeping the generated frames aligned with the underlying geometry (a minimal training‑step sketch follows this list).
  3. Training Data Construction

    • Synthetic egocentric clips: generated in a physics engine where hand meshes, object poses, and camera trajectories are perfectly synchronized. These give the model exact supervision for how hand actions affect objects.
    • Real‑world fixed‑camera clips: captured with a static camera in everyday environments, providing diverse object dynamics and textures. The model learns to generalize from these to unseen scenes.
  4. Losses & Regularization

    • Standard diffusion denoising loss plus a spatio‑temporal consistency loss that penalizes drift between the rendered scene and the generated video.
    • A physics‑aware regularizer encourages plausible contact forces (e.g., objects don’t pass through the hand).
  5. Inference

    • Users supply a 3‑D scene file, a camera trajectory, and a hand‑motion capture (e.g., from a glove or a motion‑capture system).
    • DWM iteratively denoises a latent video, outputting a high‑resolution, temporally smooth video of the interaction (a minimal sampling sketch follows this list).
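
The following PyTorch‑style sketch shows how the dual conditioning and the combined training loss described above could fit together. Everything here is an illustrative assumption rather than the authors' implementation: the encoder/denoiser/VAE modules are passed in as placeholders, the noise schedule is a toy linear one, the 0.1 weight is arbitrary, and the consistency term naively compares the decoded prediction against the static scene render, whereas the paper only describes its spatio‑temporal loss at a high level.

```python
import torch
import torch.nn.functional as F

def training_step(denoiser, scene_enc, hand_enc, vae,
                  video, scene_renders, hand_renders):
    """One diffusion training step on a clip of T frames.

    video         : (B, T, 3, H, W) ground-truth interaction clip
    scene_renders : (B, T, 3, H, W) static scene rendered along the camera path
    hand_renders  : (B, T, 3, H, W) egocentric hand-mesh renderings
    """
    # Encode the target clip into the latent space of a video VAE (placeholder).
    z0 = vae.encode(video)                                   # (B, T, C, h, w)

    # Sample a diffusion timestep and corrupt the latents (toy linear schedule).
    t = torch.randint(0, 1000, (z0.shape[0],), device=z0.device)
    alpha = (1.0 - t.float() / 1000.0).view(-1, 1, 1, 1, 1)
    noise = torch.randn_like(z0)
    zt = alpha.sqrt() * z0 + (1.0 - alpha).sqrt() * noise

    # Dual conditioning: per-frame scene and hand embeddings, concatenated
    # along the channel dimension.
    cond = torch.cat([scene_enc(scene_renders), hand_enc(hand_renders)], dim=2)

    # The denoiser predicts the injected noise given latents, timestep, cond.
    pred = denoiser(zt, t, cond)

    # Standard denoising loss ...
    denoise_loss = F.mse_loss(pred, noise)

    # ... plus a naive scene-consistency term: reconstruct z0 from the
    # prediction, decode it, and penalize drift away from the static render.
    z0_hat = (zt - (1.0 - alpha).sqrt() * pred) / alpha.sqrt()
    consistency_loss = F.mse_loss(vae.decode(z0_hat), scene_renders)

    return denoise_loss + 0.1 * consistency_loss
```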
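
At inference time, a hypothetical sampling loop might look like the sketch below. The crude Euler‑style update, the 50‑step budget, and the latent shape are placeholders; a real implementation would use a proper sampler (e.g., DDIM) and the schedule the model was trained with.

```python
import torch

@torch.no_grad()
def generate_interaction(denoiser, scene_enc, hand_enc, vae,
                         scene_renders, hand_renders, steps=50):
    """scene_renders / hand_renders: (1, T, 3, H, W), rendered along the
    user-supplied camera path and from the captured hand motion."""
    cond = torch.cat([scene_enc(scene_renders), hand_enc(hand_renders)], dim=2)

    # Start from pure noise in the latent video space (shape is illustrative).
    z = torch.randn(1, scene_renders.shape[1], 4, 64, 64,
                    device=scene_renders.device)

    for i in reversed(range(steps)):
        t = torch.full((1,), i, device=z.device)
        pred = denoiser(z, t, cond)
        z = z - pred / steps          # crude Euler-style denoising update

    return vae.decode(z)              # (1, T, 3, H, W) generated video
```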

Results & Findings

  • Qualitative: The generated videos show smooth hand‑object contact, realistic articulated object motion (e.g., a drawer sliding open), and lighting and shadows consistent with the static scene.
  • Quantitative:
    • Pose consistency (hand and object) improved by ~25 % over baseline video‑diffusion models that lack dual conditioning.
    • Physical plausibility metric (based on a learned contact classifier) increased from 0.62 to 0.84 (a toy version of such a metric is sketched after this list).
    • User study: 87 % of participants rated DWM videos as “believable” compared to 53 % for prior methods.
  • Ablation: Removing the hand‑mesh conditioning caused the model to hallucinate unrealistic object motions; dropping the static‑scene conditioning led to camera drift and broken spatial coherence.
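
The plausibility number above is easiest to read as an average per‑frame score from a learned contact classifier. The toy sketch below shows one way such a score could be computed; the classifier interface and the exact definition of the paper's metric are assumptions here.

```python
import torch

@torch.no_grad()
def plausibility_score(contact_classifier, video):
    """video: (T, 3, H, W) generated clip; contact_classifier is assumed to
    return one logit per frame indicating plausible hand-object contact."""
    probs = torch.sigmoid(contact_classifier(video))   # (T,) probabilities
    return probs.mean().item()                          # scalar in [0, 1]
```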

Practical Implications

  • Interactive Content Creation: Game studios and AR/VR developers can generate high‑quality interaction footage without hand‑animating every object, simply by feeding motion‑capture data.
  • Robotics & Simulation: DWM can serve as a fast, visual simulator for training policies that need to understand the visual consequences of dexterous manipulation in realistic environments.
  • Digital Twin Maintenance: Facility managers could preview how a worker’s actions (e.g., opening a valve) would look in a digital replica, aiding training and safety analysis.
  • E‑learning & Remote Collaboration: Instructors can demonstrate complex manual procedures (assembly, repair) within a virtual replica of the actual workspace, using only hand‑track data.

Limitations & Future Work

  • Physics Fidelity: While visually plausible, the model does not enforce strict physical laws (e.g., conservation of momentum), limiting its use for high‑precision engineering simulations.
  • Generalization to Unseen Objects: Performance drops when the target object’s geometry or material properties differ drastically from the training set; future work could integrate a learned physics engine or object‑aware embeddings.
  • Real‑time Capability: Current diffusion inference is still computationally heavy (several seconds per second of video). Optimizations such as latent‑space distillation or hybrid autoregressive‑diffusion pipelines are needed for interactive applications.
  • Hand‑tracking Accuracy: The system assumes reasonably accurate egocentric hand meshes; noisy or low‑resolution capture can degrade output quality. Incorporating uncertainty modeling could make DWM more robust.

Dexterous World Models marks a significant step toward truly interactive digital twins, turning static 3‑D scans into living, manipulable environments driven by human actions. As diffusion models continue to accelerate, we can expect tighter integration with physics simulators and real‑time pipelines, making embodied simulation a mainstream tool for developers across gaming, robotics, and enterprise VR.

Authors

  • Byungjun Kim
  • Taeksoo Kim
  • Junyoung Lee
  • Hanbyul Joo

Paper Information

  • arXiv ID: 2512.17907v1
  • Categories: cs.CV
  • Published: December 19, 2025
