[Paper] CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives

Published: December 16, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.14696v1

Overview

The paper presents CRISP, a pipeline that turns ordinary monocular video into a physics‑ready simulation of both the human performer and the surrounding environment. By fitting simple planar primitives to a point‑cloud reconstruction and leveraging contact cues from the person’s pose, CRISP produces clean, collision‑free geometry that can be fed directly into a reinforcement‑learning (RL) controller. The result is a dramatic drop in motion‑tracking failures (from 55.2 % to 6.9 % on EMDB) and a roughly 43 % simulation speed‑up, opening the door to large‑scale real‑to‑sim pipelines in robotics, AR/VR, and interactive AI.

Key Contributions

  • Contact‑guided scene reconstruction – Uses human pose and contact points to infer occluded surfaces (e.g., the hidden part of a chair seat).
  • Planar primitive fitting – A lightweight clustering step over depth, surface normals, and optical flow yields convex, simulation‑ready geometry instead of noisy meshes.
  • Physics‑in‑the‑loop validation – The recovered human and scene are tested by driving a humanoid RL controller, ensuring physical plausibility.
  • Substantial performance gains – Reduces motion‑tracking failure from 55.2 % to 6.9 % on benchmark datasets and speeds up RL simulation by ~43 %.
  • Broad applicability – Demonstrated on controlled datasets (EMDB, PROX) as well as in‑the‑wild videos, Internet clips, and even AI‑generated (Sora) footage.

Methodology

  1. Monocular video → dense point cloud

    • Off‑the‑shelf multi‑view structure‑from‑motion (SfM) and depth‑estimation networks produce a per‑frame point cloud with associated surface normals and optical flow.
  2. Clustering into planar primitives

    • Points are grouped by similarity in depth, normal direction, and motion consistency.
    • Each cluster is approximated by a convex planar primitive (e.g., a rectangle for a tabletop). This yields a tidy, low‑poly scene representation that is easy for physics engines to handle (see the plane‑fitting sketch after this list).
  3. Contact‑guided occlusion completion

    • Human pose estimation identifies contact points (feet on floor, hands on a chair, etc.).
    • When a contact surface is partially hidden, the algorithm extrapolates the missing geometry using the known pose and the planar primitive model.
  4. Human motion extraction

    • A separate pose‑tracking network recovers the 3D skeleton over time.
    • The skeleton is retargeted to a full‑body humanoid model with joint limits and dynamics.
  5. Physics validation via RL

    • The reconstructed scene and humanoid are fed into a reinforcement‑learning controller that tries to reproduce the observed motion.
    • If the controller can follow the trajectory without collisions or instability, the reconstruction is accepted; otherwise, the pipeline iterates to refine geometry or contacts (a schematic of this accept‑and‑refine loop appears below).
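
To make steps 2 and 3 concrete, here is a minimal sketch of fitting a convex planar primitive to one point cluster and growing it to cover the person’s contact points. It assumes plain numpy arrays as inputs; the helpers `fit_plane` and `planar_primitive` and the rectangle parameterization are illustrative stand‑ins, not the paper’s actual implementation.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through a 3D point cluster via SVD.

    The plane normal is the direction of smallest variance,
    i.e. the last right-singular vector of the centered points.
    """
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    normal = vt[-1] / np.linalg.norm(vt[-1])
    return centroid, normal

def planar_primitive(points, contacts=None):
    """Fit a rectangle to the cluster in the plane's own 2D frame.

    If contact points (e.g. foot or hand positions from the pose) are
    given, the rectangle is grown to cover their in-plane projections,
    a crude stand-in for contact-guided completion of occluded surfaces.
    """
    centroid, normal = fit_plane(points)
    # Build an orthonormal in-plane basis (u, v) perpendicular to the normal.
    u = np.cross(normal, [0.0, 0.0, 1.0])
    if np.linalg.norm(u) < 1e-6:              # horizontal plane: pick another axis
        u = np.cross(normal, [1.0, 0.0, 0.0])
    u /= np.linalg.norm(u)
    v = np.cross(normal, u)
    basis = np.stack([u, v], axis=1)          # 3x2 projection onto the plane
    uv = (points - centroid) @ basis          # observed points in plane coords
    if contacts is not None and len(contacts) > 0:
        uv = np.vstack([uv, (contacts - centroid) @ basis])
    lo, hi = uv.min(axis=0), uv.max(axis=0)   # contact-inclusive extents
    return {"centroid": centroid, "normal": normal,
            "axes": (u, v), "extent": (lo, hi)}
```

In a full pipeline, `points` would be one cluster from the depth/normal/flow grouping of step 2, and `contacts` the pose joints labeled as touching that surface in step 3.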

The whole process is fully automated and runs on a single GPU, making it practical for large video collections.
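
The acceptance test in step 5 amounts to a small accept‑and‑refine loop. The sketch below is schematic only: `simulate_tracking` and `refine_primitives` are hypothetical placeholders for a physics rollout and a geometry/contact refinement routine, and the error threshold is an arbitrary example, not a value from the paper.

```python
MAX_ITERS = 5
ERROR_THRESHOLD = 0.05   # example tolerance on per-frame tracking error (meters)

def validate(scene, motion, controller):
    """Accept the reconstruction once the RL controller can track the
    observed motion without interpenetration; otherwise refine and retry."""
    for _ in range(MAX_ITERS):
        result = simulate_tracking(controller, scene, motion)  # physics rollout
        if (result.max_tracking_error < ERROR_THRESHOLD
                and not result.interpenetrated):
            return scene            # physically plausible: accept as-is
        # Otherwise nudge geometry and contact estimates toward consistency.
        scene = refine_primitives(scene, result.contact_violations)
    raise RuntimeError("reconstruction rejected after refinement budget")
```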

Results & Findings

| Dataset | Baseline Failure Rate | CRISP Failure Rate | Speed‑up (RL steps/sec) |
| --- | --- | --- | --- |
| EMDB | 55.2 % | 6.9 % | +43 % |
| PROX | 48.7 % | 7.4 % | +41 % |
  • Failure rate measures how often the RL controller could not reproduce the recorded motion due to geometry errors or interpenetrations.
  • Simulation throughput improves because planar primitives reduce collision‑checking complexity (see the toy illustration after this list).
  • Qualitative tests on YouTube‑style clips and Sora‑generated videos show that CRISP can reconstruct plausible chairs, tables, and floors even when only a few frames show the object.
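
As a back‑of‑the‑envelope illustration of the throughput point above: under a naive all‑pairs broad phase (real engines use acceleration structures such as BVHs, which narrow but do not erase the gap), candidate collision pairs grow quadratically with object count. The counts below are hypothetical, not figures from the paper.

```python
def candidate_pairs(n):
    """Unordered pairs a naive broad phase must consider for n objects."""
    return n * (n - 1) // 2

print(candidate_pairs(50_000))  # a 50k-triangle noisy mesh: 1,249,975,000 pairs
print(candidate_pairs(12))      # a dozen convex planar primitives: 66 pairs
```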

Overall, the authors demonstrate that a contact‑aware approach yields far more reliable and faster simulations than prior data‑driven, physics‑agnostic pipelines.

Practical Implications

  • Robotics – Robots can be trained in simulation on environments that closely match the real world captured by a single camera, reducing the “reality gap” for tasks such as household assistance or warehouse navigation.
  • AR/VR content creation – Game developers and XR designers can generate interactive scenes from consumer video footage without manual modeling, enabling rapid prototyping of immersive experiences.
  • Digital twins for safety analysis – Engineers can reconstruct a worker’s motion and surrounding equipment from surveillance footage to evaluate ergonomics or collision risk in a virtual sandbox.
  • Data‑efficient RL – Cleaner geometry means fewer physics violations, allowing RL agents to learn faster and with fewer simulation steps, cutting compute costs.

Because the pipeline works on “in‑the‑wild” videos, it can be scaled to massive public video archives, potentially creating a library of ready‑to‑simulate human‑environment interactions.

Limitations & Future Work

  • Planar assumption – The method excels on environments dominated by flat surfaces; highly curved or organic objects (e.g., sofas, plants) may be oversimplified.
  • Reliance on accurate pose & depth – Errors in the upstream pose estimator or depth network can propagate, especially in low‑light or fast‑motion clips.
  • Static scene focus – Dynamic objects (moving chairs, doors) are not explicitly modeled; extending CRISP to handle moving scene elements is an open challenge.
  • Scalability of contact inference – While contact cues help fill occlusions, complex multi‑person interactions may require more sophisticated reasoning.

Future directions include integrating learned shape priors for non‑planar objects, handling dynamic scene changes, and tightening the loop between RL feedback and geometry refinement for fully autonomous real‑to‑sim pipelines.

Authors

  • Zihan Wang
  • Jiashun Wang
  • Jeff Tan
  • Yiwen Zhao
  • Jessica Hodgins
  • Shubham Tulsiani
  • Deva Ramanan

Paper Information

  • arXiv ID: 2512.14696v1
  • Categories: cs.CV, cs.GR, cs.RO
  • Published: December 16, 2025