[Paper] MessyKitchens: Contact-rich object-level 3D scene reconstruction

Published: March 17, 2026
4 min read
Source: arXiv (2603.16868v1)

Overview

The MessyKitchens paper tackles a long‑standing bottleneck in 3D vision: reconstructing cluttered, real‑world scenes at the level of individual objects while preserving physically plausible contacts (no inter‑penetration). By releasing a high‑quality dataset of messy kitchen environments and a new multi‑object reconstruction model, the authors push monocular 3D scene understanding closer to the needs of robotics, AR/VR, and game development.

Key Contributions

  • MessyKitchens dataset – 1,200+ real kitchen scans with per‑object 3D meshes, precise poses, and annotated contact maps, far surpassing prior benchmarks in realism and annotation fidelity.
  • Multi‑Object Decoder (MOD) – an extension of the SAM‑3D single‑object pipeline that jointly predicts shapes, poses, and contact constraints for all objects in a scene.
  • Physical plausibility layer – a differentiable non‑penetration loss that explicitly enforces realistic object contacts during training.
  • Comprehensive evaluation – demonstrates >30 % reduction in inter‑object penetration and up to 15 % boost in pose/shape registration accuracy across three public datasets (including ScanNet and 3RScan).
  • Open‑source release – dataset, training code, and pre‑trained MOD models are made publicly available, enabling immediate experimentation.

Methodology

  1. Data Capture & Annotation

    • Kitchens are photographed with a single RGB camera while a handheld 3D scanner captures dense point clouds.
    • A semi‑automatic pipeline aligns the scans to the images, extracts individual object meshes, and computes contact surfaces via mesh intersection analysis.
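The contact-surface step above can be approximated with a simple proximity test between per-object point clouds. This is a minimal sketch, not the authors' pipeline: the `eps` threshold and the point-cloud proxy for mesh intersection analysis are assumptions for illustration.

```python
import numpy as np

def contact_mask(pts_a: np.ndarray, pts_b: np.ndarray, eps: float = 0.005) -> np.ndarray:
    """Flag each point of object A as 'in contact' if any point of object B
    lies within eps metres (a cheap stand-in for mesh intersection analysis)."""
    # Pairwise squared distances between the two point sets: shape (Na, Nb)
    d2 = ((pts_a[:, None, :] - pts_b[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1) <= eps ** 2

# Toy example: two points of A touch B, the third is 20 cm away
a = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.0, 0.2]])
b = np.array([[0.0, 0.0, 0.001], [0.1, 0.0, 0.002]])
mask = contact_mask(a, b)  # → [True, True, False]
```

A production pipeline would run this on dense scanner point clouds (or true mesh–mesh intersection), but the thresholded-distance idea is the same.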
  2. Base Architecture (SAM‑3D)

    • A transformer‑based encoder ingests a single RGB image and produces a latent representation for each detected object region (using a pretrained Mask R‑CNN detector).
    • The original SAM‑3D decoder reconstructs a single object’s shape and pose from its latent code.
  3. Multi‑Object Decoder (MOD)

    • Shared latent space: All object latents are concatenated and passed through a cross‑attention module, allowing objects to “talk” to each other.
    • Contact‑aware heads: In addition to shape and pose heads, MOD predicts a binary contact mask for each object pair.
    • Physical loss: A differentiable penalty term discourages mesh intersections and encourages predicted contacts to match the ground‑truth contact map.
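The two MOD ingredients above can be sketched in a few lines of numpy. This is an illustrative toy, not the paper's implementation: the single attention head with identity projections and the sphere approximation for the non‑penetration penalty are both assumptions.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents: np.ndarray) -> np.ndarray:
    """Shared latent space: every object latent attends to all others.
    One attention head with identity Q/K/V projections, for illustration."""
    d = latents.shape[-1]
    attn = softmax(latents @ latents.T / np.sqrt(d))  # (n_obj, n_obj)
    return attn @ latents                             # mixed latents, same shape

def penetration_penalty(centers, radii) -> float:
    """Non-penetration loss on a bounding-sphere approximation: a hinge on
    the signed gap between every object pair (zero once they separate)."""
    loss = 0.0
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            gap = np.linalg.norm(centers[i] - centers[j]) - (radii[i] + radii[j])
            loss += max(0.0, -gap) ** 2  # penalise overlap only
    return loss

# Two 10 cm spheres whose centres are 10 cm apart overlap by 10 cm
loss = penetration_penalty([np.zeros(3), np.array([0.1, 0.0, 0.0])], [0.1, 0.1])
```

The real model presumably penalises mesh-level intersection rather than bounding spheres, but the hinge-on-signed-gap structure is the standard way to make such a penalty differentiable.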
  4. Training & Inference

    • The model is trained end‑to‑end on MessyKitchens with a multi‑task loss (shape, pose, contact, and physical plausibility).
    • At inference time, a single RGB image yields a full 3‑D scene reconstruction in under 200 ms on an RTX 3080 GPU.
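The multi-task objective is, in the usual formulation, a weighted sum of the four terms. The weights below are hypothetical placeholders (the summary does not specify them), shown only to make the training setup concrete.

```python
def total_loss(l_shape: float, l_pose: float, l_contact: float, l_phys: float,
               w=(1.0, 1.0, 0.5, 0.1)) -> float:
    """End-to-end multi-task objective: shape + pose + contact + physical
    plausibility, combined as a weighted sum. Weights `w` are assumed."""
    return w[0] * l_shape + w[1] * l_pose + w[2] * l_contact + w[3] * l_phys
```

In practice such weights are tuned on a validation split, since the physical-plausibility term competes with raw reconstruction accuracy.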

Results & Findings

| Dataset | Pose/Shape IoU ↑ | Avg. Penetration Volume ↓ |
| --- | --- | --- |
| MessyKitchens (baseline SAM‑3D) | 0.62 | 0.018 m³ |
| MessyKitchens (MOD, ours) | 0.71 (+14 %) | 0.009 m³ (−50 %) |
| ScanNet | 0.58 → 0.66 | 0.022 m³ → 0.011 m³ |
| 3RScan | 0.55 → 0.63 | 0.025 m³ → 0.012 m³ |
  • Registration accuracy improves consistently across all test sets, confirming that joint reasoning helps resolve occlusions.
  • Contact prediction achieves an average F1‑score of 0.84, meaning the model reliably identifies where objects touch.
  • Runtime remains real‑time, showing that the added multi‑object reasoning does not sacrifice speed.
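For reference, the contact F1 reported above is computed over binary per-pair predictions. A minimal sketch (the pair ordering and inputs are assumptions, not the paper's evaluation code):

```python
def contact_f1(pred: list, gt: list) -> float:
    """F1 over binary contact flags, one entry per object pair."""
    tp = sum(p and g for p, g in zip(pred, gt))           # predicted & real contact
    fp = sum(p and not g for p, g in zip(pred, gt))       # predicted, not real
    fn = sum(g and not p for p, g in zip(pred, gt))       # real, missed
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# 4 object pairs: one false positive, no misses → precision 2/3, recall 1.0
score = contact_f1([1, 1, 0, 1], [1, 0, 0, 1])  # → 0.8
```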

Practical Implications

  • Robotics & Manipulation – Robots can now infer not just where objects are, but also how they support each other, enabling safer grasp planning and better scene rearrangement.
  • AR/VR Content Creation – Developers can generate physically plausible 3‑D assets from a single photo, dramatically cutting the time needed for manual mesh editing.
  • Game Engine Integration – MOD’s contact map can be fed directly into physics engines (e.g., Unity, Unreal) to auto‑generate collision meshes that respect real‑world contacts.
  • E‑commerce & Virtual Staging – Retailers can reconstruct cluttered product displays from catalog images, allowing customers to explore realistic 3‑D room layouts.

Limitations & Future Work

  • Domain Specificity – The dataset focuses on kitchen environments; performance on highly structured or outdoor scenes remains untested.
  • Single‑View Ambiguity – Extremely heavy occlusions still cause shape hallucinations; incorporating multi‑view or depth cues could improve robustness.
  • Contact Granularity – Current contact masks are binary; future work may model friction, compliance, or dynamic forces for richer physical simulation.

The MessyKitchens project marks a significant step toward truly contact‑aware 3‑D scene reconstruction, opening new avenues for developers building perception‑driven applications. The open‑source release ensures that the community can build on this foundation right away.

Authors

  • Junaid Ahmed Ansari
  • Ran Ding
  • Fabio Pizzati
  • Ivan Laptev

Paper Information

  • arXiv ID: 2603.16868v1
  • Categories: cs.CV, cs.AI, cs.RO
  • Published: March 17, 2026
