[Paper] SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model
Source: arXiv - 2512.10957v1
Overview
SceneMaker introduces a new way to generate 3‑D scenes from a single image, even when objects are heavily occluded or belong to categories the model has never seen before. By separating the “de‑occlusion” step from the actual 3‑D reconstruction and by using a unified pose‑estimation network, the authors achieve both high‑fidelity geometry and accurate object poses in challenging, open‑set environments.
Key Contributions
- Decoupled pipeline – Splits de‑occlusion (recovering hidden parts) from 3‑D object generation, allowing each module to be optimized independently.
- Open‑set de‑occlusion model – Trained on large‑scale image datasets plus a curated de‑occlusion dataset, giving the system robust priors for a wide variety of occlusion patterns.
- Unified pose estimator – Combines global self‑attention with local cross‑attention to jointly reason about object orientation and position, boosting pose accuracy.
- Open‑set 3‑D scene dataset – New benchmark that mixes indoor scenes with objects from unseen categories, used to train and evaluate the pose model.
- State‑of‑the‑art results – Demonstrates superior performance on both standard indoor datasets and the newly introduced open‑set scenes.
- Public release – Code, pretrained models, and datasets are openly available for reproducibility and downstream research.
Methodology
- De‑occlusion module – A neural network that takes a single RGB image and predicts the complete (unoccluded) appearance of each visible object. By leveraging massive image‑level data (e.g., COCO, OpenImages) and a purpose‑built de‑occlusion dataset, the model learns generic shape and texture priors that transfer to unseen object classes.
- 3‑D object generation – After de‑occlusion, each object’s full silhouette and texture are fed into a separate generative model (e.g., a voxel‑ or mesh‑based network) that reconstructs its 3‑D geometry. Because the de‑occlusion step already supplies a clean view, the geometry network can focus purely on shape synthesis (a minimal pipeline sketch follows this list).
- Unified pose estimation – A transformer‑style architecture processes both the original image and the de‑occluded outputs. Global self‑attention captures scene‑level context (e.g., room layout), while local cross‑attention aligns each object’s features with the image to predict its 6‑DoF pose (see the attention‑block sketch after this list).
- Training regime – The three components are first pre‑trained separately (de‑occlusion on image data, geometry on synthetic 3‑D models, pose on the new open‑set scene dataset) and then fine‑tuned end‑to‑end to harmonize their outputs.
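To make the decoupling concrete, here is a minimal, hypothetical sketch of how the three stages could be chained. All class and function names (ObjectAsset, deocclude, generate_geometry, estimate_pose, build_scene) are illustrative placeholders standing in for the paper’s modules, not the authors’ released API, and each stage is stubbed out.

```python
# Hypothetical sketch of a decoupled de-occlusion -> geometry -> pose pipeline.
# Stage implementations are stubs; only the interfaces matter here.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class ObjectAsset:
    amodal_rgba: np.ndarray   # de-occluded (complete) object appearance, H x W x 4
    points: np.ndarray        # reconstructed geometry as a point set, N x 3
    rotation: np.ndarray      # 3 x 3 object-to-camera rotation
    translation: np.ndarray   # 3-vector translation in metres


def deocclude(image: np.ndarray, masks: List[np.ndarray]) -> List[np.ndarray]:
    """Stage 1 (stub): predict the complete appearance of each visible object.
    Here we simply attach the visible-region mask as an alpha channel."""
    return [np.dstack([image, m.astype(image.dtype)]) for m in masks]


def generate_geometry(amodal_rgba: np.ndarray) -> np.ndarray:
    """Stage 2 (stub): reconstruct 3-D geometry from the clean, de-occluded view."""
    return np.random.default_rng(0).uniform(-0.5, 0.5, size=(1024, 3))


def estimate_pose(image: np.ndarray, amodal_rgba: np.ndarray):
    """Stage 3 (stub): unified pose estimation (see the attention sketch below)."""
    return np.eye(3), np.zeros(3)


def build_scene(image: np.ndarray, masks: List[np.ndarray]) -> List[ObjectAsset]:
    """Run the decoupled stages in sequence; each module can be swapped or
    retrained independently because the interfaces between stages are explicit."""
    assets = []
    for amodal in deocclude(image, masks):
        points = generate_geometry(amodal)
        rotation, translation = estimate_pose(image, amodal)
        assets.append(ObjectAsset(amodal, points, rotation, translation))
    return assets


# Usage with a dummy 480x640 RGB image and two object masks.
img = np.zeros((480, 640, 3), dtype=np.float32)
masks = [np.zeros((480, 640), dtype=np.float32) for _ in range(2)]
scene = build_scene(img, masks)
```

Because each stage exposes a narrow interface, the de‑occlusion, geometry, and pose modules can be optimized or replaced independently, which is the core of the decoupled design.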
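The unified pose estimator can likewise be sketched as a stack of transformer blocks that alternate global self‑attention among per‑object tokens with cross‑attention into image features. The dimensions, layer count, and 6‑D rotation output below are assumptions for illustration; the paper’s exact architecture and hyperparameters may differ.

```python
# Hedged sketch of a unified pose head: global self-attention over object
# tokens (scene context) followed by cross-attention into image features.
import torch
import torch.nn as nn


class UnifiedPoseBlock(nn.Module):
    """One decoder block: global self-attention over object tokens, then
    local cross-attention from each object token into the image features."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, obj_tokens, image_tokens):
        # Global self-attention: every object token attends to every other
        # object, capturing scene-level context such as room layout.
        q = self.norm1(obj_tokens)
        obj_tokens = obj_tokens + self.self_attn(q, q, q)[0]
        # Local cross-attention: each object token queries the image features
        # to localise itself and refine its orientation cues.
        q = self.norm2(obj_tokens)
        obj_tokens = obj_tokens + self.cross_attn(q, image_tokens, image_tokens)[0]
        return obj_tokens + self.ffn(self.norm3(obj_tokens))


class PoseHead(nn.Module):
    """Stack of blocks followed by rotation/translation regressors."""

    def __init__(self, dim: int = 256, depth: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList([UnifiedPoseBlock(dim) for _ in range(depth)])
        self.rot_head = nn.Linear(dim, 6)    # 6-D continuous rotation representation
        self.trans_head = nn.Linear(dim, 3)  # translation (x, y, z) in camera frame

    def forward(self, obj_tokens, image_tokens):
        for block in self.blocks:
            obj_tokens = block(obj_tokens, image_tokens)
        return self.rot_head(obj_tokens), self.trans_head(obj_tokens)


# Usage: 5 object tokens, 900 image-patch tokens, feature dimension 256.
rot6d, trans = PoseHead()(torch.randn(1, 5, 256), torch.randn(1, 900, 256))
```

In this sketch the 6‑D rotation output is a common continuous parameterisation that can be mapped back to a rotation matrix via Gram–Schmidt orthogonalisation, and the translation head regresses the object centre in camera coordinates.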
Results & Findings
- Geometry quality – On standard indoor benchmarks (e.g., ScanNet), SceneMaker’s reconstructed meshes achieve 12% higher IoU than prior methods that couple de‑occlusion with reconstruction.
- Pose accuracy – The unified pose estimator reduces median rotation error from 9.8° to 5.3° and translation error from 6.4 cm to 3.7 cm on the open‑set scene test set (see the metric sketch after this list).
- Robustness to occlusion – When up to 70% of an object is hidden, the decoupled pipeline still recovers recognizable geometry, whereas monolithic baselines fail catastrophically.
- Open‑set generalization – On objects from categories absent during training, SceneMaker maintains >80% of its baseline performance, confirming the benefit of the diverse de‑occlusion priors.
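For reference, rotation and translation errors of this kind are conventionally computed as the geodesic distance between rotation matrices and the Euclidean distance between translation vectors, with the median taken over the test set. The snippet below is a generic illustration of those formulas, not the paper’s evaluation code.

```python
# Generic pose-error metrics (not the paper's evaluation script).
import numpy as np


def rotation_error_deg(R_pred: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic distance between two 3x3 rotation matrices, in degrees."""
    cos_angle = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))


def translation_error_cm(t_pred: np.ndarray, t_gt: np.ndarray) -> float:
    """Euclidean distance between predicted and ground-truth translations (metres -> cm)."""
    return float(np.linalg.norm(t_pred - t_gt) * 100.0)


# Median errors over a test set of (R_pred, R_gt, t_pred, t_gt) tuples:
# median_rot = np.median([rotation_error_deg(Rp, Rg) for Rp, Rg, _, _ in samples])
# median_trans = np.median([translation_error_cm(tp, tg) for _, _, tp, tg in samples])
```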
Practical Implications
- AR/VR content creation – Developers can generate full 3‑D assets from a single photo of a cluttered room, dramatically cutting manual modeling time.
- Robotics & navigation – Accurate pose estimates for unseen objects enable better scene understanding for autonomous agents operating in dynamic, real‑world environments.
- E‑commerce & virtual try‑on – Retailers can reconstruct products from user‑uploaded images, even when the items are partially hidden behind other objects.
- Game development – Rapid prototyping of interior scenes becomes feasible: designers snap a photo of a real space, and SceneMaker populates it with fully textured 3‑D models ready for game engines.
Limitations & Future Work
- Dependence on high‑quality de‑occlusion data – The system’s performance drops when the occlusion patterns diverge significantly from those seen during training (e.g., extreme translucency).
- Scalability to large outdoor scenes – Current experiments focus on indoor environments; extending the pipeline to city‑scale outdoor settings remains an open challenge.
- Real‑time constraints – The multi‑stage architecture incurs latency that may be prohibitive for interactive applications; future work could explore model compression or joint inference optimizations.
SceneMaker’s open‑source release invites the community to build on these ideas, paving the way for more flexible and robust 3‑D scene generation in the wild.
Authors
- Yukai Shi
- Weiyu Li
- Zihao Wang
- Hongyang Li
- Xingyu Chen
- Ping Tan
- Lei Zhang
Paper Information
- arXiv ID: 2512.10957v1
- Categories: cs.CV, cs.AI
- Published: December 11, 2025