[Paper] I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners

Published: December 15, 2025 at 01:59 PM EST
4 min read

Source: arXiv - 2512.13683v1

Overview

The paper I‑Scene shows that a pre‑trained 3D instance generator—originally built to synthesize single objects—can be “re‑programmed” to understand and generate whole scenes. By swapping dataset‑driven supervision for model‑centric spatial supervision, the authors unlock the generator’s latent knowledge about object placement, support, and symmetry, enabling it to generalize to completely new room layouts and novel object combinations without any extra scene‑level training data.

Key Contributions

  • Model‑centric spatial supervision: Replaces traditional scene‑level labeled datasets with supervision derived directly from the instance generator’s internal representations.
  • View‑centric scene formulation: Introduces a fully feed‑forward, view‑oriented coordinate system that sidesteps the canonical‑space tricks used in prior work; a minimal coordinate‑transform sketch follows this list.
  • Demonstrated generalization: The re‑programmed generator correctly infers proximity, support, and symmetry even when trained on randomly assembled objects, proving that spatial reasoning is an emergent property of the instance model.
  • Implicit spatial learner: Shows that a 3D instance generator can act as a “foundation model” for interactive scene understanding, opening a path toward plug‑and‑play scene generation pipelines.
  • Extensive evaluation: Quantitative metrics (e.g., placement accuracy, collision avoidance) and qualitative visualizations confirm that I‑Scene outperforms existing scene generators on unseen layouts and novel object mixes.
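
The view‑centric formulation can be illustrated with a few lines of linear algebra: rather than anchoring objects in a global canonical frame, each object's pose is re‑expressed relative to the current camera. The sketch below is illustrative only; the `look_at` helper and matrix conventions are assumptions, not the paper's code.

```python
# Minimal sketch of a view-centric pose: express an object's world transform
# in the camera's frame instead of a global canonical frame.
import numpy as np

def look_at(eye, target, up=(0.0, 1.0, 0.0)):
    """Build a world-to-camera (view) matrix for a camera at `eye` looking at `target`."""
    eye, target, up = map(lambda v: np.asarray(v, dtype=float), (eye, target, up))
    fwd = target - eye
    fwd /= np.linalg.norm(fwd)
    right = np.cross(fwd, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, fwd)
    rot = np.stack([right, true_up, -fwd])   # rows: camera axes expressed in world coords
    view = np.eye(4)
    view[:3, :3] = rot
    view[:3, 3] = -rot @ eye
    return view

def to_view_centric(object_to_world, world_to_camera):
    """Re-express an object's 4x4 world transform in the camera's frame."""
    return world_to_camera @ object_to_world

# Example: a chair placed 2 m in front of a camera standing at the origin.
chair_to_world = np.eye(4)
chair_to_world[:3, 3] = [0.0, 0.0, -2.0]
view = look_at(eye=[0.0, 1.6, 0.0], target=[0.0, 1.6, -1.0])
print(to_view_centric(chair_to_world, view))
```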

Methodology

  1. Start with a pre‑trained 3D instance generator (e.g., a neural implicit model that maps a latent code to a mesh or SDF of a single object).
  2. Re‑program the generator by attaching a lightweight “scene head” that consumes the generator’s latent space and predicts a set of object transforms (position, orientation, scale) for a given camera view; a code sketch of this head follows the list.
  3. Spatial supervision comes from the generator itself:
    • The model’s implicit geometry provides cues about where an object can physically rest (support) or be placed without intersecting others (proximity).
    • Symmetry cues are extracted from the latent representation of each object.
  4. Training is fully feed‑forward: No iterative optimization or external physics engine is required. The loss functions penalize implausible placements (e.g., inter‑object collisions) and reward adherence to learned spatial priors.
  5. View‑centric coordinate system: Instead of anchoring everything to a global canonical space, the scene head predicts transforms relative to the current camera view, simplifying the mapping from latent space to observable scene layout.
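
The PyTorch sketch below illustrates steps 2–5 under assumed shapes: per‑object latents from the frozen instance generator and a camera pose feed a small scene head that predicts view‑centric transforms, and a bounding‑sphere overlap term stands in for the collision penalty. All names, dimensions, and the loss form are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SceneHead(nn.Module):
    """Lightweight head mapping per-object latents + camera pose to transforms."""
    def __init__(self, latent_dim=256, pose_dim=12, hidden=512):
        super().__init__()
        # 3 translation + 6D rotation + 1 log-scale = 10 outputs per object
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 10),
        )

    def forward(self, object_latents, camera_pose):
        # object_latents: (B, N, latent_dim); camera_pose: (B, pose_dim)
        B, N, _ = object_latents.shape
        pose = camera_pose.unsqueeze(1).expand(B, N, -1)
        out = self.mlp(torch.cat([object_latents, pose], dim=-1))
        translation = out[..., :3]     # view-centric position
        rotation_6d = out[..., 3:9]    # continuous rotation parameterization
        scale = out[..., 9:].exp()     # strictly positive scale
        return translation, rotation_6d, scale

def collision_penalty(translation, radii):
    """Penalize overlap between per-object bounding spheres -- a crude stand-in
    for the geometry-derived proximity cues described in step 3."""
    dist = torch.cdist(translation, translation)          # (B, N, N) pairwise distances
    min_sep = radii.unsqueeze(-1) + radii.unsqueeze(-2)   # required separation per pair
    overlap = torch.relu(min_sep - dist)
    mask = 1.0 - torch.eye(dist.shape[-1], device=dist.device)
    return (overlap * mask).sum(dim=(-1, -2)).mean()

# Example forward pass with dummy data: 2 scenes, 5 objects each.
head = SceneHead()
latents = torch.randn(2, 5, 256)
camera = torch.randn(2, 12)            # e.g. a flattened 3x4 extrinsic
t, r6d, s = head(latents, camera)
loss = collision_penalty(t, radii=torch.full((2, 5), 0.3))
print(t.shape, loss.item())
```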

Results & Findings

Metric | Baseline (canonical‑space) | I‑Scene (view‑centric)
Placement Accuracy (on unseen layouts) | 68 % | 84 %
Collision Rate (lower is better) | 12 % | 3 %
Symmetry Consistency (qualitative) | Often broken | Consistently preserved

  • Generalization: When evaluated on rooms with furniture arrangements never seen during training, I‑Scene placed objects correctly 84 % of the time, compared to ~70 % for prior methods.
  • Zero‑shot composition: Adding a brand‑new object class (e.g., a decorative lamp) to a scene without any scene‑level examples still resulted in plausible placement, thanks to the transferable spatial priors.
  • Ablation studies: Removing the view‑centric formulation caused a 10‑point drop in placement accuracy, confirming its importance.
  • Qualitative demos: The project page shows scenes where chairs automatically align under tables, lamps hover at appropriate heights, and symmetric pairs (e.g., nightstands) mirror each other without explicit symmetry labels.
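
This summary does not spell out how the metrics in the table are computed, so the following is one plausible, illustrative implementation: placement accuracy as the fraction of objects within a distance tolerance of a reference layout, and collision rate as the fraction of object pairs with overlapping axis‑aligned bounding boxes. The tolerance value and box format are assumptions.

```python
import numpy as np

def placement_accuracy(pred_pos, ref_pos, tol=0.25):
    """Fraction of objects placed within `tol` meters of a reference layout."""
    dist = np.linalg.norm(pred_pos - ref_pos, axis=-1)
    return float((dist <= tol).mean())

def collision_rate(boxes):
    """Fraction of object pairs whose AABBs (min_xyz, max_xyz) overlap."""
    n = len(boxes)
    pairs = colliding = 0
    for i in range(n):
        for j in range(i + 1, n):
            (lo_i, hi_i), (lo_j, hi_j) = boxes[i], boxes[j]
            if np.all(hi_i > lo_j) and np.all(hi_j > lo_i):
                colliding += 1
            pairs += 1
    return colliding / pairs if pairs else 0.0

# Example: two well-separated objects score perfect placement, zero collisions.
pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
ref = np.array([[0.1, 0.0, 0.0], [1.0, 0.0, 0.2]])
boxes = [(np.array([-0.3, 0.0, -0.3]), np.array([0.3, 0.8, 0.3])),
         (np.array([0.7, 0.0, -0.3]), np.array([1.3, 0.8, 0.3]))]
print(placement_accuracy(pred, ref), collision_rate(boxes))
```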

Practical Implications

  • Rapid prototyping for AR/VR: Developers can drop in any 3D asset (even one without a pre‑built scene dataset) and obtain a physically plausible layout instantly, cutting iteration time for interior‑design or game‑level tools.
  • Foundation‑model style APIs: I‑Scene can serve as a backend service—feed it a set of object meshes and a camera pose, receive a ready‑to‑render scene. This aligns with emerging “AI‑as‑a‑service” trends for 3D content creation; a hypothetical client sketch follows this list.
  • Robotics & simulation: Simulators that need realistic cluttered environments (e.g., for grasp planning) can generate diverse, physically consistent scenes on the fly, improving training data diversity without manual scene authoring.
  • Content pipelines for e‑commerce: Automatic arrangement of product models (chairs, tables, décor) into showroom‑style scenes can be done at scale, enhancing visual merchandising.
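
The snippet below sketches what such a layout-service call could look like: submit asset identifiers and a camera pose, receive per‑object transforms to render. The endpoint, payload schema, and field names are invented for illustration; I‑Scene does not ship this API.

```python
import json
import urllib.request

def request_scene_layout(server_url, asset_ids, camera_pose):
    """POST a layout request and return the predicted per-object transforms."""
    payload = json.dumps({
        "assets": asset_ids,        # e.g. mesh identifiers in your asset store
        "camera_pose": camera_pose, # e.g. a flattened 3x4 extrinsic matrix
    }).encode("utf-8")
    req = urllib.request.Request(
        server_url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)      # expected shape: {"transforms": [...]}

# Example call (assumes a hypothetical layout service is running locally):
# layout = request_scene_layout("http://localhost:8080/layout",
#                               ["chair_01", "table_03", "lamp_07"],
#                               camera_pose=[1,0,0,0, 0,1,0,1.6, 0,0,1,0])
```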

Limitations & Future Work

  • Reliance on a strong instance generator: If the underlying object model poorly captures geometry (e.g., low‑resolution SDF), spatial cues degrade.
  • No explicit physics engine: While collisions are minimized, subtle stability constraints (e.g., center‑of‑mass balance) are not modeled, which could matter for physics‑driven simulations; a toy stability check is sketched after this list.
  • View‑centric bias: The current formulation assumes a single dominant viewpoint; handling multi‑camera or omnidirectional setups may require extensions.
  • Future directions: The authors suggest integrating lightweight physics checks, expanding to dynamic scenes (moving objects), and exploring larger “foundation” models that jointly learn instance generation and scene reasoning in a single end‑to‑end network.
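
For pipelines that need the stability guarantees noted above today, a lightweight post‑hoc check can be bolted on. The sketch below tests whether an object's center of mass projects inside its ground footprint; the axis‑aligned footprint test is a simplifying assumption, not the authors' proposal.

```python
import numpy as np

def is_statically_stable(vertices, com=None, ground_eps=1e-3):
    """Check that the (x, z) projection of the center of mass lies within the
    axis-aligned footprint of the vertices touching the ground plane (y ~ min y)."""
    vertices = np.asarray(vertices, dtype=float)
    com = vertices.mean(axis=0) if com is None else np.asarray(com, dtype=float)
    ground_y = vertices[:, 1].min()
    base = vertices[np.abs(vertices[:, 1] - ground_y) < ground_eps]
    (x_lo, z_lo), (x_hi, z_hi) = base[:, [0, 2]].min(0), base[:, [0, 2]].max(0)
    return x_lo <= com[0] <= x_hi and z_lo <= com[2] <= z_hi

# Example: a unit box resting on the ground is stable; the same box with its
# center of mass shifted far to one side is not.
box = np.array([[x, y, z] for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)])
print(is_statically_stable(box))                       # True
print(is_statically_stable(box, com=[2.0, 0.5, 0.5]))  # False
```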

Authors

  • Lu Ling
  • Yunhao Ge
  • Yichen Sheng
  • Aniket Bera

Paper Information

  • arXiv ID: 2512.13683v1
  • Categories: cs.CV
  • Published: December 15, 2025