[Paper] SAGE: Scalable Agentic 3D Scene Generation for Embodied AI

Published: February 10, 2026 at 01:59 PM EST
5 min read
Source: arXiv

Overview

The paper introduces SAGE, a novel “agentic” pipeline that can automatically generate large‑scale, simulation‑ready 3‑D scenes tailored to a user‑specified embodied task (e.g., “pick up a bowl and place it on the table”). By combining generative models with learned critics that check semantic plausibility, visual realism, and physical stability, SAGE produces environments that are both diverse and immediately usable for training embodied AI agents, dramatically reducing the need for costly real‑world data collection.

Key Contributions

  • Task‑driven scene synthesis – Generates entire 3‑D environments conditioned on a high‑level task description rather than generic layout priors.
  • Agentic iterative refinement – An autonomous loop that selects and invokes specialized generators (layout, object placement, texture) and critics, self‑correcting until all constraints are satisfied.
  • Multi‑aspect critics – Learned evaluators for semantic consistency, photorealism, and physics validity that guide the refinement process.
  • SAGE‑10k dataset – A publicly released collection of 10,000 diverse, task‑aligned scenes ready for import into popular simulators (e.g., Habitat, AI2‑Thor).
  • Empirical scaling study – Demonstrates that policies trained solely on SAGE‑generated data improve monotonically with dataset size and generalize to unseen objects and layouts.

Methodology

  1. Task Parsing – A language model encodes the user’s natural‑language task, extracting the intent: target objects, actions, and spatial relations.
  2. Generator Suite
    • Layout Generator: predicts a plausible room layout and object bounding boxes.
    • Object Composer: selects 3‑D asset models, orients and scales them to fit the layout.
    • Texture/Lighting Generator: adds materials and illumination for visual realism.
  3. Critic Suite
    • Semantic Critic: checks that the chosen objects and their relations match the task description.
    • Visual Critic: a discriminator‑style network that scores photorealism.
    • Physical Critic: runs a fast physics simulation to ensure stability (no interpenetrations, objects rest on surfaces).
  4. Iterative Agentic Loop – The system evaluates the current scene with the critics, identifies the most violated constraint, and selects the appropriate generator to fix it. This loop repeats until all critics report scores above predefined thresholds.
  5. Export – The final scene is exported in a simulator‑compatible format (URDF/GLTF) together with task metadata for downstream policy training.
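The iterative loop in steps 1–4 can be sketched as follows. Everything here — the class names, the 0.8 thresholds, and the way scores improve when a generator runs — is illustrative scaffolding, not the paper’s implementation; the real system uses learned critics and full generative models in place of these toy stand‑ins:

```python
from dataclasses import dataclass, field

# Toy scene state: each critic reads a score from this dict. In the real
# system these would come from the learned semantic/visual/physical critics.
@dataclass
class Scene:
    scores: dict = field(default_factory=lambda: {
        "semantic": 0.4, "visual": 0.5, "physical": 0.3,
    })

# Hypothetical generator suite: each callable repairs one aspect of the scene.
def make_fixer(aspect, step=0.3):
    def fix(scene):
        scene.scores[aspect] = min(1.0, scene.scores[aspect] + step)
    return fix

GENERATORS = {a: make_fixer(a) for a in ("semantic", "visual", "physical")}
THRESHOLDS = {"semantic": 0.8, "visual": 0.8, "physical": 0.8}

def refine(scene, max_iters=20):
    """Agentic loop: score the scene, find the most violated constraint,
    invoke the matching generator, and repeat until all critics pass."""
    for _ in range(max_iters):
        # Positive margin = constraint still violated.
        margins = {a: THRESHOLDS[a] - scene.scores[a] for a in THRESHOLDS}
        worst = max(margins, key=margins.get)
        if margins[worst] <= 0:      # all critics at or above threshold
            break
        GENERATORS[worst](scene)     # self-correct the worst aspect first
    return scene
```

The key design point the sketch preserves is that the loop always attacks the single most violated constraint, so a fix for one aspect (say, physics) can be re‑checked by the other critics on the next iteration.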

Results & Findings

  • Quality metrics: Compared to rule‑based baselines, SAGE improves semantic plausibility by +23%, visual realism by +18% (measured with FID‑style scores), and physical stability by +31% (fewer interpenetrations).
  • Policy performance: Agents trained on SAGE‑generated environments achieve +12% higher success rates on the original task suite than agents trained on manually curated scenes, and they retain performance when transferred to novel objects not seen during training.
  • Scaling behavior: Success rates continue to rise as the dataset grows from 1k to 10k scenes, indicating that the synthetic data can replace expensive real‑world collection for many tasks.
  • Ablation: Removing any critic degrades the final scene quality dramatically (e.g., dropping the physical critic leads to a 45% increase in unstable scenes), confirming the necessity of the multi‑critic feedback loop.
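The interpenetration count that the physical critic penalizes can be illustrated with a toy check. The paper runs a fast physics simulation; this axis‑aligned bounding‑box (AABB) sketch is only a stand‑in for intuition, and the box layout below is invented for the example:

```python
def aabbs_overlap(a, b):
    """a, b: ((xmin, ymin, zmin), (xmax, ymax, zmax)).
    Strict overlap on all three axes; touching faces do not count,
    so an object resting exactly on a surface is not flagged."""
    return all(a[0][i] < b[1][i] and b[0][i] < a[1][i] for i in range(3))

def count_interpenetrations(boxes):
    """Pairwise interpenetration count across all placed objects;
    a physical critic would reject any scene where this is nonzero."""
    return sum(
        aabbs_overlap(boxes[i], boxes[j])
        for i in range(len(boxes))
        for j in range(i + 1, len(boxes))
    )
```

For example, a bowl whose bottom sits at the table’s top surface passes the check, while a bowl whose box dips below that surface is counted as one interpenetration.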

Practical Implications

  • Rapid prototyping – Developers can spin up task‑specific simulation environments with a single sentence, cutting weeks of manual scene authoring.
  • Data‑centric AI pipelines – Large‑scale synthetic datasets can be generated on demand, enabling continuous integration of new tasks and objects without manual labeling.
  • Cross‑simulator compatibility – Export formats work out‑of‑the‑box with Habitat, AI2‑Thor, and Unity‑based simulators, easing integration into existing RL training pipelines.
  • Safety & cost reduction – By training policies in SAGE‑generated worlds before real‑world deployment, companies can lower the risk of hardware damage and reduce the need for costly physical data collection rigs.
  • Customization – The agentic loop can be extended with domain‑specific generators (e.g., kitchen appliances, warehouse shelves) to target niche industries such as robotics for logistics or home assistance.

Limitations & Future Work

  • Asset library dependence – The realism of generated scenes is bounded by the diversity of the underlying 3‑D asset repository; rare or highly specialized objects may still need manual modeling.
  • Computation cost – The iterative refinement loop can be compute‑intensive for high‑resolution scenes, limiting on‑the‑fly generation for very large environments.
  • Generalization to dynamic tasks – Current work focuses on static layout generation; extending SAGE to synthesize dynamic elements (e.g., moving agents, fluid simulations) is an open direction.
  • User intent ambiguity – Ambiguous natural‑language prompts can lead to unintended scene configurations; future versions could incorporate clarification dialogs or multimodal inputs (sketches, reference images).

The authors provide code, demos, and the SAGE‑10k dataset on their project page, making it straightforward for developers to experiment and integrate the system into their own embodied AI workflows.

Authors

  • Hongchi Xia
  • Xuan Li
  • Zhaoshuo Li
  • Qianli Ma
  • Jiashu Xu
  • Ming-Yu Liu
  • Yin Cui
  • Tsung-Yi Lin
  • Wei-Chiu Ma
  • Shenlong Wang
  • Shuran Song
  • Fangyin Wei

Paper Information

  • arXiv ID: 2602.10116v1
  • Categories: cs.CV, cs.RO
  • Published: February 10, 2026