[Paper] SAGE: Scalable Agentic 3D Scene Generation for Embodied AI

Published: February 10, 2026 at 01:59 PM EST
5 min read
Source: arXiv

Overview

The paper introduces SAGE, a novel “agentic” pipeline that can automatically generate large‑scale, simulation‑ready 3‑D scenes tailored to a user‑specified embodied task (e.g., “pick up a bowl and place it on the table”). By combining generative models with learned critics that check semantic plausibility, visual realism, and physical stability, SAGE produces environments that are both diverse and immediately usable for training embodied AI agents, dramatically reducing the need for costly real‑world data collection.

Key Contributions

  • Task‑driven scene synthesis – Generates entire 3‑D environments conditioned on a high‑level task description rather than generic layout priors.
  • Agentic iterative refinement – An autonomous loop that selects and invokes specialized generators (layout, object placement, texture) and critics, self‑correcting until all constraints are satisfied.
  • Multi‑aspect critics – Learned evaluators for semantic consistency, photorealism, and physics validity that guide the refinement process.
  • SAGE‑10k dataset – A publicly released collection of 10,000 diverse, task‑aligned scenes ready for import into popular simulators (e.g., Habitat, AI2‑Thor).
  • Empirical scaling study – Demonstrates that policies trained solely on SAGE‑generated data improve monotonically with dataset size and generalize to unseen objects and layouts.

Methodology

  1. Task Parsing – A language model encodes the user’s natural‑language task, extracting the intent: target objects, actions, and spatial relations.
  2. Generator Suite
    • Layout Generator: predicts a plausible room layout and object bounding boxes.
    • Object Composer: selects 3‑D asset models, orients and scales them to fit the layout.
    • Texture/Lighting Generator: adds materials and illumination for visual realism.
  3. Critic Suite
    • Semantic Critic: checks that the chosen objects and their relations match the task description.
    • Visual Critic: a discriminator‑style network that scores photorealism.
    • Physical Critic: runs a fast physics simulation to ensure stability (no interpenetrations, objects rest on surfaces).
  4. Iterative Agentic Loop – The system evaluates the current scene with the critics, identifies the most violated constraint, and selects the appropriate generator to fix it. This loop repeats until all critics report scores above predefined thresholds.
  5. Export – The final scene is exported in a simulator‑compatible format (URDF/GLTF) together with task metadata for downstream policy training.
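The iterative loop in steps 1–4 can be sketched as follows. Everything here — the class names, the 0.8 thresholds, and the way scores improve when a generator runs — is illustrative scaffolding, not the paper’s implementation; the real system uses learned critics and full generative models in place of these toy stand‑ins:

```python
from dataclasses import dataclass, field

# Toy scene state: each critic reads a score from this dict. In the real
# system these would come from the learned semantic/visual/physical critics.
@dataclass
class Scene:
    scores: dict = field(default_factory=lambda: {
        "semantic": 0.4, "visual": 0.5, "physical": 0.3,
    })

# Hypothetical generator suite: each callable repairs one aspect of the scene.
def make_fixer(aspect, step=0.3):
    def fix(scene):
        scene.scores[aspect] = min(1.0, scene.scores[aspect] + step)
    return fix

GENERATORS = {a: make_fixer(a) for a in ("semantic", "visual", "physical")}
THRESHOLDS = {"semantic": 0.8, "visual": 0.8, "physical": 0.8}

def refine(scene, max_iters=20):
    """Agentic loop: score the scene, find the most violated constraint,
    invoke the matching generator, and repeat until all critics pass."""
    for _ in range(max_iters):
        # Positive margin = constraint still violated.
        margins = {a: THRESHOLDS[a] - scene.scores[a] for a in THRESHOLDS}
        worst = max(margins, key=margins.get)
        if margins[worst] <= 0:      # all critics at or above threshold
            break
        GENERATORS[worst](scene)     # self-correct the worst aspect first
    return scene
```

The key design point the sketch preserves is that the loop always attacks the single most violated constraint, so a fix for one aspect (say, physics) can be re‑checked by the other critics on the next iteration.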

Results & Findings

  • Quality metrics: Compared to rule‑based baselines, SAGE improves semantic plausibility by +23%, visual realism by +18% (measured with FID‑style scores), and physical stability by +31% (fewer interpenetrations).
  • Policy performance: Agents trained on SAGE‑generated environments achieve +12% higher success rates on the original task suite than agents trained on manually curated scenes, and they retain performance when transferred to novel objects not seen during training.
  • Scaling behavior: Success rates continue to rise as the dataset grows from 1k to 10k scenes, indicating that the synthetic data can replace expensive real‑world collection for many tasks.
  • Ablation: Removing any critic degrades the final scene quality dramatically (e.g., dropping the physical critic leads to a 45% increase in unstable scenes), confirming the necessity of the multi‑critic feedback loop.
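The interpenetration count that the physical critic penalizes can be illustrated with a toy check. The paper runs a fast physics simulation; this axis‑aligned bounding‑box (AABB) sketch is only a stand‑in for intuition, and the box layout below is invented for the example:

```python
def aabbs_overlap(a, b):
    """a, b: ((xmin, ymin, zmin), (xmax, ymax, zmax)).
    Strict overlap on all three axes; touching faces do not count,
    so an object resting exactly on a surface is not flagged."""
    return all(a[0][i] < b[1][i] and b[0][i] < a[1][i] for i in range(3))

def count_interpenetrations(boxes):
    """Pairwise interpenetration count across all placed objects;
    a physical critic would reject any scene where this is nonzero."""
    return sum(
        aabbs_overlap(boxes[i], boxes[j])
        for i in range(len(boxes))
        for j in range(i + 1, len(boxes))
    )
```

For example, a bowl whose bottom sits at the table’s top surface passes the check, while a bowl whose box dips below that surface is counted as one interpenetration.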

Practical Implications

  • Rapid prototyping – Developers can spin up task‑specific simulation environments with a single sentence, cutting weeks of manual scene authoring.
  • Data‑centric AI pipelines – Large‑scale synthetic datasets can be generated on demand, enabling continuous integration of new tasks and objects without manual labeling.
  • Cross‑simulator compatibility – Export formats work out‑of‑the‑box with Habitat, AI2‑Thor, and Unity‑based simulators, easing integration into existing RL training pipelines.
  • Safety & cost reduction – By training policies in SAGE‑generated worlds before real‑world deployment, companies can lower the risk of hardware damage and reduce the need for costly physical data collection rigs.
  • Customization – The agentic loop can be extended with domain‑specific generators (e.g., kitchen appliances, warehouse shelves) to target niche industries such as robotics for logistics or home assistance.

Limitations & Future Work

  • Asset library dependence – The realism of generated scenes is bounded by the diversity of the underlying 3‑D asset repository; rare or highly specialized objects may still need manual modeling.
  • Computation cost – The iterative refinement loop can be compute‑intensive for high‑resolution scenes, limiting on‑the‑fly generation for very large environments.
  • Generalization to dynamic tasks – Current work focuses on static layout generation; extending SAGE to synthesize dynamic elements (e.g., moving agents, fluid simulations) is an open direction.
  • User intent ambiguity – Ambiguous natural‑language prompts can lead to unintended scene configurations; future versions could incorporate clarification dialogs or multimodal inputs (sketches, reference images).

The authors provide code, demos, and the SAGE‑10k dataset on their project page, making it straightforward for developers to experiment and integrate the system into their own embodied AI workflows.

Authors

  • Hongchi Xia
  • Xuan Li
  • Zhaoshuo Li
  • Qianli Ma
  • Jiashu Xu
  • Ming-Yu Liu
  • Yin Cui
  • Tsung-Yi Lin
  • Wei-Chiu Ma
  • Shenlong Wang
  • Shuran Song
  • Fangyin Wei

Paper Information

  • arXiv ID: 2602.10116v1
  • Categories: cs.CV, cs.RO
  • Published: February 10, 2026