[Paper] Repurposing 3D Generative Model for Autoregressive Layout Generation

Published: April 17, 2026 at 01:59 PM EDT
4 min read
Source: arXiv - 2604.16299v1

Overview

The paper presents LaviGen, a framework that repurposes existing 3D generative models to create realistic 3D object layouts. Rather than translating a text prompt directly into a layout, LaviGen works in 3D space and treats layout generation as an autoregressive sequence, explicitly reasoning about geometry, object relations, and physical constraints. The result is faster, more physically plausible scene synthesis, an advance that could streamline content creation for AR/VR, game design, and robotics.

Key Contributions

  • Autoregressive 3‑D layout generation: Formulates layout synthesis as a step‑by‑step prediction problem that naturally captures spatial dependencies.
  • Repurposing of 3‑D diffusion models: Adapts a pre‑trained 3‑D generative model to accept scene‑level, object‑level, and instruction cues without retraining from scratch.
  • Dual‑guidance self‑rollout distillation: Introduces a distillation technique that simultaneously guides the model with geometric constraints and a learned rollout policy, boosting both speed and spatial accuracy.
  • Significant performance gains: Achieves ~19 % higher physical plausibility and ~65 % faster inference compared with the previous state‑of‑the‑art on the LayoutVLM benchmark.
  • Open‑source release: Provides code and pretrained weights, enabling immediate experimentation and integration.

Methodology

  1. Problem formulation – LaviGen treats a 3‑D scene as an ordered list of objects. At each step, the model predicts the next object’s class, position, orientation, and scale conditioned on the already placed objects.
  2. Base model – A standard 3‑D diffusion model (trained on point clouds/voxel grids) is repurposed. The authors inject three types of information:
    • Scene context (overall room size, floor plan).
    • Object context (previously placed items).
    • Instruction context (high‑level user intent, e.g., “place a chair near the table”).
  3. Dual‑guidance self‑rollout distillation
    • Geometric guidance: A lightweight physics engine checks for collisions and stability, feeding back corrective signals.
    • Self‑rollout guidance: The model rolls out a short future sequence during training, and a teacher network distills the rollout’s “good” decisions back into the student model.
      This two‑pronged guidance lets the autoregressive decoder learn to respect physical constraints while staying computationally cheap.
  4. Training & inference – The system is fine‑tuned on layout datasets (e.g., LayoutVLM) using a combination of diffusion loss, autoregressive cross‑entropy, and the distillation loss. At inference time, the model generates layouts in a single forward pass per object, dramatically cutting latency.
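The autoregressive loop in steps 1–3 can be sketched in plain Python. All names here are hypothetical: `predict_next_object` stands in for the repurposed 3D generative model, and a simple axis-aligned bounding-box test stands in for the paper's lightweight physics engine (which also checks stability, not just collisions):

```python
import random

def collides(box_a, box_b):
    """Axis-aligned bounding-box overlap test in 3-D.
    Each box is a (center, size) pair of 3-tuples."""
    (ca, sa), (cb, sb) = box_a, box_b
    return all(abs(ca[i] - cb[i]) < (sa[i] + sb[i]) / 2 for i in range(3))

def predict_next_object(scene_ctx, placed, instruction):
    """Stand-in for the repurposed generative model: propose the next
    object's class, position, and size given scene, object, and
    instruction context (here, a random proposal)."""
    cls = random.choice(["chair", "table", "lamp"])
    center = tuple(random.uniform(0.5, s - 0.5) for s in scene_ctx["room_size"])
    size = (0.6, 0.6, 0.9)
    return cls, center, size

def generate_layout(scene_ctx, instruction, n_objects, max_retries=20):
    """Autoregressive layout generation: place objects one at a time,
    rejecting proposals that collide with already-placed objects
    (a crude proxy for the paper's geometric guidance)."""
    placed = []
    for _ in range(n_objects):
        for _ in range(max_retries):
            cls, center, size = predict_next_object(scene_ctx, placed, instruction)
            if all(not collides((center, size), (c, s)) for _, c, s in placed):
                placed.append((cls, center, size))
                break
    return placed

layout = generate_layout({"room_size": (5.0, 4.0, 3.0)},
                         "place a chair near the table", 4)
```

In the paper the rejection-and-retry step is replaced by learned guidance (the physics feedback is distilled into the model), so inference needs only one forward pass per object rather than repeated sampling.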
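Step 4's training objective combines three terms. A minimal sketch of the weighted sum, where the weight `w_distill` is an assumption (the paper's exact weighting scheme is not reproduced here):

```python
def layout_training_loss(l_diffusion, l_ar_ce, l_distill, w_distill=0.5):
    """Combined fine-tuning objective: diffusion loss on object
    attributes, autoregressive cross-entropy on the placement
    sequence, and the dual-guidance distillation loss.
    The distillation weight is hypothetical."""
    return l_diffusion + l_ar_ce + w_distill * l_distill
```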

Results & Findings

| Metric | LaviGen | Prior SOTA | Δ |
| --- | --- | --- | --- |
| Physical plausibility (collision‑free, stable) | 0.84 | 0.71 | +19 % |
| Layout quality (IoU with ground truth) | 0.68 | 0.61 | +11 % |
| Inference time per scene | 0.42 s | 1.20 s | –65 % |
| Diversity (unique layouts per prompt) | 0.73 | 0.66 | +10 % |
  • Physical plausibility: LaviGen’s dual‑guidance dramatically reduces inter‑object collisions and unrealistic floating objects.
  • Speed: Autoregressive rollout plus distillation cuts the number of diffusion steps needed, delivering near‑real‑time generation.
  • Generalization: The model works across varied indoor environments (rooms, offices, kitchens) without per‑scene re‑training.

Practical Implications

  • Content pipelines for AR/VR and games – Designers can quickly prototype room layouts by providing high‑level instructions, letting LaviGen fill in physically plausible object placements.
  • Robotics and simulation – Autonomous agents need realistic test environments; LaviGen can generate diverse, collision‑free scenes for training perception and manipulation models.
  • E‑commerce & interior design tools – Users can describe a desired arrangement (“a sofa facing a TV”) and receive an instantly renderable 3‑D layout, accelerating the visualization workflow.
  • Reduced compute budget – The 65 % speedup means cloud‑based layout services can serve more requests per GPU, lowering operational costs.

Limitations & Future Work

  • Scene complexity ceiling – Experiments focused on modestly sized indoor rooms; scaling to large, multi‑room environments may require hierarchical planning.
  • Dependency on pre‑trained diffusion model – Quality is bounded by the underlying 3‑D generator; improvements in diffusion backbones could further boost results.
  • Limited textual grounding – While the instruction channel guides placement, nuanced language (e.g., “a cozy reading nook”) is not fully captured yet.

Future research directions include hierarchical autoregressive generation for sprawling spaces, tighter integration of language models for richer semantic control, and extending the framework to outdoor or mixed‑reality scenes.

Authors

  • Haoran Feng
  • Yifan Niu
  • Zehuan Huang
  • Yang-Tian Sun
  • Chunchao Guo
  • Yuxin Peng
  • Lu Sheng

Paper Information

  • arXiv ID: 2604.16299v1
  • Categories: cs.CV
  • Published: April 17, 2026
