[Paper] SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis

Published: April 14, 2026 at 01:59 PM EDT

Source: arXiv - 2604.13035v1

Overview

The paper introduces SceneCritic, a symbolic, rule‑based evaluator that checks the plausibility of 3‑D indoor scene layouts at the floor‑plan level. By grounding its constraints in a newly built spatial ontology (SceneOnto), SceneCritic can automatically flag semantic, orientation, and geometric errors—something that current LLM/VLM judges struggle with because they depend on rendered images and are highly sensitive to viewpoint and prompt wording.

Key Contributions

SceneOnto – a unified spatial ontology compiled from 3D‑FRONT, ScanNet, and Visual Genome that encodes common indoor object relationships, orientations, and size constraints.
SceneCritic – a symbolic evaluator that traverses SceneOnto to verify layout coherence, delivering fine‑grained, object‑level diagnostics instead of a single scalar score.
Critic Modalities Benchmark – an experimental test‑bed that compares three feedback loops for iterative scene synthesis:
1. rule‑based collision constraints,
2. text‑only LLM critic,
3. image‑based VLM critic.
Human‑Alignment Study – empirical evidence that SceneCritic’s scores correlate more closely with human judgments than those of VLM‑based evaluators (Spearman’s ρ of 0.78 versus 0.45 in the results table).
Insightful Findings – text‑only LLMs surprisingly outperform VLMs on pure semantic layout quality, while VLM‑driven refinement excels at fixing orientation and spatial alignment issues.

Methodology

1. Data Fusion & Ontology Construction

Extracted object co‑occurrence, typical orientations (e.g., “sofa faces TV”), and size statistics from three large datasets.
Normalized and merged these priors into a graph‑structured ontology (SceneOnto) where nodes are object categories and edges encode relational constraints (e.g., “must be adjacent”, “cannot overlap”).
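
To make the graph structure concrete, here is a minimal sketch of an ontology whose nodes are object categories and whose edges carry a constraint label. The class names, the edge labels ("faces", "adjacent_to"), and every entry other than "sofa faces TV" are assumptions made for illustration; the summary does not describe SceneOnto's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One directed edge of the ontology graph (hypothetical schema)."""
    source: str   # object category, e.g. "sofa"
    target: str   # related category, e.g. "tv"
    kind: str     # constraint label, e.g. "faces", "adjacent_to"


class Ontology:
    """Toy graph: nodes are categories, edges are relational constraints."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add(self, source, kind, target):
        self._edges[source].append(Relation(source, target, kind))

    def constraints_for(self, category):
        """Return every outgoing constraint for a category (empty if unknown)."""
        return list(self._edges.get(category, []))


def build_demo_ontology():
    onto = Ontology()
    onto.add("sofa", "faces", "tv")              # example taken from the text
    onto.add("nightstand", "adjacent_to", "bed")  # illustrative assumption
    onto.add("chair", "adjacent_to", "desk")      # illustrative assumption
    return onto


onto = build_demo_ontology()
for rel in onto.constraints_for("sofa"):
    print(f"{rel.source} --{rel.kind}--> {rel.target}")
```

An evaluator can then look up the outgoing edges for each object category in a layout and test only the constraints that apply to it.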

2. Symbolic Evaluation Engine (SceneCritic)

Input: a floor‑plan layout expressed as a list of objects with class, position, and orientation.
Checks three families of constraints:
- Semantic – is the object plausible in the given room context?
- Orientation – are directional relationships satisfied (e.g., “bed head against wall”)?
- Geometric – are there collisions or impossible size ratios?
Output: a structured report containing per‑object pass/fail flags and the specific violated rule.
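
The three constraint families and the per-object report can be sketched as follows. This is not the paper's implementation: the rule tables `ALLOWED` and `MUST_FACE`, the 30-degree tolerance, and the axis-aligned footprint test (which ignores rotation) are all simplifying assumptions chosen to keep the example short.

```python
import math
from dataclasses import dataclass


@dataclass
class Obj:
    name: str
    category: str
    x: float        # centre position on the floor plan (metres)
    y: float
    w: float        # footprint extent along x and y (axis-aligned, metres)
    d: float
    yaw_deg: float  # facing direction, 0 = +x axis, counter-clockwise


# Hypothetical rule tables; a real system would derive these from the ontology.
ALLOWED = {"living_room": {"sofa", "tv", "coffee_table"}}
MUST_FACE = {"sofa": "tv"}


def overlaps(a, b):
    """Axis-aligned footprint overlap test (rotation is ignored)."""
    return abs(a.x - b.x) * 2 < a.w + b.w and abs(a.y - b.y) * 2 < a.d + b.d


def faces(a, b, tol_deg=30.0):
    """True if a's facing direction points at b within tol_deg."""
    bearing = math.degrees(math.atan2(b.y - a.y, b.x - a.x))
    diff = (bearing - a.yaw_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= tol_deg


def evaluate(room, objects):
    """Return {object name: list of violated rules}; an empty list means pass."""
    report = {o.name: [] for o in objects}
    for o in objects:
        if o.category not in ALLOWED.get(room, set()):
            report[o.name].append(f"semantic: {o.category} implausible in {room}")
        target_cat = MUST_FACE.get(o.category)
        if target_cat:
            targets = [t for t in objects if t.category == target_cat]
            if targets and not any(faces(o, t) for t in targets):
                report[o.name].append(f"orientation: {o.category} should face {target_cat}")
    for i, a in enumerate(objects):
        for b in objects[i + 1:]:
            if overlaps(a, b):
                report[a.name].append(f"geometric: overlaps {b.name}")
                report[b.name].append(f"geometric: overlaps {a.name}")
    return report


scene = [
    Obj("sofa_1", "sofa", 0.0, 0.0, 2.0, 1.0, yaw_deg=180.0),         # faces away from the TV
    Obj("tv_1", "tv", 3.0, 0.0, 1.2, 0.3, yaw_deg=180.0),
    Obj("table_1", "coffee_table", 0.5, 0.0, 1.0, 0.6, yaw_deg=0.0),  # collides with the sofa
    Obj("tub_1", "bathtub", 0.0, 3.0, 1.7, 0.8, yaw_deg=0.0),         # wrong room type
]
for name, issues in evaluate("living_room", scene).items():
    print(name, "PASS" if not issues else issues)
```

Returning a list of named violations per object, rather than one number per scene, is what makes this style of report usable as a debugging aid.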

3. Iterative Refinement Test‑bed

Rule‑based critic – feeds back collision violations as hard constraints.
LLM critic – serializes the layout into natural‑language statements; an LLM suggests edits.
VLM critic – renders the layout from multiple viewpoints, feeds images to a vision‑language model, and receives corrective suggestions.
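
As an illustration of the text-only route, the sketch below serializes a layout into natural-language statements and wraps them in a critic prompt. The sentence template, the four-way compass rounding, and the prompt wording are assumptions, not the paper's format, and no model is actually called.

```python
def describe(objects):
    """Turn a layout into plain-English statements, one per object."""
    # yaw convention assumed here: 0 degrees = east (+x), counter-clockwise
    compass = ["east", "north", "west", "south"]
    lines = []
    for o in objects:
        heading = compass[round(o["yaw_deg"] / 90.0) % 4]
        lines.append(
            f"The {o['category']} '{o['name']}' is at "
            f"({o['x']:.1f} m, {o['y']:.1f} m) and faces {heading}."
        )
    return "\n".join(lines)


def build_prompt(room, objects):
    """Assemble a critic prompt; the wording is purely illustrative."""
    return (
        f"Room type: {room}\n"
        f"{describe(objects)}\n"
        "List any implausible placements and suggest one edit for each."
    )


layout = [
    {"name": "sofa_1", "category": "sofa", "x": 0.0, "y": 0.0, "yaw_deg": 0.0},
    {"name": "tv_1", "category": "tv", "x": 3.0, "y": 0.0, "yaw_deg": 180.0},
]
print(build_prompt("living_room", layout))
```

The resulting string would be sent to whichever LLM serves as the critic, and its suggested edits parsed back into layout changes.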

4. Evaluation

Collected human ratings on a subset of generated scenes.
Measured correlation (Spearman’s ρ) between each evaluator’s scores and human judgments.
Compared final layout quality after a fixed number of refinement iterations per critic modality.
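
For readers who want to reproduce the correlation step, Spearman's ρ is the Pearson correlation of the two rank vectors, with tied values given the mean of their ranks. The sketch below implements that definition in plain Python; the scores and ratings are made-up numbers for illustration, not data from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j across a run of equal values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Made-up numbers for illustration only; not data from the paper.
evaluator_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
human_ratings = [5, 2, 3, 1, 4]
print(round(spearman_rho(evaluator_scores, human_ratings), 3))
```

With these toy inputs the two rankings disagree on only one adjacent pair, so the script prints 0.9. Because ρ depends only on rank order, it suits comparisons where evaluator scores and human ratings live on different scales.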

Results & Findings

Evaluator	Correlation with Human Scores	Semantic Quality ↑	Orientation / Geometry ↑
SceneCritic (symbolic)	0.78	0.81	0.74
VLM‑based evaluator	0.45	0.48	0.42
LLM‑only (text)	0.62	0.85	0.55
VLM‑driven refinement (final layout)	–	0.78	0.81

Alignment: SceneCritic’s scores align substantially better with human perception than any VLM‑only metric.
Semantic Edge: Pure text LLMs (e.g., GPT‑4) capture object‑type plausibility without visual input, outperforming VLMs on that dimension.
Orientation Fixes: When the critic operates on rendered images, the refinement loop corrects facing directions and collision issues more effectively than it does with rule‑only feedback.
Iterative Gains: After three refinement cycles, VLM‑driven feedback yields the highest combined semantic‑orientation score, while rule‑based feedback quickly eliminates gross collisions but plateaus on higher‑level semantics.
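
The rule-based loop in the last item can be pictured as a fixed number of critique-then-edit cycles. The sketch below is an assumption-laden toy, not the paper's procedure: the critic reports overlapping axis-aligned footprints, and the edit step simply pushes each colliding pair apart along the axis of least penetration. Walls and room bounds are ignored, and convergence is not guaranteed in crowded scenes.

```python
def overlap_depth(a, b):
    """Penetration depth along x and y for two axis-aligned footprints."""
    dx = (a["w"] + b["w"]) / 2.0 - abs(a["x"] - b["x"])
    dy = (a["d"] + b["d"]) / 2.0 - abs(a["y"] - b["y"])
    return dx, dy


def collisions(objects):
    """Critic step: list index pairs whose footprints overlap."""
    pairs = []
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            dx, dy = overlap_depth(objects[i], objects[j])
            if dx > 0 and dy > 0:
                pairs.append((i, j))
    return pairs


def refine(objects, cycles=3, margin=0.05):
    """Run a fixed number of cycles; return the collision count before each
    cycle followed by the count after the last one."""
    history = []
    for _ in range(cycles):
        pairs = collisions(objects)
        history.append(len(pairs))
        for i, j in pairs:
            a, b = objects[i], objects[j]
            dx, dy = overlap_depth(a, b)
            if dx <= 0 or dy <= 0:
                continue  # already separated by an earlier edit this cycle
            # Edit step: push the pair apart along the axis of least penetration.
            axis, depth = ("x", dx) if dx <= dy else ("y", dy)
            sign = 1.0 if b[axis] >= a[axis] else -1.0
            shift = (depth + margin) / 2.0
            a[axis] -= sign * shift
            b[axis] += sign * shift
    history.append(len(collisions(objects)))
    return history


layout = [
    {"name": "sofa_1", "x": 0.0, "y": 0.0, "w": 2.0, "d": 1.0},
    {"name": "table_1", "x": 0.5, "y": 0.0, "w": 1.0, "d": 0.6},
    {"name": "lamp_1", "x": 3.0, "y": 3.0, "w": 0.4, "d": 0.4},
]
history = refine(layout, cycles=3)
print(history)
```

On this toy layout the single collision is resolved in the first cycle and the remaining cycles change nothing, which mirrors the plateau described above: a collision-only critic has nothing left to say once footprints no longer overlap.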

Practical Implications

Robust Automated QA for Asset Pipelines – game studios and AR/VR developers can plug SceneCritic into procedural generation pipelines to catch impossible object placements before costly rendering or physics simulation.
Debug‑Friendly Feedback – because SceneCritic returns explicit rule violations, developers receive actionable diagnostics (“sofa overlaps wall”, “lamp not facing desk”) instead of opaque confidence scores.
Hybrid Generation Strategies – a two‑stage approach is suggested: use an LLM to draft a semantically sound layout, then hand it to a VLM‑based refinement loop for fine‑grained orientation and collision fixes.
Dataset‑Driven Ontology Updates – the ontology can be refreshed with new domain‑specific priors (e.g., office vs. residential) to tailor the evaluator for specialized interior‑design tools.
Benchmark Standardization – SceneCritic offers a reproducible, viewpoint‑independent metric that could become a community benchmark for 3‑D scene synthesis research, reducing reliance on noisy human‑in‑the‑loop evaluations.

Limitations & Future Work

Ontology Coverage – SceneOnto is limited to the object categories present in the three source datasets; exotic or custom assets may lack appropriate constraints.
Floor‑Plan Focus – the evaluator operates at the 2‑D layout level and does not directly assess 3‑D details such as mesh quality, material realism, or lighting.
Scalability of Textual Conversion – translating large, complex scenes into natural‑language prompts for LLM critics can become verbose and may lose nuance.
Future Directions – extending the ontology to incorporate functional affordances (e.g., “chair must be reachable from desk”), integrating multi‑modal feedback loops (simultaneous LLM + VLM), and exploring learned symbolic constraints that adapt from user‑generated correction data.

Authors

Kathakoli Sengupta
Kai Ao
Paola Cascante‑Bonilla

Paper Information

arXiv ID: 2604.13035v1
Categories: cs.CV, cs.CL
Published: April 14, 2026

