[Paper] SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis
Source: arXiv - 2604.13035v1
Overview
The paper introduces SceneCritic, a symbolic, rule‑based evaluator that checks the plausibility of 3‑D indoor scene layouts at the floor‑plan level. By grounding its constraints in a newly built spatial ontology (SceneOnto), SceneCritic can automatically flag semantic, orientation, and geometric errors. Current LLM/VLM judges struggle with this task because they depend on rendered images and are highly sensitive to viewpoint and prompt wording.
Key Contributions
- SceneOnto – a unified spatial ontology compiled from 3D‑FRONT, ScanNet, and Visual Genome that encodes common indoor object relationships, orientations, and size constraints.
- SceneCritic – a symbolic evaluator that traverses SceneOnto to verify layout coherence, delivering fine‑grained, object‑level diagnostics instead of a single scalar score.
- Critic Modalities Benchmark – an experimental test‑bed that compares three feedback loops for iterative scene synthesis:
  - rule‑based collision constraints,
  - text‑only LLM critic,
  - image‑based VLM critic.
- Human‑Alignment Study – empirical evidence that SceneCritic’s scores correlate more closely with human judgments than those of existing VLM‑based evaluators (0.78 vs. 0.45 in the results table).
- Insightful Findings – text‑only LLMs surprisingly outperform VLMs on pure semantic layout quality, while VLM‑driven refinement excels at fixing orientation and spatial alignment issues.
Methodology
1. Data Fusion & Ontology Construction
- Extracted object co‑occurrence, typical orientations (e.g., “sofa faces TV”), and size statistics from the three source datasets (3D‑FRONT, ScanNet, and Visual Genome).
- Normalized and merged these priors into a graph‑structured ontology (SceneOnto) where nodes are object categories and edges encode relational constraints (e.g., “must be adjacent”, “cannot overlap”).
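The graph structure described above can be pictured with a small sketch. This is only an illustration under assumptions: the class names `Relation` and `Ontology`, the relation labels, and the example edges are hypothetical and are not taken from the paper's actual SceneOnto schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """A directed edge between two object categories (hypothetical schema)."""
    source: str
    target: str
    kind: str  # e.g. "faces", "adjacent", "no_overlap"


@dataclass
class Ontology:
    """Toy stand-in for a graph-structured ontology: nodes are categories."""
    categories: set = field(default_factory=set)
    relations: list = field(default_factory=list)

    def add_relation(self, source: str, target: str, kind: str) -> None:
        # Adding an edge also registers both endpoint categories as nodes.
        self.categories.update((source, target))
        self.relations.append(Relation(source, target, kind))

    def constraints_for(self, category: str) -> list:
        # All outgoing relational constraints for one category.
        return [r for r in self.relations if r.source == category]


# Illustrative edges only; real priors would come from dataset statistics.
onto = Ontology()
onto.add_relation("sofa", "tv", "faces")
onto.add_relation("nightstand", "bed", "adjacent")
onto.add_relation("sofa", "table", "no_overlap")

for rel in onto.constraints_for("sofa"):
    print(f"{rel.source} --{rel.kind}--> {rel.target}")
```

Storing edges as plain records keeps the lookup step trivial: an evaluator only needs the outgoing constraints of each object's category.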
2. Symbolic Evaluation Engine (SceneCritic)
- Input: a floor‑plan layout expressed as a list of objects with class, position, and orientation.
- Checks three families of constraints:
  - Semantic – is the object plausible in the given room context?
  - Orientation – are directional relationships satisfied (e.g., “bed head against wall”)?
  - Geometric – are there collisions or impossible size ratios?
- Output: a structured report containing per‑object pass/fail flags and the specific violated rule.
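A minimal sketch of how the three constraint families might be checked is shown below. It is not the paper's implementation: the rule tables `ALLOWED` and `MUST_FACE`, the axis-aligned footprint fields `w` and `d`, the 30-degree facing tolerance, and the example layout are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Obj:
    name: str
    cls: str
    x: float    # footprint centre (metres)
    y: float
    w: float    # footprint extent along x; axis-aligned box assumed
    d: float    # footprint extent along y
    yaw: float  # facing direction in radians, 0 = +x axis


# Hypothetical rules; a real evaluator would read these from the ontology.
ALLOWED = {"living_room": {"sofa", "tv", "table"}}
MUST_FACE = {"sofa": "tv"}


def overlaps(a: Obj, b: Obj) -> bool:
    # Axis-aligned bounding-box intersection test on the floor plan.
    return abs(a.x - b.x) < (a.w + b.w) / 2 and abs(a.y - b.y) < (a.d + b.d) / 2


def faces(a: Obj, b: Obj, tol: float = math.radians(30)) -> bool:
    # True if a's facing direction points at b within the angular tolerance.
    bearing = math.atan2(b.y - a.y, b.x - a.x)
    diff = (bearing - a.yaw + math.pi) % (2 * math.pi) - math.pi
    return abs(diff) <= tol


def evaluate(room: str, objs: list) -> dict:
    """Return a per-object list of violated rules (empty list = pass)."""
    report = {o.name: [] for o in objs}
    for o in objs:
        if o.cls not in ALLOWED.get(room, set()):
            report[o.name].append(f"semantic: {o.cls} implausible in {room}")
        target_cls = MUST_FACE.get(o.cls)
        if target_cls:
            targets = [t for t in objs if t.cls == target_cls]
            if targets and not any(faces(o, t) for t in targets):
                report[o.name].append(f"orientation: {o.cls} should face {target_cls}")
    for i, a in enumerate(objs):
        for b in objs[i + 1:]:
            if overlaps(a, b):
                report[a.name].append(f"geometric: overlaps {b.name}")
                report[b.name].append(f"geometric: overlaps {a.name}")
    return report


layout = [
    Obj("sofa_1", "sofa", 0.0, 0.0, 2.0, 1.0, math.pi),   # faces away from the TV
    Obj("tv_1", "tv", 3.0, 0.0, 1.0, 0.3, math.pi),
    Obj("table_1", "table", 0.5, 0.2, 1.0, 1.0, 0.0),     # intersects the sofa
    Obj("bathtub_1", "bathtub", 5.0, 5.0, 1.5, 0.8, 0.0), # wrong room type
]
report = evaluate("living_room", layout)
for name, violations in report.items():
    print(name, "PASS" if not violations else violations)
```

Returning a list of violated rules per object, rather than one number, mirrors the structured report described above and makes each failure traceable to a specific rule.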
3. Iterative Refinement Test‑bed
- Rule‑based critic – feeds back collision violations as hard constraints.
- LLM critic – serializes the layout into natural‑language statements; an LLM suggests edits.
- VLM critic – renders the layout from multiple viewpoints, feeds images to a vision‑language model, and receives corrective suggestions.
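For the text-only critic, the serialization step could look roughly like the sketch below. The sentence template, the dictionary keys, and the closing instruction are assumptions chosen for illustration; the paper's actual prompt format is not given here, and no LLM call is made.

```python
def describe(obj: dict) -> str:
    # One natural-language statement per object (hypothetical template).
    return (
        f"A {obj['cls']} is at ({obj['x']:.1f}, {obj['y']:.1f}) m, "
        f"rotated {obj['yaw_deg']:.0f} degrees."
    )


def layout_to_prompt(room: str, objs: list) -> str:
    # Concatenate the room context, the object statements, and a request for edits.
    lines = [f"Room type: {room}."]
    lines += [describe(o) for o in objs]
    lines.append("List any implausible placements and suggest edits.")
    return "\n".join(lines)


# Made-up layout used only to exercise the serializer.
layout = [
    {"cls": "sofa", "x": 1.0, "y": 2.5, "yaw_deg": 90},
    {"cls": "tv", "x": 1.0, "y": 0.2, "yaw_deg": 270},
]
prompt = layout_to_prompt("living room", layout)
print(prompt)
```

The prompt grows by one line per object, which is the verbosity concern for large scenes.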
4. Evaluation
- Collected human ratings on a subset of generated scenes.
- Measured correlation (Spearman’s ρ) between each evaluator’s scores and human judgments.
- Compared final layout quality after a fixed number of refinement iterations per critic modality.
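The correlation step can be reproduced with a few lines of plain Python. The sketch below computes Spearman’s ρ as the Pearson correlation of average ranks; the score lists are made-up placeholders, not data from the paper.

```python
def average_ranks(values: list) -> list:
    # Assign 1-based ranks; tied values share the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs: list, ys: list) -> float:
    # Spearman's rho = Pearson correlation computed on the rank vectors.
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Placeholder values for five scenes (not results from the paper).
evaluator_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
human_ratings = [5, 2, 3, 1, 4]
rho = spearman_rho(evaluator_scores, human_ratings)
print(f"Spearman's rho = {rho:.2f}")
```

Because ρ depends only on rank order, it is unaffected by the different scales an evaluator and a human rater may use.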
Results & Findings
| Evaluator | Correlation with Human Scores (Spearman’s ρ) | Semantic Quality ↑ | Orientation / Geometry ↑ |
|---|---|---|---|
| SceneCritic (symbolic) | 0.78 | 0.81 | 0.74 |
| VLM‑based evaluator | 0.45 | 0.48 | 0.42 |
| LLM‑only (text) | 0.62 | 0.85 | 0.55 |
| VLM‑driven refinement (final layout) | – | 0.78 | 0.81 |
- Alignment: SceneCritic’s scores align substantially better with human perception than any VLM‑only metric.
- Semantic Edge: Pure text LLMs (e.g., GPT‑4) capture object‑type plausibility without visual input, outperforming VLMs on that dimension.
- Orientation Fixes: Critics that operate on rendered images correct facing directions and collision issues more effectively than rule‑only feedback.
- Iterative Gains: After three refinement cycles, VLM‑driven feedback yields the highest combined semantic‑orientation score, while rule‑based feedback quickly eliminates gross collisions but plateaus on higher‑level semantics.
Practical Implications
- Robust Automated QA for Asset Pipelines – game studios and AR/VR developers can plug SceneCritic into procedural generation pipelines to catch impossible object placements before costly rendering or physics simulation.
- Debug‑Friendly Feedback – because SceneCritic returns explicit rule violations, developers receive actionable diagnostics (“sofa overlaps wall”, “lamp not facing desk”) instead of opaque confidence scores.
- Hybrid Generation Strategies – a two‑stage approach is suggested: use an LLM to draft a semantically sound layout, then hand it to a VLM‑based refinement loop for fine‑grained orientation and collision fixes.
- Dataset‑Driven Ontology Updates – the ontology can be refreshed with new domain‑specific priors (e.g., office vs. residential) to tailor the evaluator for specialized interior‑design tools.
- Benchmark Standardization – SceneCritic offers a reproducible, viewpoint‑independent metric that could become a community benchmark for 3‑D scene synthesis research, reducing reliance on noisy human‑in‑the‑loop evaluations.
Limitations & Future Work
- Ontology Coverage – SceneOnto is limited to the object categories present in the three source datasets; exotic or custom assets may lack appropriate constraints.
- Floor‑Plan Focus – the evaluator operates at the 2‑D layout level and does not directly assess 3‑D details such as mesh quality, material realism, or lighting.
- Scalability of Textual Conversion – translating large, complex scenes into natural‑language prompts for LLM critics can become verbose and may lose nuance.
- Future Directions – extending the ontology to incorporate functional affordances (e.g., “chair must be reachable from desk”), integrating multi‑modal feedback loops (simultaneous LLM + VLM), and exploring learned symbolic constraints that adapt from user‑generated correction data.
Authors
- Kathakoli Sengupta
- Kai Ao
- Paola Cascante‑Bonilla
Paper Information
- arXiv ID: 2604.13035v1
- Categories: cs.CV, cs.CL
- Published: April 14, 2026