[Paper] SA-IQA: Redefining Image Quality Assessment for Spatial Aesthetics with Multi-Dimensional Rewards
Source: arXiv - 2512.05098v1
Overview
SA‑IQA addresses a gap in image‑quality assessment: judging the aesthetic appeal of AI‑generated interior scenes. The authors define a “spatial aesthetics” framework covering layout, harmony, lighting, and distortion, build what they describe as the first large‑scale benchmark for it (SA‑BENCH), and train an evaluation model that can serve as a reward signal for generative pipelines.
Key Contributions
- Spatial Aesthetics Paradigm – Introduces a four‑dimensional view of interior‑scene quality (layout, harmony, lighting, distortion).
- SA‑BENCH Dataset – 18 K interior images with ~50 K fine‑grained human annotations covering the four dimensions.
- SA‑IQA Model – Fine‑tunes a multi‑modal large language model (MLLM) and fuses the four dimension scores into a single, interpretable reward.
- Downstream Integration – Demonstrates two practical uses:
  - As a reward in GRPO‑based (Group Relative Policy Optimization) reinforcement learning to steer AI‑generated content (AIGC) pipelines.
  - As a “Best‑of‑N” selector to pick the highest‑quality outputs from a batch.
- Open‑Source Release – Code, model weights, and the benchmark will be publicly released to foster reproducibility and community adoption.
Methodology
- Defining the Dimensions – The authors decompose interior aesthetics into four measurable aspects:
  - Layout: spatial arrangement of furniture and objects.
  - Harmony: color and style consistency.
  - Lighting: exposure, shadows, and overall illumination quality.
  - Distortion: geometric artifacts such as warping or stretching.
- Dataset Construction (SA‑BENCH) –
  - Collected 18 K diverse interior images (real photos, synthetic scenes, and AI‑generated images).
  - Crowdsourced ~50 K annotations, with images rated from 1 to 5 on each dimension plus an overall aesthetic score.
- Model Architecture (SA‑IQA) –
  - Starts from a pre‑trained multi‑modal large language model (e.g., one built on a CLIP‑style vision encoder).
  - Fine‑tunes the model on the SA‑BENCH annotations using a multi‑task loss that predicts all four dimension scores simultaneously.
  - A lightweight fusion head aggregates the four predictions into a single scalar reward, optionally exposing the individual dimension scores for interpretability.
- Integration with Generation Pipelines –
  - GRPO RL: The scalar reward from SA‑IQA replaces traditional pixel‑level or CLIP‑based rewards, guiding the generator toward better spatial aesthetics.
  - Best‑of‑N Filtering: Generate N candidates, evaluate each with SA‑IQA, and keep the top‑k for downstream use (e.g., UI mock‑ups, VR environments).
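The fusion and Best‑of‑N steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper's fusion head is learned, whereas `fuse_scores` below uses a fixed weighted average with hypothetical equal weights, and `score_fn` is a placeholder for whatever returns the four per‑dimension scores for an image. It also assumes higher is better on every dimension, including distortion.

```python
from typing import Callable, Dict, List, Mapping, Optional, Sequence, Tuple

DIMENSIONS = ("layout", "harmony", "lighting", "distortion")

# Hypothetical equal weights; the real fusion head is learned, not fixed.
EQUAL_WEIGHTS: Dict[str, float] = {dim: 0.25 for dim in DIMENSIONS}


def fuse_scores(
    scores: Mapping[str, float],
    weights: Optional[Mapping[str, float]] = None,
) -> float:
    """Weighted average of the four dimension scores (stand-in for the fusion head)."""
    w = EQUAL_WEIGHTS if weights is None else weights
    total_weight = sum(w[dim] for dim in DIMENSIONS)
    return sum(w[dim] * scores[dim] for dim in DIMENSIONS) / total_weight


def best_of_n(
    candidates: Sequence[str],
    score_fn: Callable[[str], Mapping[str, float]],
    k: int = 1,
) -> List[Tuple[str, float]]:
    """Score every candidate and return the top-k (candidate, reward) pairs, best first."""
    scored = [(cand, fuse_scores(score_fn(cand))) for cand in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy per-dimension scores on the 1-5 scale (made-up values, not from the paper).
toy_scores = {
    "room_a.png": {"layout": 4.0, "harmony": 3.0, "lighting": 4.0, "distortion": 5.0},
    "room_b.png": {"layout": 2.0, "harmony": 3.0, "lighting": 2.0, "distortion": 1.0},
    "room_c.png": {"layout": 5.0, "harmony": 4.0, "lighting": 4.0, "distortion": 5.0},
}

top = best_of_n(list(toy_scores), toy_scores.__getitem__, k=2)
print(top)  # [('room_c.png', 4.5), ('room_a.png', 4.0)]
```

Keeping the per‑dimension scores alongside the fused reward preserves the interpretability noted above: a rejected candidate can be traced back to the dimension that pulled its score down.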
Results & Findings
| Metric | SA‑IQA | Prior Art (e.g., CLIP‑IQA, NIQE) |
|---|---|---|
| Pearson Correlation (overall) | 0.78 | 0.52 |
| Dimension‑wise Correlation (layout) | 0.81 | 0.48 |
| Dimension‑wise Correlation (lighting) | 0.74 | 0.45 |
| Best‑of‑N selection gain (top‑1 vs. random, PSNR/SSIM) | +23 % | +9 % |
| RL‑guided generation (FID change; lower is better) | -12 | -4 |
- Benchmark Performance: SA‑IQA consistently outperforms generic IQA metrics across all four dimensions, confirming that the multi‑dimensional reward captures nuances specific to interior scenes.
- RL Boost: When plugged into a GRPO reinforcement learning loop, the generator learns to produce better‑structured rooms with more realistic lighting, reducing the Fréchet Inception Distance (FID) by 12 points, against a 4‑point reduction for the prior‑art reward reported in the table.
- Best‑of‑N: Selecting the top‑ranked images from a batch of 10 improves downstream visual quality metrics by roughly 23 %, demonstrating the practical value of a reliable ranking signal.
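For readers unfamiliar with the correlation figures in the table, the following is a minimal sketch of how a Pearson correlation between model predictions and human ratings can be computed. The function name and the sample values are illustrative assumptions, not the authors' evaluation code.

```python
import math
from typing import Sequence


def pearson(predicted: Sequence[float], human: Sequence[float]) -> float:
    """Pearson linear correlation between predicted scores and human ratings."""
    if len(predicted) != len(human) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_p = sum(predicted) / len(predicted)
    mean_h = sum(human) / len(human)
    cov = sum((p - mean_p) * (h - mean_h) for p, h in zip(predicted, human))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_h = sum((h - mean_h) ** 2 for h in human)
    if var_p == 0 or var_h == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_h)


# Made-up example: model scores vs. mean human ratings for five images.
model_scores = [3.1, 4.2, 2.0, 4.8, 3.6]
human_ratings = [3.0, 4.0, 2.5, 5.0, 3.5]
print(round(pearson(model_scores, human_ratings), 3))
```

A value near 1 means the model ranks and spaces images much as annotators do; a value near 0 means its scores carry little linear information about the human ratings.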
Practical Implications
- Interior Design Tools – SaaS platforms that let users generate room layouts (e.g., virtual staging, AR home‑tour apps) can embed SA‑IQA as a quality filter, ensuring only aesthetically coherent renders are shown to customers.
- Game & VR Asset Pipelines – Procedural environment generators can use the reward to bias asset placement, reducing manual clean‑up time for level designers.
- Content Moderation – Marketplaces that host user‑generated interior images (e.g., home‑decor marketplaces) can automatically flag low‑quality or distorted uploads.
- Model‑agnostic Reward – Because SA‑IQA is a scalar function, it can be swapped into any diffusion or GAN‑based image generator without architectural changes, making it a plug‑and‑play improvement for existing pipelines.
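To illustrate why a scalar reward is easy to plug in, here is a rough sketch of the group‑relative advantage step used in GRPO‑style training: several samples are generated for the same prompt, each is scored, and rewards are normalized within the group. The `reward_fn` argument is a hypothetical stand‑in for a call to SA‑IQA, and using the population standard deviation is an assumption; implementations differ on that detail.

```python
import statistics
from typing import Callable, List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


def score_group(samples: Sequence[str], reward_fn: Callable[[str], float]) -> List[float]:
    """Score each sample with any scalar reward function and return its advantage."""
    rewards = [reward_fn(sample) for sample in samples]
    return group_relative_advantages(rewards)


# Toy rewards for three samples generated from one prompt (made-up values).
toy_reward = {"sample_0": 3.0, "sample_1": 4.0, "sample_2": 5.0}
advantages = score_group(list(toy_reward), toy_reward.__getitem__)
print(advantages)
```

Because `score_group` only requires a function from a sample to a float, swapping one reward model for another changes a single argument and leaves the generator untouched.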
Limitations & Future Work
- Domain Scope – The benchmark focuses on indoor scenes; outdoor or mixed‑environment aesthetics remain unaddressed.
- Subjectivity – Although the four dimensions are well‑defined, aesthetic judgments can vary across cultures; the current annotations reflect a primarily Western crowd.
- Computation Overhead – Running the full MLLM encoder for every generated sample adds latency, which may be prohibitive for real‑time applications.
- Future Directions – Extending SA‑BENCH to other domains (architectural exteriors, urban planning), exploring lightweight distilled versions of SA‑IQA for edge deployment, and incorporating user‑personalized aesthetic preferences via fine‑tuning.
Authors
- Yuan Gao
- Jin Song
Paper Information
- arXiv ID: 2512.05098v1
- Categories: cs.CV, cs.AI
- Published: December 4, 2025