[Paper] GEBench: Benchmarking Image Generation Models as GUI Environments
Source: arXiv - 2602.09007v1
Overview
The paper introduces GEBench, a new benchmark designed to evaluate how well image‑generation models can predict and render future GUI screens after a user’s action. Unlike existing visual‑quality tests that focus on static images, GEBench stresses state transitions and temporal coherence—the ability to keep a UI logically consistent across a sequence of interactions.
Key Contributions
- GEBench dataset: 700 curated examples spanning five interaction categories: single-step actions, multi-step trajectories, real-world apps, fictional apps, and point-level grounding.
- GE-Score metric: A five-dimensional evaluation suite (see the sketch after this list) that measures:
  - Goal Achievement – does the generated screen satisfy the user instruction?
  - Interaction Logic – are UI state changes (e.g., button press, navigation) plausible?
  - Content Consistency – are existing UI elements preserved correctly across steps?
  - UI Plausibility – does the overall layout follow real-world design conventions?
  - Visual Quality – traditional fidelity (sharpness, color, text legibility).
- Comprehensive baseline study: An evaluation of several state-of-the-art diffusion and transformer-based generators, showing solid single-step performance but sharp drops in multi-step coherence.
- Open-source release: Code, data, and evaluation scripts are publicly available (https://github.com/stepfun-ai/GEBench), encouraging reproducibility and community extensions.
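The five dimensions can be carried as a single record and reduced to one aggregate number. Below is a minimal Python sketch; the `GEScore` class, its field names, and the unweighted mean are illustrative assumptions, not the paper's released implementation (which may weight or combine dimensions differently).

```python
from dataclasses import dataclass, fields


@dataclass
class GEScore:
    """Hypothetical container for the five GE-Score dimensions (names assumed)."""
    goal_achievement: float     # does the screen satisfy the instruction?
    interaction_logic: float    # is the state change plausible?
    content_consistency: float  # are existing elements preserved?
    ui_plausibility: float      # does the layout follow design conventions?
    visual_quality: float       # sharpness, color, text legibility

    def overall(self) -> float:
        # Unweighted mean for illustration; the official metric may differ.
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)


# Example: the single-step baseline numbers from the results table below.
score = GEScore(0.78, 0.71, 0.84, 0.90, 0.88)
print(f"Aggregate GE-Score: {score.overall():.2f}")  # -> 0.82
```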
Methodology
- Dataset construction – The authors collected UI screenshots from open-source mobile/web apps and generated synthetic "fictional" interfaces. Each sample includes:
  - An initial GUI image,
  - A textual user instruction (e.g., "tap the 'Add to Cart' button"),
  - The target GUI after the action and, for multi-step cases, a chain of intermediate states.
- Model interface – Any conditional image-generation model that accepts a GUI + instruction pair can be plugged in. The model predicts the next frame, which is then fed back as input for the next step in a trajectory (see the rollout sketch after this list).
- Scoring pipeline – GE-Score combines automated metrics (e.g., CLIP similarity for visual quality, layout detectors for UI plausibility) with lightweight task-specific classifiers for goal achievement and interaction logic; human verification on a subset validates the automatic scores (a CLIP-based sketch also follows this list).
- Baseline evaluation – The study runs popular generators (Stable Diffusion, DALL-E 3, and layout-aware diffusion variants) under identical prompts and measures performance across all five GE-Score dimensions.
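To make the model interface concrete, here is a minimal rollout sketch in Python. The `GuiGenerator` protocol, its `generate` method, and the `rollout` helper are hypothetical names for the plug-in contract described above; GEBench's released harness may define this interface differently.

```python
from typing import List, Protocol

from PIL import Image


class GuiGenerator(Protocol):
    """Assumed plug-in contract: (current screen, instruction) -> next screen."""

    def generate(self, screen: Image.Image, instruction: str) -> Image.Image:
        ...


def rollout(model: GuiGenerator,
            initial_screen: Image.Image,
            instructions: List[str]) -> List[Image.Image]:
    """Autoregressive multi-step evaluation: each predicted frame becomes the
    conditioning input for the next step, so per-step errors compound."""
    frames: List[Image.Image] = []
    screen = initial_screen
    for instruction in instructions:
        screen = model.generate(screen, instruction)
        frames.append(screen)
    return frames
```

Because the model's own output is reused as conditioning, small inconsistencies accumulate step by step, which is exactly the failure mode the multi-step results below expose.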
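One plausible implementation of the CLIP-similarity component compares image embeddings of the generated and reference screens. The sketch below uses the Hugging Face transformers CLIP API; the checkpoint choice and the use of plain cosine similarity are assumptions, as the summary above does not pin these details.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint choice is an assumption; the paper's pipeline may use another model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_similarity(generated: Image.Image, reference: Image.Image) -> float:
    """Cosine similarity between CLIP embeddings of two screens, a cheap
    automated proxy for visual agreement with the ground truth."""
    inputs = processor(images=[generated, reference], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize embeddings
    return float(feats[0] @ feats[1])
```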
Results & Findings
| Dimension | Best baseline (single-step) | Multi-step performance |
|---|---|---|
| Goal Achievement | 78% | drops to 42% on 5-step trajectories |
| Interaction Logic | 71% | drops to 35% |
| Content Consistency | 84% | drops to 38% |
| UI Plausibility | 90% | modest drop to 80% |
| Visual Quality | 88% (FID-like score) | stable across steps |
Key Takeaways
- Single‑step generation is already fairly reliable; models can render the correct screen after a single command.
- Temporal coherence collapses after 2–3 steps: icons disappear, text is garbled, and layout drifts.
- The identified bottlenecks are (a) accurate interpretation of UI icons (e.g., distinguishing a "share" glyph from a "save" glyph), (b) crisp text rendering (especially for variable-length labels), and (c) precise spatial grounding for point-level instructions (e.g., "click the top-right corner").
Practical Implications
- Prototyping tools: Developers building AI‑assisted UI mockup generators can use GEBench to benchmark not just the look of a single screen but the interactive flow of an app, leading to more usable design assistants.
- Automated testing: QA pipelines could integrate generative models that synthesize plausible UI states for edge-case testing; GEBench provides a sanity check that these synthetic states remain consistent over a test script (a minimal drift check is sketched after this list).
- Low‑code/no‑code platforms: Embedding a generative backend that respects interaction logic can enable end‑users to “talk” to a UI builder (e.g., “add a dropdown below the search bar”). The benchmark highlights where current models would fail, guiding engineering focus.
- Accessibility & localization: Since text rendering is a weak spot, developers aiming to auto‑generate multilingual interfaces must invest in specialized text‑aware diffusion or post‑processing pipelines.
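As a toy version of such a consistency check, one can measure how much of the screen changed outside the region an action was supposed to edit. The `frame_drift` helper below is a hypothetical heuristic, not part of GEBench's pipeline; it assumes two equal-size RGB frames.

```python
from typing import Tuple

import numpy as np
from PIL import Image


def frame_drift(prev: Image.Image, curr: Image.Image,
                changed_box: Tuple[int, int, int, int]) -> float:
    """Mean absolute pixel difference outside `changed_box`
    (left, top, right, bottom), the region the action legitimately edits.
    A large value signals that supposedly static UI chrome drifted.
    Hypothetical heuristic; assumes both frames have the same size."""
    a = np.asarray(prev.convert("RGB"), dtype=np.float32)
    b = np.asarray(curr.convert("RGB"), dtype=np.float32)
    left, top, right, bottom = changed_box
    mask = np.ones(a.shape[:2], dtype=bool)
    mask[top:bottom, left:right] = False  # ignore the intentionally changed region
    return float(np.abs(a - b)[mask].mean())


# Usage: flag a step if drift exceeds a tolerance tuned on known-good transitions,
# e.g. if frame_drift(prev_frame, next_frame, button_box) > 3.0: flag the step.
```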
Limitations & Future Work
- Domain coverage: GEBench focuses on mobile/web UI patterns; specialized domains (e.g., automotive dashboards, VR interfaces) are not represented.
- Metric reliance on classifiers: Some GE‑Score components depend on pretrained detectors that may inherit biases from their training data.
- Scalability of multi‑step evaluation: Longer interaction chains (>10 steps) are not yet included, limiting insight into very deep workflows.
- Future directions suggested by the authors include expanding to 3‑D UI environments, incorporating user interaction timing (latency), and developing hybrid models that combine layout graphs with diffusion for better icon and text fidelity.
Authors
- Haodong Li
- Jingwei Wu
- Quan Sun
- Guopeng Li
- Juanxi Tian
- Huanyu Zhang
- Yanlin Lai
- Ruichuan An
- Hongbo Peng
- Yuhong Dai
- Chenxi Li
- Chunmei Qing
- Jia Wang
- Ziyang Meng
- Zheng Ge
- Xiangyu Zhang
- Daxin Jiang
Paper Information
- arXiv ID: 2602.09007v1
- Categories: cs.AI, cs.CV
- Published: February 9, 2026