[Paper] GEBench: Benchmarking Image Generation Models as GUI Environments
Source: arXiv - 2602.09007v1
Overview
The paper introduces GEBench, a new benchmark designed to evaluate how well image‑generation models can predict and render future GUI screens after a user’s action. Unlike existing visual‑quality tests that focus on static images, GEBench stresses state transitions and temporal coherence—the ability to keep a UI logically consistent across a sequence of interactions.
Key Contributions
- GEBench dataset: 700 curated examples spanning five interaction categories: single-step actions, multi-step trajectories, real-world apps, fictional apps, and point-level grounding.
- GE-Score metric: A five-dimensional evaluation suite (see the sketch after this list) that measures:
  - Goal Achievement – does the generated screen satisfy the user instruction?
  - Interaction Logic – are UI state changes (e.g., button press, navigation) plausible?
  - Content Consistency – are existing UI elements preserved correctly across steps?
  - UI Plausibility – does the overall layout follow real-world design conventions?
  - Visual Quality – traditional fidelity (sharpness, color, text legibility).
- Comprehensive baseline study: An evaluation of several state-of-the-art diffusion and transformer-based generators, showing solid single-step performance but sharp drops in multi-step coherence.
- Open-source release: Code, data, and evaluation scripts are publicly available (https://github.com/stepfun-ai/GEBench), encouraging reproducibility and community extensions.
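The five dimensions can be carried as a single record and reduced to one aggregate number. Below is a minimal Python sketch; the `GEScore` class, its field names, and the unweighted mean are illustrative assumptions, not the paper's released implementation (which may weight or combine dimensions differently).

```python
from dataclasses import dataclass, fields


@dataclass
class GEScore:
    """Hypothetical container for the five GE-Score dimensions (names assumed)."""
    goal_achievement: float     # does the screen satisfy the instruction?
    interaction_logic: float    # is the state change plausible?
    content_consistency: float  # are existing elements preserved?
    ui_plausibility: float      # does the layout follow design conventions?
    visual_quality: float       # sharpness, color, text legibility

    def overall(self) -> float:
        # Unweighted mean for illustration; the official metric may differ.
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)


# Example: the single-step baseline numbers from the results table below.
score = GEScore(0.78, 0.71, 0.84, 0.90, 0.88)
print(f"Aggregate GE-Score: {score.overall():.2f}")  # -> 0.82
```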
Methodology
- Dataset construction – The authors collected UI screenshots from open-source mobile/web apps and generated synthetic "fictional" interfaces. Each sample includes:
  - An initial GUI image,
  - A textual user instruction (e.g., "tap the 'Add to Cart' button"),
  - The target GUI after the action and, for multi-step cases, a chain of intermediate states.
- Model interface – Any conditional image-generation model that accepts a GUI + instruction pair can be plugged in. The model predicts the next frame, which is then fed back as input for the next step in a trajectory (see the rollout sketch after this list).
- Scoring pipeline – GE-Score combines automated metrics (e.g., CLIP similarity for visual quality, layout detectors for UI plausibility) with lightweight task-specific classifiers for goal achievement and interaction logic; human verification on a subset validates the automatic scores (a CLIP-based sketch also follows this list).
- Baseline evaluation – The study runs popular generators (Stable Diffusion, DALL-E 3, and layout-aware diffusion variants) under identical prompts and measures performance across all five GE-Score dimensions.
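To make the model interface concrete, here is a minimal rollout sketch in Python. The `GuiGenerator` protocol, its `generate` method, and the `rollout` helper are hypothetical names for the plug-in contract described above; GEBench's released harness may define this interface differently.

```python
from typing import List, Protocol

from PIL import Image


class GuiGenerator(Protocol):
    """Assumed plug-in contract: (current screen, instruction) -> next screen."""

    def generate(self, screen: Image.Image, instruction: str) -> Image.Image:
        ...


def rollout(model: GuiGenerator,
            initial_screen: Image.Image,
            instructions: List[str]) -> List[Image.Image]:
    """Autoregressive multi-step evaluation: each predicted frame becomes the
    conditioning input for the next step, so per-step errors compound."""
    frames: List[Image.Image] = []
    screen = initial_screen
    for instruction in instructions:
        screen = model.generate(screen, instruction)
        frames.append(screen)
    return frames
```

Because the model's own output is reused as conditioning, small inconsistencies accumulate step by step, which is exactly the failure mode the multi-step results below expose.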
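One plausible implementation of the CLIP-similarity component compares image embeddings of the generated and reference screens. The sketch below uses the Hugging Face transformers CLIP API; the checkpoint choice and the use of plain cosine similarity are assumptions, as the summary above does not pin these details.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint choice is an assumption; the paper's pipeline may use another model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_similarity(generated: Image.Image, reference: Image.Image) -> float:
    """Cosine similarity between CLIP embeddings of two screens, a cheap
    automated proxy for visual agreement with the ground truth."""
    inputs = processor(images=[generated, reference], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize embeddings
    return float(feats[0] @ feats[1])
```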
Results & Findings
| Dimension | Best baseline (single-step) | Multi-step performance |
|---|---|---|
| Goal Achievement | 78% | drops to 42% on 5-step trajectories |
| Interaction Logic | 71% | drops to 35% |
| Content Consistency | 84% | drops to 38% |
| UI Plausibility | 90% | modest drop to 80% |
| Visual Quality | 88% (FID-like score) | stable across steps |
Key Takeaways
- Single‑step generation is already fairly reliable; models can render the correct screen after a single command.
- Temporal coherence collapses after 2–3 steps: icons disappear, text is garbled, and layout drifts.
- The identified bottlenecks are (a) accurate interpretation of UI icons (e.g., distinguishing a "share" glyph from a "save" glyph), (b) crisp text rendering (especially for variable-length labels), and (c) precise spatial grounding for point-level instructions (e.g., "click the top-right corner").
Practical Implications
- Prototyping tools: Developers building AI‑assisted UI mockup generators can use GEBench to benchmark not just the look of a single screen but the interactive flow of an app, leading to more usable design assistants.
- Automated testing: QA pipelines could integrate generative models that synthesize plausible UI states for edge-case testing; GEBench provides a sanity check that these synthetic states remain consistent over a test script (a minimal drift check is sketched after this list).
- Low‑code/no‑code platforms: Embedding a generative backend that respects interaction logic can enable end‑users to “talk” to a UI builder (e.g., “add a dropdown below the search bar”). The benchmark highlights where current models would fail, guiding engineering focus.
- Accessibility & localization: Since text rendering is a weak spot, developers aiming to auto‑generate multilingual interfaces must invest in specialized text‑aware diffusion or post‑processing pipelines.
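As a toy version of such a consistency check, one can measure how much of the screen changed outside the region an action was supposed to edit. The `frame_drift` helper below is a hypothetical heuristic, not part of GEBench's pipeline; it assumes two equal-size RGB frames.

```python
from typing import Tuple

import numpy as np
from PIL import Image


def frame_drift(prev: Image.Image, curr: Image.Image,
                changed_box: Tuple[int, int, int, int]) -> float:
    """Mean absolute pixel difference outside `changed_box`
    (left, top, right, bottom), the region the action legitimately edits.
    A large value signals that supposedly static UI chrome drifted.
    Hypothetical heuristic; assumes both frames have the same size."""
    a = np.asarray(prev.convert("RGB"), dtype=np.float32)
    b = np.asarray(curr.convert("RGB"), dtype=np.float32)
    left, top, right, bottom = changed_box
    mask = np.ones(a.shape[:2], dtype=bool)
    mask[top:bottom, left:right] = False  # ignore the intentionally changed region
    return float(np.abs(a - b)[mask].mean())


# Usage: flag a step if drift exceeds a tolerance tuned on known-good transitions,
# e.g. if frame_drift(prev_frame, next_frame, button_box) > 3.0: flag the step.
```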
Limitations & Future Work
- Domain coverage: GEBench focuses on mobile/web UI patterns; specialized domains (e.g., automotive dashboards, VR interfaces) are not represented.
- Metric reliance on classifiers: Some GE‑Score components depend on pretrained detectors that may inherit biases from their training data.
- Scalability of multi‑step evaluation: Longer interaction chains (>10 steps) are not yet included, limiting insight into very deep workflows.
- Future directions suggested by the authors include expanding to 3‑D UI environments, incorporating user interaction timing (latency), and developing hybrid models that combine layout graphs with diffusion for better icon and text fidelity.
Authors
- Haodong Li
- Jingwei Wu
- Quan Sun
- Guopeng Li
- Juanxi Tian
- Huanyu Zhang
- Yanlin Lai
- Ruichuan An
- Hongbo Peng
- Yuhong Dai
- Chenxi Li
- Chunmei Qing
- Jia Wang
- Ziyang Meng
- Zheng Ge
- Xiangyu Zhang
- Daxin Jiang
Paper Information
- arXiv ID: 2602.09007v1
- Categories: cs.AI, cs.CV
- Published: February 9, 2026