[Paper] From User Interface to Agent Interface: Efficiency Optimization of UI Representations for LLM Agents
Source: arXiv - 2512.13438v1
Overview
The paper introduces UIFormer, a framework that automatically rewrites user‑interface (UI) representations into more compact forms for Large Language Model (LLM) agents. By trimming the number of UI tokens an LLM has to process, UIFormer speeds up tasks such as automated UI testing, AI‑driven assistants, and cross‑platform navigation—without sacrificing accuracy.
Key Contributions
- First automated UI‑representation optimizer for LLM agents, tackling both token efficiency and functional completeness.
- Domain‑specific language (DSL) that encodes common UI transformation primitives (e.g., pruning invisible nodes, merging similar widgets).
- Constraint‑based synthesis + LLM‑guided refinement: a two‑stage pipeline that narrows the program search space and iteratively improves solutions using correctness and efficiency rewards.
- Lightweight plug‑in architecture that can be dropped into existing LLM‑based agents with negligible code changes.
- Extensive evaluation on Android and Web UI navigation benchmarks (3 datasets, 5 LLM back‑ends) showing 48 %–56 % token reduction and equal or better task performance.
- Real‑world validation through deployment in WeChat’s UI automation pipeline, confirming industrial relevance.
Methodology
- Problem Formulation – The authors view UI optimization as a program synthesis task: given a raw UI tree, synthesize a transformation program that outputs a smaller, semantically equivalent representation.
- DSL Design – The DSL contains a small set of UI‑specific operators (e.g., remove_hidden, collapse_group, abstract_text). This restricts the search space and guarantees that generated programs stay within the UI domain.
- Constraint‑Based Decomposition – UIFormer first breaks the large synthesis problem into smaller sub‑problems (e.g., per screen region) and applies static constraints (type safety, hierarchy preservation) to prune invalid programs early.
- LLM‑Driven Iterative Refinement – A chosen LLM (e.g., GPT‑4, Claude) proposes candidate programs. Each candidate is evaluated with two rewards:
  - Correctness reward – checks that the transformed UI still satisfies a set of functional tests (e.g., the agent can still locate target widgets).
  - Efficiency reward – measures token‑count reduction.
  The LLM is prompted to improve the program until both rewards converge.
- Plug‑in Integration – UIFormer runs as a pre‑processing step: the agent receives the optimized UI representation, executes its normal reasoning, and the plug‑in can optionally post‑process results back to the original UI if needed.
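The pipeline above can be sketched in miniature. The code below is an illustrative reconstruction, not the authors' implementation: the operator names follow the paper (remove_hidden, collapse_group, abstract_text), but the dict-based tree schema, the whitespace token proxy, and the reward definition are assumptions made for the sketch.

```python
# Illustrative sketch of UIFormer-style DSL primitives on a dict-based
# UI tree. Operator names follow the paper; the tree schema, token
# metric, and reward definition are assumptions, not the paper's code.
import json

def remove_hidden(node):
    """Prune subtrees rooted at invisible nodes."""
    if not node.get("visible", True):
        return None
    kept = [c for c in (remove_hidden(c) for c in node.get("children", []))
            if c is not None]
    return {**node, "children": kept}

def collapse_group(node):
    """Replace a single-child container group with its child."""
    children = [collapse_group(c) for c in node.get("children", [])]
    if node.get("type") == "group" and len(children) == 1:
        return children[0]
    return {**node, "children": children}

def abstract_text(node, max_len=20):
    """Truncate long text fields to a compact abstract."""
    out = {**node,
           "children": [abstract_text(c, max_len) for c in node.get("children", [])]}
    if len(out.get("text", "")) > max_len:
        out["text"] = out["text"][:max_len] + "…"
    return out

def token_count(node):
    """Crude token proxy: whitespace-split JSON serialization."""
    return len(json.dumps(node, ensure_ascii=False).split())

def efficiency_reward(original, transformed):
    """Fraction of tokens removed (the efficiency signal)."""
    return 1 - token_count(transformed) / token_count(original)

def optimize_ui(ui):
    """Pre-processing pipeline, as the plug-in would apply it."""
    return abstract_text(collapse_group(remove_hidden(ui)))
```

In the real system the composition and parameters of these operators are what the synthesizer searches over; here they are fixed in `optimize_ui` purely to show the shape of a transformation program.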
Results & Findings
| Benchmark | LLM | Token Reduction | Success‑Rate Change |
|---|---|---|---|
| Android UI‑Nav (3k screens) | GPT‑4 | 52.3 % | +1.2 % |
| Web UI‑Nav (2.5k pages) | Claude 2 | 48.7 % | unchanged |
| Mixed‑Platform (1.8k screens) | Llama‑2‑70B | 55.8 % | +0.8 % |
- Runtime overhead stayed under 120 ms per UI, negligible compared with the LLM inference time.
- Robustness: In >95 % of cases the transformed UI passed the same functional test suite as the original, confirming semantic preservation.
- Industry deployment: At WeChat, UIFormer cut average API payload size by ~50 % and reduced end‑to‑end latency of UI‑automation bots by ~30 ms, enabling higher throughput for daily automated testing runs.
Practical Implications
- Faster LLM agents – Smaller UI payloads mean less context for the LLM to embed, directly lowering token‑based cost (e.g., OpenAI API pricing) and inference latency.
- Scalable UI automation – Teams can run more concurrent UI‑testing bots on the same hardware budget, especially valuable for large mobile/web app suites.
- Edge deployment – On devices with limited bandwidth (e.g., IoT dashboards), transmitting a compact UI representation eases real‑time LLM assistance.
- Plug‑and‑play adoption – Since UIFormer is a thin pre‑processor, existing codebases (Selenium, Appium, custom UI agents) can be upgraded without rewriting core logic.
- Cross‑platform consistency – The DSL abstracts away platform‑specific quirks, allowing a single optimization pipeline for Android, iOS, and web UIs.
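To make the token‑cost point concrete, a quick back‑of‑the‑envelope calculation. The per‑token price and token counts below are hypothetical placeholders (not real provider rates or measurements from the paper); only the ~52 % reduction figure comes from the reported Android result.

```python
# Back-of-the-envelope cost arithmetic for the token-savings claim.
# Price and token counts are illustrative placeholders, not real rates.
PRICE_PER_1K_INPUT_TOKENS = 0.01  # hypothetical $/1K input tokens

def input_cost(ui_tokens, prompt_tokens=500):
    """Cost of one agent step's input context (UI payload + prompt)."""
    return (ui_tokens + prompt_tokens) / 1000 * PRICE_PER_1K_INPUT_TOKENS

baseline  = input_cost(8000)          # raw UI tree
optimized = input_cost(8000 * 0.48)   # after ~52% token reduction
savings   = 1 - optimized / baseline  # ~49% cheaper per call
```

Because the fixed prompt overhead is small relative to the UI payload, a ~52 % payload reduction translates almost one‑for‑one into per‑call input cost savings.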
Limitations & Future Work
- Dependence on functional test oracle – Correctness rewards rely on a set of UI‑level tests; in domains lacking comprehensive test suites, guaranteeing semantic preservation may be harder.
- DSL expressiveness – While the current DSL covers common pruning and abstraction patterns, exotic UI widgets (custom canvas elements, AR overlays) may need extensions.
- LLM bias – The iterative refinement step inherits any hallucination tendencies of the underlying LLM; occasional manual inspection may still be required for safety‑critical applications.
- Future directions suggested by the authors include:
- Learning a data‑driven DSL from large UI corpora.
- Integrating reinforcement learning to replace hand‑crafted rewards.
- Extending UIFormer to handle dynamic, event‑driven UI states (e.g., animations, lazy‑loaded content).
Authors
- Dezhi Ran
- Zhi Gong
- Yuzhe Guo
- Mengzhou Wu
- Yuan Cao
- Haochuan Lu
- Hengyu Zhang
- Xia Zeng
- Gang Cao
- Liangchao Yao
- Yuetang Deng
- Wei Yang
- Tao Xie
Paper Information
- arXiv ID: 2512.13438v1
- Categories: cs.SE, cs.AI
- Published: December 15, 2025