[Paper] MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
Source: arXiv - 2512.22047v1
Overview
The MAI‑UI technical report introduces a new family of “foundation GUI agents” that can understand and operate real‑world graphical user interfaces (GUIs) across devices—from compact 2B‑parameter models up to a 235B‑parameter variant. By tackling the gap between research prototypes and production‑ready agents, the authors demonstrate that large‑scale, self‑evolving agents can reliably navigate and manipulate modern mobile and desktop UIs while preserving privacy and minimizing cloud dependence.
Key Contributions
- A spectrum of foundation GUI agents (2B, 8B, 32B, and 235B‑A22B) that can be swapped depending on latency, compute budget, or privacy requirements.
- Self‑evolving data pipeline that continuously augments training data with real user interactions and tool‑call traces, turning static UI screenshots into rich, action‑oriented datasets.
- Native device‑cloud collaboration architecture that routes tasks between on‑device inference and cloud‑backed models, cutting cloud calls by >40 % and boosting on‑device speed by 33 %.
- Scalable online reinforcement‑learning (RL) framework with optimizations for parallel environments (up to 512 workers) and extended context windows, delivering consistent performance gains.
- State‑of‑the‑art results on multiple GUI grounding benchmarks (ScreenSpot‑Pro, MMBench‑GUI L2, OSWorld‑G, UI‑Vision) and navigation benchmarks (AndroidWorld, MobileWorld), surpassing leading baselines such as Gemini‑3‑Pro and Seed 1.8.
Methodology
- Data Collection & Evolution – Starting from existing UI‑only datasets, the team runs agents in the wild, captures user‑agent interaction logs (clicks, swipes, text entry) and MCP (Model Context Protocol) tool calls, then feeds these back into the training loop. This creates a continuously improving corpus that reflects real usage patterns (a minimal log‑to‑example sketch follows this list).
- Model Architecture – All agents share a common transformer backbone but differ in size. The architecture is augmented with a GUI grounding head (pixel‑to‑element mapping) and a policy head that predicts the next UI action (e.g., tap, scroll, type).
- Device‑Cloud Collaboration – A lightweight runtime on the device decides, per step, whether the next inference can be satisfied locally or needs cloud assistance (e.g., for complex reasoning). The decision is based on the current latency budget, privacy flags, and model confidence (see the routing sketch after this list).
- Online RL Training – Agents are fine‑tuned in a simulated environment pool that mirrors Android/iOS UI flows. Parallelism is scaled from 32 to 512 environments, and the step budget per episode is increased from 15 to 50, allowing the policy to learn longer‑horizon strategies (see the parallel‑rollout sketch after this list).
- Optimization Tricks – Gradient checkpointing, mixed‑precision training, and a dynamic context‑length scheduler keep memory usage tractable even for the 235B model (a short PyTorch sketch follows this list).
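To make the self‑evolving data pipeline (first item above) concrete, here is a minimal sketch of how captured interaction logs could be turned into per‑step, action‑oriented training examples. The log schema, field names, and the TrainingExample fields are illustrative assumptions, not the paper's actual format.

```python
import json
from dataclasses import dataclass

@dataclass
class TrainingExample:
    screenshot: str        # path to the screenshot captured before the action (assumed field)
    instruction: str       # natural-language goal for the episode
    history: list          # serialized previous actions, kept for context
    target_action: dict    # the action that was actually taken at this step

def log_to_examples(log_path: str) -> list:
    """Convert one episode log (clicks, swipes, text entry, tool calls)
    into per-step supervised training examples."""
    with open(log_path) as f:
        episode = json.load(f)      # assumed JSON layout: {"goal": ..., "steps": [...]}

    examples, history = [], []
    for step in episode["steps"]:
        examples.append(TrainingExample(
            screenshot=step["screenshot"],
            instruction=episode["goal"],
            history=list(history),
            target_action=step["action"],   # e.g. {"type": "tap", "x": 312, "y": 904}
        ))
        history.append(json.dumps(step["action"]))
    return examples
```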
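The per‑step device‑versus‑cloud decision can be pictured as a small routing function. The thresholds and field names below are assumptions; the report only states that the latency budget, privacy flags, and model confidence drive the choice.

```python
from dataclasses import dataclass

@dataclass
class StepContext:
    latency_budget_ms: float    # time remaining for this action
    privacy_sensitive: bool     # e.g. the screen shows credentials or personal data
    local_confidence: float     # on-device model's confidence in its proposed action

CLOUD_ROUND_TRIP_MS = 600.0     # assumed latency of a cloud call
CONFIDENCE_THRESHOLD = 0.7      # assumed cutoff below which the step is escalated

def route(step: StepContext) -> str:
    """Return 'device' or 'cloud' for the next inference step."""
    if step.privacy_sensitive:
        return "device"                          # privacy flag pins the step on-device
    if step.latency_budget_ms < CLOUD_ROUND_TRIP_MS:
        return "device"                          # no time left for a cloud round trip
    if step.local_confidence < CONFIDENCE_THRESHOLD:
        return "cloud"                           # hard step: hand off to the larger model
    return "device"                              # default: keep easy steps local

# Example: a low-confidence step with slack in the latency budget goes to the cloud.
print(route(StepContext(latency_budget_ms=1200.0,
                        privacy_sensitive=False,
                        local_confidence=0.4)))  # -> cloud
```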
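The online RL setup collects rollouts from many simulated UI environments in parallel (32 → 512) with a longer step budget (15 → 50). The sketch below shows one way such a collector could look; the dummy environment and random policy are placeholders, not the paper's simulator or training code.

```python
import random
from concurrent.futures import ThreadPoolExecutor

NUM_ENVS = 512        # parallel simulated UI environments (scaled up from 32)
MAX_STEPS = 50        # extended per-episode step budget (up from 15)

class DummyUIEnv:
    """Stand-in for a simulated Android/iOS UI environment."""
    def reset(self):
        self.t = 0
        return {"screenshot": None, "step": self.t}

    def step(self, action):
        self.t += 1
        done = self.t >= MAX_STEPS or random.random() < 0.05   # episode may end early
        return {"screenshot": None, "step": self.t}, float(done), done

def policy(obs):
    """Placeholder policy: pick a random UI action."""
    return random.choice(["tap", "scroll", "type", "back"])

def rollout(env):
    """Run one episode in a single environment and return its transitions."""
    obs, trajectory = env.reset(), []
    for _ in range(MAX_STEPS):
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward, next_obs, done))
        obs = next_obs
        if done:
            break
    return trajectory

def collect_batch(envs):
    """Collect one batch of trajectories from all environments in parallel."""
    with ThreadPoolExecutor(max_workers=min(len(envs), 64)) as pool:  # cap threads locally
        return list(pool.map(rollout, envs))

if __name__ == "__main__":
    batch = collect_batch([DummyUIEnv() for _ in range(NUM_ENVS)])
    print(sum(len(t) for t in batch), "transitions collected")
```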
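Finally, a short PyTorch sketch of the memory‑saving tricks named in the last item: mixed‑precision training combined with gradient checkpointing. The toy model and dummy loss are placeholders, and the dynamic context‑length scheduler is only hinted at through max_len.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Toy residual block standing in for a transformer layer."""
    def __init__(self, dim):
        super().__init__()
        self.ff = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.GELU())

    def forward(self, x):
        return x + self.ff(x)

class TinyModel(torch.nn.Module):
    def __init__(self, dim=256, depth=4):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for block in self.blocks:
            # Gradient checkpointing: recompute activations in the backward pass
            # instead of keeping them all in memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

max_len = 128                                   # stand-in for a scheduled context length
x = torch.randn(2, max_len, 256, device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device, dtype=torch.bfloat16):  # mixed precision
    loss = model(x).pow(2).mean()               # dummy loss
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```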
Results & Findings
| Benchmark | Metric (higher = better) | MAI‑UI (best variant) | Prior Best |
|---|---|---|---|
| ScreenSpot‑Pro (GUI grounding) | Accuracy | 73.5 % | Gemini‑3‑Pro (≈71 %) |
| MMBench‑GUI L2 | Accuracy | 91.3 % | Seed 1.8 (≈88 %) |
| OSWorld‑G | Accuracy | 70.9 % | Gemini‑3‑Pro (≈68 %) |
| UI‑Vision | Accuracy | 49.2 % | Seed 1.8 (≈45 %) |
| AndroidWorld (navigation) | Success rate | 76.7 % | UI‑TARS‑2 (≈73 %) |
| MobileWorld (navigation) | Success rate | 41.7 % | End‑to‑end GUI models (~30 %) |
RL scaling experiments: increasing the number of parallel environments from 32 to 512 added +5.2 percentage points; extending the step budget from 15 to 50 added +4.3 percentage points.
The native device‑cloud system reduced average latency per action by 33 %, cut cloud API calls by >40 %, and kept user data on‑device, addressing privacy concerns.
Practical Implications
- Developer Tooling – MAI‑UI can be wrapped as a plug‑and‑play SDK for mobile apps, enabling features like automated UI testing, in‑app assistants, or accessibility helpers without writing custom scripts.
- Edge‑First Deployments – Smaller 2B/8B variants run entirely on‑device, making them suitable for low‑power IoT devices, wearables, or privacy‑sensitive applications (e.g., banking apps).
- Reduced Cloud Costs – The collaboration layer means only “hard” reasoning steps hit the cloud, slashing bandwidth and compute bills for large‑scale deployments (e.g., enterprise device fleets).
- Rapid Prototyping – The self‑evolving pipeline automatically incorporates new UI patterns as apps update, so developers spend less time curating training data and more time building features.
- Cross‑Platform Consistency – Because the same model family can handle Android, iOS, and desktop UIs, teams can maintain a single agent codebase across platforms, simplifying maintenance.
Limitations & Future Work
- Dynamic UI Variability – Extremely custom or rapidly changing UI elements (e.g., dynamic ads) still cause occasional failures.
- Resource Footprint for Largest Model – The 235B‑parameter variant requires high‑end GPUs/TPUs and is currently only practical in a cloud setting; further model compression work is needed for broader edge use.
- Evaluation Scope – Benchmarks focus on navigation and grounding; richer multi‑modal tasks (e.g., voice‑guided UI control, multimodal reasoning across text and graphics) remain under‑explored.
- Privacy Guarantees – While on‑device inference reduces data exposure, the system still sends occasional context to the cloud; formal privacy audits and differential‑privacy mechanisms are planned.
Future directions include tighter integration with OS accessibility APIs, extending the RL curriculum to multi‑task scenarios (e.g., form filling + error recovery), and exploring distillation techniques to bring near‑state‑of‑the‑art performance to sub‑100 MB models.
Authors
- Hanzhang Zhou
- Xu Zhang
- Panrong Tong
- Jianan Zhang
- Liangyu Chen
- Quyu Kong
- Chenglin Cai
- Chen Liu
- Yue Wang
- Jingren Zhou
- Steven Hoi
Paper Information
- arXiv ID: 2512.22047v1
- Categories: cs.CV
- Published: December 26, 2025