[Paper] MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
Source: arXiv - 2512.22047v1
Overview
The MAI‑UI technical report introduces a new family of “foundation GUI agents” that can understand and operate real‑world graphical user interfaces (GUIs) across devices—from compact 2B‑parameter models up to a 235B‑parameter variant. By tackling the gap between research prototypes and production‑ready agents, the authors demonstrate that large‑scale, self‑evolving agents can reliably navigate and manipulate modern mobile and desktop UIs while preserving privacy and minimizing cloud dependence.
Key Contributions
- A spectrum of foundation GUI agents (2B, 8B, 32B, and 235B‑A22B) that can be swapped depending on latency, compute budget, or privacy requirements.
- Self‑evolving data pipeline that continuously augments training data with real user interactions and tool‑call traces, turning static UI screenshots into rich, action‑oriented datasets.
- Native device‑cloud collaboration architecture that routes tasks between on‑device inference and cloud‑backed models, cutting cloud calls by >40 % and boosting on‑device speed by 33 %.
- Scalable online reinforcement‑learning (RL) framework with optimizations for parallel environments (up to 512 workers) and extended context windows, delivering consistent performance gains.
- State‑of‑the‑art results on multiple GUI grounding benchmarks (ScreenSpot‑Pro, MMBench‑GUI L2, OSWorld‑G, UI‑Vision) and navigation benchmarks (AndroidWorld, MobileWorld), surpassing leading baselines such as Gemini‑3‑Pro and Seed 1.8.
Methodology
- Data Collection & Evolution – Starting from existing UI‑only datasets, the team runs agents in the wild, captures user‑agent interaction logs (clicks, swipes, text entry) and MCP (Model Context Protocol) tool calls, then feeds these back into the training loop. This creates a continuously improving corpus that reflects real usage patterns (a minimal log‑to‑example sketch follows this list).
- Model Architecture – All agents share a common transformer backbone but differ in size. The architecture is augmented with a GUI grounding head (pixel‑to‑element mapping) and a policy head that predicts the next UI action (e.g., tap, scroll, type).
- Device‑Cloud Collaboration – A lightweight runtime on the device decides, per step, whether the next inference can be satisfied locally or needs cloud assistance (e.g., for complex reasoning). The decision is based on the current latency budget, privacy flags, and model confidence (see the routing sketch after this list).
- Online RL Training – Agents are fine‑tuned in a simulated environment pool that mirrors Android/iOS UI flows. Parallelism is scaled from 32 to 512 environments, and the step budget per episode is increased from 15 to 50, allowing the policy to learn longer‑horizon strategies (see the parallel‑rollout sketch after this list).
- Optimization Tricks – Gradient checkpointing, mixed‑precision training, and a dynamic context‑length scheduler keep memory usage tractable even for the 235B model (a short PyTorch sketch follows this list).
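To make the self‑evolving data pipeline (first item above) concrete, here is a minimal sketch of how captured interaction logs could be turned into per‑step, action‑oriented training examples. The log schema, field names, and the TrainingExample fields are illustrative assumptions, not the paper's actual format.

```python
import json
from dataclasses import dataclass

@dataclass
class TrainingExample:
    screenshot: str        # path to the screenshot captured before the action (assumed field)
    instruction: str       # natural-language goal for the episode
    history: list          # serialized previous actions, kept for context
    target_action: dict    # the action that was actually taken at this step

def log_to_examples(log_path: str) -> list:
    """Convert one episode log (clicks, swipes, text entry, tool calls)
    into per-step supervised training examples."""
    with open(log_path) as f:
        episode = json.load(f)      # assumed JSON layout: {"goal": ..., "steps": [...]}

    examples, history = [], []
    for step in episode["steps"]:
        examples.append(TrainingExample(
            screenshot=step["screenshot"],
            instruction=episode["goal"],
            history=list(history),
            target_action=step["action"],   # e.g. {"type": "tap", "x": 312, "y": 904}
        ))
        history.append(json.dumps(step["action"]))
    return examples
```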
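The per‑step device‑versus‑cloud decision can be pictured as a small routing function. The thresholds and field names below are assumptions; the report only states that the latency budget, privacy flags, and model confidence drive the choice.

```python
from dataclasses import dataclass

@dataclass
class StepContext:
    latency_budget_ms: float    # time remaining for this action
    privacy_sensitive: bool     # e.g. the screen shows credentials or personal data
    local_confidence: float     # on-device model's confidence in its proposed action

CLOUD_ROUND_TRIP_MS = 600.0     # assumed latency of a cloud call
CONFIDENCE_THRESHOLD = 0.7      # assumed cutoff below which the step is escalated

def route(step: StepContext) -> str:
    """Return 'device' or 'cloud' for the next inference step."""
    if step.privacy_sensitive:
        return "device"                          # privacy flag pins the step on-device
    if step.latency_budget_ms < CLOUD_ROUND_TRIP_MS:
        return "device"                          # no time left for a cloud round trip
    if step.local_confidence < CONFIDENCE_THRESHOLD:
        return "cloud"                           # hard step: hand off to the larger model
    return "device"                              # default: keep easy steps local

# Example: a low-confidence step with slack in the latency budget goes to the cloud.
print(route(StepContext(latency_budget_ms=1200.0,
                        privacy_sensitive=False,
                        local_confidence=0.4)))  # -> cloud
```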
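The online RL setup collects rollouts from many simulated UI environments in parallel (32 → 512) with a longer step budget (15 → 50). The sketch below shows one way such a collector could look; the dummy environment and random policy are placeholders, not the paper's simulator or training code.

```python
import random
from concurrent.futures import ThreadPoolExecutor

NUM_ENVS = 512        # parallel simulated UI environments (scaled up from 32)
MAX_STEPS = 50        # extended per-episode step budget (up from 15)

class DummyUIEnv:
    """Stand-in for a simulated Android/iOS UI environment."""
    def reset(self):
        self.t = 0
        return {"screenshot": None, "step": self.t}

    def step(self, action):
        self.t += 1
        done = self.t >= MAX_STEPS or random.random() < 0.05   # episode may end early
        return {"screenshot": None, "step": self.t}, float(done), done

def policy(obs):
    """Placeholder policy: pick a random UI action."""
    return random.choice(["tap", "scroll", "type", "back"])

def rollout(env):
    """Run one episode in a single environment and return its transitions."""
    obs, trajectory = env.reset(), []
    for _ in range(MAX_STEPS):
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward, next_obs, done))
        obs = next_obs
        if done:
            break
    return trajectory

def collect_batch(envs):
    """Collect one batch of trajectories from all environments in parallel."""
    with ThreadPoolExecutor(max_workers=min(len(envs), 64)) as pool:  # cap threads locally
        return list(pool.map(rollout, envs))

if __name__ == "__main__":
    batch = collect_batch([DummyUIEnv() for _ in range(NUM_ENVS)])
    print(sum(len(t) for t in batch), "transitions collected")
```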
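Finally, a short PyTorch sketch of the memory‑saving tricks named in the last item: mixed‑precision training combined with gradient checkpointing. The toy model and dummy loss are placeholders, and the dynamic context‑length scheduler is only hinted at through max_len.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Toy residual block standing in for a transformer layer."""
    def __init__(self, dim):
        super().__init__()
        self.ff = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.GELU())

    def forward(self, x):
        return x + self.ff(x)

class TinyModel(torch.nn.Module):
    def __init__(self, dim=256, depth=4):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for block in self.blocks:
            # Gradient checkpointing: recompute activations in the backward pass
            # instead of keeping them all in memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

max_len = 128                                   # stand-in for a scheduled context length
x = torch.randn(2, max_len, 256, device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device, dtype=torch.bfloat16):  # mixed precision
    loss = model(x).pow(2).mean()               # dummy loss
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```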
Results & Findings
| Benchmark | Metric (higher = better) | MAI‑UI (best variant) | Prior Best |
|---|---|---|---|
| ScreenSpot‑Pro (GUI grounding) | Accuracy | 73.5 % | Gemini‑3‑Pro (≈71 %) |
| MMBench‑GUI L2 | Accuracy | 91.3 % | Seed 1.8 (≈88 %) |
| OSWorld‑G | Accuracy | 70.9 % | Gemini‑3‑Pro (≈68 %) |
| UI‑Vision | Accuracy | 49.2 % | Seed 1.8 (≈45 %) |
| AndroidWorld (navigation) | Success rate | 76.7 % | UI‑TARS‑2 (≈73 %) |
| MobileWorld (navigation) | Success rate | 41.7 % | End‑to‑end GUI models (~30 %) |
RL scaling experiments: increasing the number of parallel environments from 32 to 512 added +5.2 percentage points; extending the step budget from 15 to 50 added +4.3 percentage points.
The native device‑cloud system reduced average latency per action by 33 %, cut cloud API calls by >40 %, and kept user data on‑device, addressing privacy concerns.
Practical Implications
- Developer Tooling – MAI‑UI can be wrapped as a plug‑and‑play SDK for mobile apps, enabling features like automated UI testing, in‑app assistants, or accessibility helpers without writing custom scripts.
- Edge‑First Deployments – Smaller 2B/8B variants run entirely on‑device, making them suitable for low‑power IoT devices, wearables, or privacy‑sensitive applications (e.g., banking apps).
- Reduced Cloud Costs – The collaboration layer means only “hard” reasoning steps hit the cloud, slashing bandwidth and compute bills for large‑scale deployments (e.g., enterprise device fleets).
- Rapid Prototyping – The self‑evolving pipeline automatically incorporates new UI patterns as apps update, so developers spend less time curating training data and more time building features.
- Cross‑Platform Consistency – Because the same model family can handle Android, iOS, and desktop UIs, teams can maintain a single agent codebase across platforms, simplifying maintenance.
Limitations & Future Work
- Dynamic UI Variability – Extremely custom or rapidly changing UI elements (e.g., dynamic ads) still cause occasional failures.
- Resource Footprint for Largest Model – The 235B‑parameter variant requires high‑end GPUs/TPUs and is currently only practical in a cloud setting; further model compression work is needed for broader edge use.
- Evaluation Scope – Benchmarks focus on navigation and grounding; richer multi‑modal tasks (e.g., voice‑guided UI control, multimodal reasoning across text and graphics) remain under‑explored.
- Privacy Guarantees – While on‑device inference reduces data exposure, the system still sends occasional context to the cloud; formal privacy audits and differential‑privacy mechanisms are planned.
Future directions include tighter integration with OS accessibility APIs, extending the RL curriculum to multi‑task scenarios (e.g., form filling + error recovery), and exploring distillation techniques to bring near‑state‑of‑the‑art performance to sub‑100 MB models.
Authors
- Hanzhang Zhou
- Xu Zhang
- Panrong Tong
- Jianan Zhang
- Liangyu Chen
- Quyu Kong
- Chenglin Cai
- Chen Liu
- Yue Wang
- Jingren Zhou
- Steven Hoi
Paper Information
- arXiv ID: 2512.22047v1
- Categories: cs.CV
- Published: December 26, 2025