[Paper] AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents
Source: arXiv - 2512.21302v1
Overview
The paper presents AndroidLens, a new benchmark designed to rigorously evaluate mobile GUI agents that automate long‑latency tasks on Android devices. By assembling 571 real‑world tasks across 38 domains (both Chinese and English) and introducing fine‑grained progress metrics, the authors expose the current limits of state‑of‑the‑art agents and point to concrete research and engineering gaps.
Key Contributions
- Large‑scale, diverse task suite – 571 multi‑step tasks (averaging more than 26 actions each) drawn from authentic user scenarios in 38 application domains.
- Nested sub‑target design – each task is broken into hierarchical sub‑goals, enabling evaluation of both high‑level success and intermediate reasoning (see the data‑structure sketch after this list).
- Static evaluation with multiple valid paths – preserves real‑world UI anomalies (ads, pop‑ups, layout changes) while allowing different correct execution traces, reducing bias toward a single “gold” path.
- Dynamic milestone‑based metric (Average Task Progress, ATP) – measures fine‑grained progress rather than a binary success/failure, giving insight into partial competence.
- Comprehensive baseline study – evaluates several leading GUI‑agent models, revealing a best‑case 12.7 % task success and 50.47 % ATP, highlighting the difficulty of long‑latency automation.
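To make the nested sub‑target design concrete, here is a minimal Python sketch of one plausible representation: a task as a tree of sub‑goals whose leaves serve as verifiable checkpoints. The `SubTarget` class and the example goals are illustrative assumptions, not the paper's actual schema.

```python
# Minimal sketch of a nested sub-target hierarchy for one benchmark task.
# The class and example goals are illustrative, not the paper's schema.
from dataclasses import dataclass, field


@dataclass
class SubTarget:
    """A node in a task's goal hierarchy; leaves map to UI checkpoints."""
    description: str
    children: list["SubTarget"] = field(default_factory=list)

    def leaves(self) -> list["SubTarget"]:
        """Flatten the hierarchy into the ordered checkpoints to verify."""
        if not self.children:
            return [self]
        out: list["SubTarget"] = []
        for child in self.children:
            out.extend(child.leaves())
        return out


# Example mirroring the paper's "open app -> navigate to settings ->
# toggle option" illustration, with one sub-goal expanded a level deeper.
task = SubTarget(
    "change a setting",
    children=[
        SubTarget("open app"),
        SubTarget(
            "navigate to settings",
            children=[SubTarget("open menu"), SubTarget("tap Settings")],
        ),
        SubTarget("toggle option"),
    ],
)
print([t.description for t in task.leaves()])
# ['open app', 'open menu', 'tap Settings', 'toggle option']
```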
Methodology
- Task Collection – The authors mined user forums, support tickets, and crowdsourced scripts to extract realistic automation scenarios (e.g., “batch upload photos with size constraints”, “reserve a train ticket while handling captcha”).
- Task Annotation – Each scenario is annotated with a hierarchy of sub‑targets (e.g., “open app → navigate to settings → toggle option”). Multiple valid UI paths are recorded to reflect UI variability.
- Static Evaluation – Agents run against a frozen snapshot of the app UI. The evaluator checks whether the agent's action trace follows any of the recorded valid paths, tolerating UI anomalies such as ads or layout shifts (a matching sketch follows this list).
- Dynamic Evaluation – Agents interact with a live device while the framework tracks pre‑defined milestones (checkpoints). After each action it computes the proportion of completed milestones, yielding the Average Task Progress (ATP) score (see the ATP sketch below).
- Baseline Models – The study tests several recent vision‑language agents (e.g., Pix2Seq‑based, Transformer‑based UI parsers) under identical conditions, reporting success rate and ATP.
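A rough sketch of the static path check described above, assuming traces are sequences of action labels and that anomaly-handling actions (ad or pop-up dismissals) can be filtered out before comparison; all function and action names here are hypothetical:

```python
# Hedged sketch of static-path matching: a trace passes if, after dropping
# ignorable anomaly-handling actions, it equals any recorded valid path.
def trace_matches(agent_trace: list[str],
                  valid_paths: list[list[str]],
                  ignorable: set[str]) -> bool:
    """Return True if the filtered trace equals any recorded gold path."""
    core = [a for a in agent_trace if a not in ignorable]
    return any(core == path for path in valid_paths)


valid_paths = [
    ["open_app", "tap_menu", "tap_settings", "toggle_option"],
    ["open_app", "search_settings", "toggle_option"],  # alternative route
]
trace = ["open_app", "dismiss_popup", "tap_menu", "tap_settings", "toggle_option"]
print(trace_matches(trace, valid_paths, {"dismiss_popup", "close_ad"}))  # True
```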
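And a minimal sketch of the milestone-based ATP metric: per-task progress as the fraction of milestones reached, averaged over all tasks. The paper's exact formula (e.g., any milestone weighting) is not reproduced here; this only illustrates the idea.

```python
# Minimal sketch of Average Task Progress (ATP), assuming unweighted
# milestones; the paper's precise definition may differ.
def task_progress(reached: set[str], milestones: list[str]) -> float:
    """Fraction of a task's milestones the agent completed."""
    return sum(m in reached for m in milestones) / len(milestones)


def average_task_progress(runs: list[tuple[set[str], list[str]]]) -> float:
    """Mean per-task progress across all benchmark tasks."""
    return sum(task_progress(r, m) for r, m in runs) / len(runs)


runs = [
    ({"open_app", "login"}, ["open_app", "login", "add_item", "checkout"]),  # 0.5
    ({"open_app"}, ["open_app", "upload_photo"]),                            # 0.5
]
print(average_task_progress(runs))  # 0.5
```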
Results & Findings
| Metric | Best Model | Average Across Models |
|---|---|---|
| Task Success Rate | 12.7 % | 5.3 % |
| Average Task Progress (ATP) | 50.47 % | 31.2 % |
- Low success despite strong language models – Even the top‑performing agent fails on ~87 % of tasks, confirming that long‑latency, multi‑constraint automation remains an open problem.
- Partial progress is common – Many agents achieve roughly half the milestones, indicating they can navigate UI structures but stumble on constraints, error handling, or memory‑dependent steps.
- Key failure modes:
- Environmental anomalies: unexpected pop‑ups, dynamic ads, and UI layout changes break rigid action sequences.
- Adaptive exploration: agents often cannot decide when to backtrack or try alternative UI paths.
- Long‑term memory: retaining information across >20 steps (e.g., a verification code) is still unreliable.
Practical Implications
- Tooling for enterprise automation – Companies looking to automate repetitive mobile workflows (e.g., bulk data entry, ticket booking) should temper expectations; current agents need substantial engineering (fallback handling, custom scripts) to reach production reliability.
- Benchmark‑driven development – AndroidLens offers a ready‑made test suite for developers building custom GUI bots, enabling rapid iteration on robustness to UI noise and multi‑step reasoning.
- Hybrid approaches – The gap between success rate and ATP suggests a promising direction: combine vision‑language agents with rule‑based controllers or memory modules (e.g., external key‑value stores) to handle constraints and long‑term state (a minimal sketch follows this list).
- Cross‑language support – Inclusion of both Chinese and English tasks highlights the need for multilingual UI understanding, relevant for global apps and localization pipelines.
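As one illustration of the hybrid direction, here is a minimal sketch of an external key-value memory that a rule-based hook writes to when a value appears on screen and the agent reads back many steps later. The class, the regex hook, and the step numbering are assumptions for illustration, not anything the paper specifies.

```python
# Illustrative hybrid-memory sketch: a plain key-value store that outlives
# the agent's context window, addressing the long-term-memory failure mode
# (e.g., carrying a verification code across 20+ steps).
import re


class ExternalMemory:
    """Tiny key-value store kept outside the model's context."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        self._store[key] = value

    def read(self, key: str, default: str = "") -> str:
        return self._store.get(key, default)


memory = ExternalMemory()

# Early step: a rule-based hook captures the code from the screen text ...
screen_text = "Your verification code is 482913."
match = re.search(r"verification code is (\d+)", screen_text)
if match:
    memory.write("verification_code", match.group(1))

# ... many steps later: the agent retrieves it instead of relying on
# the model remembering it in context.
print(memory.read("verification_code"))  # 482913
```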
Limitations & Future Work
- Static snapshot bias – While preserving anomalies, the static mode cannot capture runtime performance variations (network latency, background processes).
- Domain coverage – Although 38 domains are broad, certain enterprise‑grade apps (e.g., finance, healthcare) with strict security flows are not represented.
- Memory evaluation – The benchmark measures progress but does not isolate memory‑specific failures; future work could add explicit “recall” checkpoints.
- Agent diversity – The baseline study focuses on a handful of publicly available models; expanding to proprietary or emerging multimodal agents would further validate the benchmark’s difficulty.
AndroidLens sets a higher bar for mobile GUI automation research and gives developers a realistic yardstick to gauge how far current AI agents are from being truly useful in production environments.
Authors
- Yue Cao
- Yingyao Wang
- Pi Bu
- Jingxuan Xing
- Wei Jiang
- Zekun Zhu
- Junpeng Ma
- Sashuai Zhou
- Tong Lu
- Jun Song
- Yu Cheng
- Yuning Jiang
- Bo Zheng
Paper Information
- arXiv ID: 2512.21302v1
- Categories: cs.CV
- Published: December 24, 2025