[Paper] AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents
Source: arXiv - 2512.21302v1
Overview
The paper presents AndroidLens, a new benchmark designed to rigorously evaluate mobile GUI agents that automate long‑latency tasks on Android devices. By assembling 571 real‑world tasks across 38 domains (both Chinese and English) and introducing fine‑grained progress metrics, the authors expose the current limits of state‑of‑the‑art agents and point to concrete research and engineering gaps.
Key Contributions
- Large‑scale, diverse task suite – 571 multi‑step tasks (averaging more than 26 actions each) drawn from authentic user scenarios in 38 application domains.
- Nested sub‑target design – each task is broken into hierarchical sub‑goals, enabling evaluation of both high‑level success and intermediate reasoning (see the data‑structure sketch after this list).
- Static evaluation with multiple valid paths – preserves real‑world UI anomalies (ads, pop‑ups, layout changes) while allowing different correct execution traces, reducing bias toward a single “gold” path.
- Dynamic milestone‑based metric (Average Task Progress, ATP) – measures fine‑grained progress rather than a binary success/failure, giving insight into partial competence.
- Comprehensive baseline study – evaluates several leading GUI‑agent models, revealing a best‑case 12.7 % task success and 50.47 % ATP, highlighting the difficulty of long‑latency automation.
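To make the nested sub‑target design concrete, here is a minimal Python sketch of one plausible representation: a task as a tree of sub‑goals whose leaves serve as verifiable checkpoints. The `SubTarget` class and the example goals are illustrative assumptions, not the paper's actual schema.

```python
# Minimal sketch of a nested sub-target hierarchy for one benchmark task.
# The class and example goals are illustrative, not the paper's schema.
from dataclasses import dataclass, field


@dataclass
class SubTarget:
    """A node in a task's goal hierarchy; leaves map to UI checkpoints."""
    description: str
    children: list["SubTarget"] = field(default_factory=list)

    def leaves(self) -> list["SubTarget"]:
        """Flatten the hierarchy into the ordered checkpoints to verify."""
        if not self.children:
            return [self]
        out: list["SubTarget"] = []
        for child in self.children:
            out.extend(child.leaves())
        return out


# Example mirroring the paper's "open app -> navigate to settings ->
# toggle option" illustration, with one sub-goal expanded a level deeper.
task = SubTarget(
    "change a setting",
    children=[
        SubTarget("open app"),
        SubTarget(
            "navigate to settings",
            children=[SubTarget("open menu"), SubTarget("tap Settings")],
        ),
        SubTarget("toggle option"),
    ],
)
print([t.description for t in task.leaves()])
# ['open app', 'open menu', 'tap Settings', 'toggle option']
```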
Methodology
- Task Collection – The authors mined user forums, support tickets, and crowdsourced scripts to extract realistic automation scenarios (e.g., “batch upload photos with size constraints”, “reserve a train ticket while handling captcha”).
- Task Annotation – Each scenario is annotated with a hierarchy of sub‑targets (e.g., “open app → navigate to settings → toggle option”). Multiple valid UI paths are recorded to reflect UI variability.
- Static Evaluation – Agents run against a frozen snapshot of the app UI. The evaluator checks whether the agent's action trace follows any of the recorded valid paths, tolerating UI anomalies such as ads or layout shifts (a matching sketch follows this list).
- Dynamic Evaluation – Agents interact with a live device while the framework tracks pre‑defined milestones (checkpoints). After each action it computes the proportion of completed milestones, yielding the Average Task Progress (ATP) score (see the ATP sketch below).
- Baseline Models – The study tests several recent vision‑language agents (e.g., Pix2Seq‑based, Transformer‑based UI parsers) under identical conditions, reporting success rate and ATP.
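A rough sketch of the static path check described above, assuming traces are sequences of action labels and that anomaly-handling actions (ad or pop-up dismissals) can be filtered out before comparison; all function and action names here are hypothetical:

```python
# Hedged sketch of static-path matching: a trace passes if, after dropping
# ignorable anomaly-handling actions, it equals any recorded valid path.
def trace_matches(agent_trace: list[str],
                  valid_paths: list[list[str]],
                  ignorable: set[str]) -> bool:
    """Return True if the filtered trace equals any recorded gold path."""
    core = [a for a in agent_trace if a not in ignorable]
    return any(core == path for path in valid_paths)


valid_paths = [
    ["open_app", "tap_menu", "tap_settings", "toggle_option"],
    ["open_app", "search_settings", "toggle_option"],  # alternative route
]
trace = ["open_app", "dismiss_popup", "tap_menu", "tap_settings", "toggle_option"]
print(trace_matches(trace, valid_paths, {"dismiss_popup", "close_ad"}))  # True
```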
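And a minimal sketch of the milestone-based ATP metric: per-task progress as the fraction of milestones reached, averaged over all tasks. The paper's exact formula (e.g., any milestone weighting) is not reproduced here; this only illustrates the idea.

```python
# Minimal sketch of Average Task Progress (ATP), assuming unweighted
# milestones; the paper's precise definition may differ.
def task_progress(reached: set[str], milestones: list[str]) -> float:
    """Fraction of a task's milestones the agent completed."""
    return sum(m in reached for m in milestones) / len(milestones)


def average_task_progress(runs: list[tuple[set[str], list[str]]]) -> float:
    """Mean per-task progress across all benchmark tasks."""
    return sum(task_progress(r, m) for r, m in runs) / len(runs)


runs = [
    ({"open_app", "login"}, ["open_app", "login", "add_item", "checkout"]),  # 0.5
    ({"open_app"}, ["open_app", "upload_photo"]),                            # 0.5
]
print(average_task_progress(runs))  # 0.5
```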
Results & Findings
| Metric | Best Model | Average Across Models |
|---|---|---|
| Task Success Rate | 12.7 % | 5.3 % |
| Average Task Progress (ATP) | 50.47 % | 31.2 % |
- Low success despite strong language models – Even the top‑performing agent fails on ~87 % of tasks, confirming that long‑latency, multi‑constraint automation remains an open problem.
- Partial progress is common – Many agents achieve roughly half the milestones, indicating they can navigate UI structures but stumble on constraints, error handling, or memory‑dependent steps.
- Key failure modes:
- Environmental anomalies: unexpected pop‑ups, dynamic ads, and UI layout changes break rigid action sequences.
- Adaptive exploration: agents often cannot decide when to backtrack or try alternative UI paths.
- Long‑term memory: retaining information across >20 steps (e.g., a verification code) is still unreliable.
Practical Implications
- Tooling for enterprise automation – Companies looking to automate repetitive mobile workflows (e.g., bulk data entry, ticket booking) should temper expectations; current agents need substantial engineering (fallback handling, custom scripts) to reach production reliability.
- Benchmark‑driven development – AndroidLens offers a ready‑made test suite for developers building custom GUI bots, enabling rapid iteration on robustness to UI noise and multi‑step reasoning.
- Hybrid approaches – The gap between success rate and ATP suggests a promising direction: combine vision‑language agents with rule‑based controllers or memory modules (e.g., external key‑value stores) to handle constraints and long‑term state (a minimal sketch follows this list).
- Cross‑language support – Inclusion of both Chinese and English tasks highlights the need for multilingual UI understanding, relevant for global apps and localization pipelines.
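As one illustration of the hybrid direction, here is a minimal sketch of an external key-value memory that a rule-based hook writes to when a value appears on screen and the agent reads back many steps later. The class, the regex hook, and the step numbering are assumptions for illustration, not anything the paper specifies.

```python
# Illustrative hybrid-memory sketch: a plain key-value store that outlives
# the agent's context window, addressing the long-term-memory failure mode
# (e.g., carrying a verification code across 20+ steps).
import re


class ExternalMemory:
    """Tiny key-value store kept outside the model's context."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        self._store[key] = value

    def read(self, key: str, default: str = "") -> str:
        return self._store.get(key, default)


memory = ExternalMemory()

# Early step: a rule-based hook captures the code from the screen text ...
screen_text = "Your verification code is 482913."
match = re.search(r"verification code is (\d+)", screen_text)
if match:
    memory.write("verification_code", match.group(1))

# ... many steps later: the agent retrieves it instead of relying on
# the model remembering it in context.
print(memory.read("verification_code"))  # 482913
```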
Limitations & Future Work
- Static snapshot bias – While preserving anomalies, the static mode cannot capture runtime performance variations (network latency, background processes).
- Domain coverage – Although 38 domains are broad, certain enterprise‑grade apps (e.g., finance, healthcare) with strict security flows are not represented.
- Memory evaluation – The benchmark measures progress but does not isolate memory‑specific failures; future work could add explicit “recall” checkpoints.
- Agent diversity – The baseline study focuses on a handful of publicly available models; expanding to proprietary or emerging multimodal agents would further validate the benchmark’s difficulty.
AndroidLens sets a higher bar for mobile GUI automation research and gives developers a realistic yardstick to gauge how far current AI agents are from being truly useful in production environments.
Authors
- Yue Cao
- Yingyao Wang
- Pi Bu
- Jingxuan Xing
- Wei Jiang
- Zekun Zhu
- Junpeng Ma
- Sashuai Zhou
- Tong Lu
- Jun Song
- Yu Cheng
- Yuning Jiang
- Bo Zheng
Paper Information
- arXiv ID: 2512.21302v1
- Categories: cs.CV
- Published: December 24, 2025