[Paper] iOSWorld: A Benchmark for Personally Intelligent Phone Agents

Published: June 8, 2026 at 01:27 PM EDT

Source: arXiv - 2606.09764v1

Overview

A useful phone agent needs to be personally intelligent. It should reason over a user’s identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent benchmarks lack this kind of personalization. We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity spanning 26 newly built iOS apps. These apps contain connected data such as transactions, messages, travel records, social relationships, and financial activity. iOSWorld includes 133 tasks across three increasingly difficult categories. Single-app tasks (27) test one app, multi-app tasks (60) span 2 to 8 apps, and memory and personalization tasks (46) require agents to infer patterns from personal data. We evaluate frontier and open-source computer-use models in both vision-only and privileged vision+XML settings. The best configuration reaches 52% overall but only 37% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from added accessibility-tree input. We release iOSWorld as an open-source benchmark with all apps, seeded data, tasks, rubrics, and evaluation code.
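The task breakdown above can be checked with a little arithmetic. The following is a minimal Python sketch, not part of the paper's released evaluation code. The dictionary keys and function names are hypothetical, and it assumes the overall score is a task-weighted average across the three categories, which the abstract does not state.

```python
# Task counts per category, as stated in the abstract.
# The key names are hypothetical labels chosen for this sketch.
TASK_COUNTS = {
    "single_app": 27,
    "multi_app": 60,
    "memory_personalization": 46,
}


def total_tasks(counts=TASK_COUNTS):
    """Total number of tasks across all categories."""
    return sum(counts.values())


def overall_rate(per_category_rate, counts=TASK_COUNTS):
    """Task-weighted average of per-category success rates.

    ASSUMPTION: the overall score weights every task equally.
    """
    weighted = sum(per_category_rate[c] * n for c, n in counts.items())
    return weighted / total_tasks(counts)


def implied_rate_elsewhere(overall, category, category_rate, counts=TASK_COUNTS):
    """Success rate on all other categories combined, implied by an overall
    rate and one category's rate (same equal-weighting assumption)."""
    total = total_tasks(counts)
    rest = total - counts[category]
    return (overall * total - category_rate * counts[category]) / rest


if __name__ == "__main__":
    # The three categories should add up to the 133 tasks reported.
    print(total_tasks())
    # Back-of-the-envelope estimate from the rounded 52% and 37% figures.
    print(round(implied_rate_elsewhere(0.52, "multi_app", 0.37), 3))
```

Because the reported percentages are rounded and the weighting is assumed, any value computed this way is only a rough estimate and not a figure from the paper.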

Key Contributions

This paper is listed under the following arXiv categories:

cs.LG (Machine Learning)
cs.CL (Computation and Language)

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

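iOSWorld is released as open source with all apps, seeded data, tasks, rubrics, and evaluation code, so it can be used to evaluate phone agents on personalized tasks. The reported results leave substantial headroom: the best configuration reaches 52% overall and only 37% on multi-app tasks.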

Authors

Lawrence Keunho Jang
Mareks Woodside
Geronimo Carom
Andrew Keunwoo Jang
Jing Yu Koh
Ruslan Salakhutdinov

Paper Information

arXiv ID: 2606.09764v1
Categories: cs.LG, cs.CL
Published: June 8, 2026
