[Paper] SWE-Bench Mobile: Can Large Language Model Agents Develop Industry-Level Mobile Applications?
Source: arXiv - 2602.09540v1
Overview
The paper introduces SWE‑Bench Mobile, a new benchmark that puts large‑language‑model (LLM) coding agents through their paces on realistic, production‑grade iOS development tasks. By using real product requirement documents (PRDs), Figma UI designs, and a mixed Swift/Objective‑C codebase, the authors expose how far current agents are from delivering industry‑level mobile apps.
Key Contributions
- A first‑of‑its‑kind mobile‑app benchmark that combines multi‑modal inputs (text specs + design mock‑ups) with a large, real‑world iOS codebase and exhaustive test suites.
- Comprehensive evaluation of 22 agent‑model configurations across four coding agents (three commercial: Cursor, Codex, Claude Code; one open‑source: OpenCode).
- Empirical findings that the best agent solves only 12 % of tasks, revealing a sizable gap between research prototypes and production needs.
- Insightful ablation studies showing that:
  - Agent architecture matters as much as the underlying LLM (up to 6× performance differences).
  - Commercial agents consistently beat open‑source alternatives.
  - Simple “Defensive Programming” prompting outperforms more elaborate prompt engineering by 7.4 % (a sketch of such a prompt follows this list).
- A publicly hosted benchmark platform (https://swebenchmobile.com) that prevents data leakage and provides a leaderboard and toolkit for reproducible research.
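The paper's exact prompts are not reproduced in this summary, so the snippet below is a hypothetical sketch of what a “defensive programming” system prompt might look like, written here as a Swift string constant; the wording and the `defensivePrompt` name are illustrative assumptions, not the authors' actual prompt.

```swift
// Hypothetical sketch of a "defensive programming" style system prompt.
// The exact wording used in the paper is not public in this summary; this
// only illustrates the idea of nudging an agent toward safe, test-driven code.
let defensivePrompt = """
You are an iOS engineer working in a mixed Swift/Objective-C codebase.
Before changing code:
- Read the existing tests and run them to establish a baseline.
- Prefer small, reversible patches over sweeping rewrites.
While changing code:
- Guard against nil/optional misuse and unexpected states; fail loudly in debug builds.
- Keep public APIs and Objective-C interop annotations intact.
After changing code:
- Re-run the full test suite and fix any regressions before finishing.
"""
```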
Methodology
- Task Collection – The authors mined a mature iOS project used in production, extracting 100+ feature‑level tasks that span new feature implementation, UI integration, and bug fixing.
- Multi‑modal Specification – Each task is accompanied by a textual PRD and a corresponding Figma design file, mirroring how developers receive requirements in industry.
- Agent Configurations – The four agents (Cursor, Codex, Claude Code, and the open‑source OpenCode) were each paired with multiple underlying LLMs (e.g., GPT‑4‑ and Claude‑based models), yielding the 22 agent‑model configurations. For each configuration, the authors tried multiple prompt styles (defensive programming, chain‑of‑thought, etc.) and tool‑use settings (e.g., code search, test execution).
- Evaluation Pipeline – Agents generate code patches, which are automatically applied to the codebase and run against a comprehensive test suite. Success requires passing all relevant tests and meeting the specification (a minimal sketch of this loop follows the list).
- Metrics & Analysis – Success rate, time‑to‑completion, and prompt‑efficiency were recorded. Ablation experiments isolate the impact of agent design, model size, and prompting strategy.
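To make the pipeline concrete, here is a minimal Swift sketch of an apply‑patch‑then‑test loop, assuming a git‑managed checkout and an `xcodebuild`‑driven test run; the repository path, scheme name, simulator destination, and pass/fail criterion are placeholder assumptions, and the benchmark's actual harness may differ.

```swift
import Foundation

/// Minimal sketch of an apply-patch-then-test loop.
/// Paths, the scheme name, and the pass/fail criterion are assumptions;
/// the actual SWE-Bench Mobile harness is not published in this summary.
@discardableResult
func run(_ tool: String, _ args: [String], in dir: String) throws -> Int32 {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: "/usr/bin/env")
    process.arguments = [tool] + args
    process.currentDirectoryURL = URL(fileURLWithPath: dir)
    try process.run()
    process.waitUntilExit()
    return process.terminationStatus
}

func evaluate(patch: String, repo: String, scheme: String) throws -> Bool {
    // 1. Apply the agent-generated patch to a clean checkout.
    guard try run("git", ["apply", patch], in: repo) == 0 else { return false }
    // 2. Build and run the task's test suite; a zero exit status here
    //    stands in for "all relevant tests pass".
    return try run("xcodebuild",
                   ["test", "-scheme", scheme,
                    "-destination", "platform=iOS Simulator,name=iPhone 15"],
                   in: repo) == 0
}

// Example usage (hypothetical paths):
// let passed = try evaluate(patch: "agent.patch", repo: "./app", scheme: "App")
```

Gating success on the test suite's exit status mirrors the paper's criterion of passing all relevant tests, though the real harness presumably also checks conformance to the specification.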
Results & Findings
- Overall success: The top‑performing configuration (a commercial agent with defensive‑programming prompts) solved only 12 % of the tasks.
- Agent vs. Model: The same LLM yielded up to a 6× difference in success rate depending on the surrounding agent framework (e.g., how it orchestrates search, test runs, and iteration).
- Commercial vs. Open‑source: Commercial agents (Cursor, Codex, Claude Code) consistently outperformed the open‑source OpenCode baseline (average gap ≈ 4 %).
- Prompting matters: Simple defensive‑programming prompts (encouraging the model to write safe, test‑driven code) beat more complex chain‑of‑thought or “role‑playing” prompts by 7.4 percentage points.
- Failure modes: Most errors stemmed from misunderstanding UI design constraints, misusing Objective‑C/Swift interop, and mishandling asynchronous APIs, issues rarely captured in synthetic benchmarks (the sketch after this list illustrates the async pitfall).
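To illustrate the last failure mode, the hypothetical Swift snippet below shows a common asynchronous‑API mistake of the kind the paper describes, alongside a defensive fix; the `ProfileViewController` type and its methods are invented for illustration and do not come from the benchmark's codebase.

```swift
import UIKit

final class ProfileViewController: UIViewController {
    @IBOutlet private var nameLabel: UILabel!

    // Buggy pattern an agent might produce: URLSession completion handlers
    // run on a background queue, but UIKit must only be touched on the
    // main thread, so this update can crash or silently misbehave.
    func loadProfileBuggy(session: URLSession, url: URL) {
        session.dataTask(with: url) { data, _, _ in
            guard let data = data,
                  let name = String(data: data, encoding: .utf8) else { return }
            self.nameLabel.text = name   // BUG: possible off-main-thread UI update
        }.resume()
    }

    // Defensive version: hop back to the main queue before updating UI,
    // and capture self weakly to avoid a retain cycle.
    func loadProfileFixed(session: URLSession, url: URL) {
        session.dataTask(with: url) { [weak self] data, _, _ in
            guard let data = data,
                  let name = String(data: data, encoding: .utf8) else { return }
            DispatchQueue.main.async {
                self?.nameLabel.text = name   // OK: UI work on the main thread
            }
        }.resume()
    }
}
```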
Practical Implications
- Tooling Vendors – The stark performance gap suggests that current LLM‑powered IDE assistants are not ready for end‑to‑end mobile feature delivery. Vendors should invest in tighter integration with design assets (Figma APIs) and robust test‑driven generation loops.
- Dev Teams – Teams can use SWE‑Bench Mobile as a sanity check for any in‑house coding assistant before relying on it for production work. The benchmark’s “defensive programming” prompt style is a low‑effort win that can be adopted immediately.
- Open‑source Community – The open‑source OpenCode baseline highlights opportunities for community‑driven improvements (e.g., better Swift/Objective‑C tokenization, specialized retrieval over iOS SDK docs).
- Hiring & Skill Assessment – Recruiters could employ the benchmark to gauge a candidate’s ability to work with LLM agents, complementing traditional coding interviews.
- Future Product Roadmaps – Companies building “AI‑first” development platforms now have concrete data points (success rates, failure categories) to prioritize features such as multimodal design ingestion, automated UI testing, and cross‑language code synthesis.
Limitations & Future Work
- Scope limited to iOS – While the benchmark is extensive for Swift/Objective‑C, results may not directly transfer to Android or cross‑platform frameworks.
- Static test suites – The evaluation relies on pre‑written unit/UI tests; real‑world QA often involves exploratory testing that agents currently cannot emulate.
- Prompt engineering space – Only a handful of prompt styles were explored; more sophisticated meta‑prompting or RL‑based prompt optimization could yield higher success.
- Model access constraints – Some commercial agents were evaluated via black‑box APIs, limiting insight into internal model behavior. Future work could evaluate more transparent, open model checkpoints to enable deeper analysis.
The authors invite the community to contribute new tasks, agents, and prompt ideas through the hosted benchmark, aiming to accelerate the journey from “code‑suggestion” to truly autonomous mobile app development.
Authors
- Muxin Tian
- Zhe Wang
- Blair Yang
- Zhenwei Tang
- Kunlun Zhu
- Honghua Dong
- Hanchen Li
- Xinni Xie
- Guangjing Wang
- Jiaxuan You
Paper Information
- arXiv ID: 2602.09540v1
- Categories: cs.SE
- Published: February 10, 2026
- PDF: https://arxiv.org/pdf/2602.09540v1