[Paper] On the Adoption of AI Coding Agents in Open-source Android and iOS Development
Source: arXiv - 2602.12144v1
Overview
The paper presents the first large‑scale empirical look at how AI‑powered coding assistants (e.g., GitHub Copilot, Code Llama, Claude) are being used in real‑world open‑source Android and iOS projects. By mining 2,901 AI‑authored pull requests (PRs) from 193 repositories, the authors reveal platform‑specific adoption patterns, acceptance rates, and the kinds of tasks where AI contributions succeed—or stumble.
Key Contributions
- Dataset creation – Curated the AIDev dataset, a verified collection of AI‑generated PRs for Android (1,721 PRs) and iOS (1,180 PRs) open‑source apps.
- Cross‑platform comparison – Showed that Android projects receive about 1.5× as many AI PRs as iOS (1,721 vs. 1,180) and enjoy a higher acceptance rate (71 % vs. 63 %).
- Agent‑level analysis – Showed significant variance among different coding agents on Android, highlighting that not all assistants perform equally.
- Task‑category breakdown – Identified that routine tasks (feature additions, bug fixes, UI tweaks) are most likely to be merged, while structural changes (refactors, build‑system edits) face lower acceptance and longer review cycles.
- Temporal evolution – Tracked PR resolution times from 2023 to 2025, finding that Android resolution times improved through mid‑2025 before regressing slightly.
- Baseline for future research – Provides the first quantitative benchmarks for evaluating AI‑generated contributions in mobile OSS, paving the way for platform‑aware agent design.
Methodology
- Data collection – Queried GitHub’s REST API for PRs that explicitly credit an AI tool in the description or commit metadata.
- Verification – Applied a two‑step manual vetting process to ensure the PRs were truly AI‑authored (e.g., checking for generated code snippets, tool‑specific tags).
- Categorization – Mapped each PR to a task category (feature, bug‑fix, UI, refactor, build, docs, etc.) using a combination of keyword heuristics and manual labeling.
- Statistical analysis – Compared acceptance rates, time‑to‑merge, and reviewer comments across platforms, agents, and categories using chi‑square tests and survival analysis for resolution time trends.
- Temporal slicing – Split the data into quarterly windows to observe how AI contribution dynamics evolve over time.
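The collection and verification steps above can be sketched as a simple attribution filter. The marker strings and the `body`/`commit_messages` field names below are hypothetical illustrations; the paper's actual vetting was a two‑step manual process, not a regex.

```python
import re

# Hypothetical markers an AI agent might leave behind; the study's
# real vetting combined metadata checks with manual inspection.
AI_MARKERS = [
    r"github copilot",
    r"code llama",
    r"claude",
    r"co-authored-by:.*\[bot\]",
]
MARKER_RE = re.compile("|".join(AI_MARKERS), re.IGNORECASE)

def looks_ai_authored(pr: dict) -> bool:
    """Return True if the PR description or any commit message credits an AI tool."""
    texts = [pr.get("body", "")] + pr.get("commit_messages", [])
    return any(MARKER_RE.search(t) for t in texts)

prs = [
    {"body": "Generated with GitHub Copilot", "commit_messages": []},
    {"body": "Manual fix for crash on rotation", "commit_messages": []},
]
flagged = [pr for pr in prs if looks_ai_authored(pr)]
```

A filter like this only produces candidates; the study's acceptance numbers rest on the subsequent manual verification pass.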
The pipeline stays lightweight enough for practitioners to replicate while still supporting rigorous, reproducible conclusions.
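The keyword‑heuristic half of the categorization step might look like the sketch below. The keyword lists are illustrative guesses, not the paper's actual rules, and the authors paired heuristics like these with manual labeling.

```python
import re

# Illustrative keyword heuristics; the study's real mapping also
# involved manual labeling of each PR.
CATEGORY_KEYWORDS = {
    "bug-fix": ["fix", "crash", "bug", "npe"],
    "feature": ["add", "implement", "support"],
    "ui": ["ui", "layout", "theme", "dark mode"],
    "refactor": ["refactor", "cleanup", "rename"],
    "build": ["gradle", "ci", "dependency", "build"],
    "docs": ["readme", "docs", "comment"],
}

def categorize(title: str) -> str:
    """Map a PR title to the first matching task category, else 'other'."""
    lowered = title.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        # Word-boundary matching avoids substring traps
        # (e.g. "ui" inside "build").
        if any(re.search(rf"\b{re.escape(kw)}\b", lowered) for kw in keywords):
            return category
    return "other"
```

Dictionary order doubles as match priority here, so bug‑fix keywords win over feature keywords when a title mentions both.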
Results & Findings
| Dimension | Android | iOS |
|---|---|---|
| AI PR volume | 1,721 (≈ 60 % of total) | 1,180 (≈ 40 %) |
| Acceptance rate | 71 % merged | 63 % merged |
| Top‑performing agents | Agent A (78 % merge), Agent B (73 %) | Agent C (68 % merge) – less variance |
| Highest‑acceptance task categories | Feature, Bug‑Fix, UI (≈ 75‑80 % merge) | Same trend, slightly lower (≈ 70‑75 % merge) |
| Lowest‑acceptance task categories | Refactor, Build (≈ 55‑60 % merge) | Refactor, Build (≈ 50‑55 % merge) |
| Resolution time trend | Median time dropped from 5 days (2023 Q1) to 2 days (mid‑2025) then rose to 3 days (late‑2025) | Steady around 4‑5 days, minor fluctuations |
What it means:
- Developers on Android are more willing to accept AI‑generated changes, possibly due to a larger ecosystem of tooling and community norms.
- Routine, well‑scoped changes are where AI agents shine; deeper architectural edits still need human oversight.
- The “sweet spot” for AI contribution speed peaked in mid‑2025, suggesting that recent model improvements translated into faster review cycles—until a saturation or quality dip set in.
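The platform gap in acceptance can be sanity‑checked with a 2×2 chi‑square test. The merged counts below are reconstructed by rounding the reported percentages against the PR totals, so they approximate, rather than reproduce, the paper's raw data.

```python
# Reconstructed 2x2 contingency table (rounded from reported rates):
#             merged   not merged
# Android      1222        499      (~71% of 1,721)
# iOS           743        437      (~63% of 1,180)
observed = [[1222, 499], [743, 437]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-square statistic: sum of (O - E)^2 / E over all four cells.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

# With 1 degree of freedom, the 5% critical value is about 3.84.
significant = chi2 > 3.84
```

On these reconstructed counts the statistic lands around 20.7, comfortably past the 3.84 threshold, consistent with the paper treating the platform difference as significant.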
Practical Implications
- Tool selection: Teams can prioritize agents that have demonstrated higher acceptance on Android (e.g., Agent A) when targeting that platform, while being more cautious on iOS.
- Workflow design: Encourage developers to use AI for incremental features, UI tweaks, and bug fixes, but route refactors and build‑system changes through a stricter review gate or a human‑first approach.
- CI/CD integration: Since AI PRs resolve faster on Android, CI pipelines can be tuned to auto‑merge low‑risk AI contributions after a brief automated verification step, accelerating release cycles.
- Community guidelines: Open‑source maintainers might adopt policies that require explicit AI attribution and a short human sanity‑check checklist, improving reviewer trust and acceptance rates.
- Product road‑mapping: Companies building AI coding assistants can use these baselines to benchmark their models, focusing on improving structural change suggestions to close the acceptance gap.
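The workflow and CI/CD suggestions above amount to a routing policy over task categories. A minimal sketch, with illustrative category names and thresholds that are policy choices rather than values from the paper:

```python
from dataclasses import dataclass

# Categories the study found merge most reliably vs. least reliably;
# the routing decisions below are illustrative, not from the paper.
LOW_RISK_CATEGORIES = {"feature", "bug-fix", "ui"}
STRICT_REVIEW_CATEGORIES = {"refactor", "build"}

@dataclass
class AIPullRequest:
    platform: str        # "android" or "ios"
    category: str        # task-category label
    checks_passed: bool  # automated verification (tests, lint) succeeded

def review_route(pr: AIPullRequest) -> str:
    """Decide how an AI-authored PR enters the review pipeline."""
    if not pr.checks_passed:
        return "reject"
    if pr.category in STRICT_REVIEW_CATEGORIES:
        return "human-first review"
    if pr.category in LOW_RISK_CATEGORIES and pr.platform == "android":
        return "auto-merge candidate"
    return "standard review"
```

Restricting the auto‑merge lane to Android mirrors the study's finding that Android AI PRs both merge more often and resolve faster; an iOS team could widen the lane once its own acceptance data supports it.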
Limitations & Future Work
- Dataset bias: The study only covers public GitHub repositories that voluntarily disclose AI usage, potentially missing private or undisclosed AI contributions.
- Agent granularity: Some PRs list multiple agents or generic “AI assistant,” making it hard to attribute performance to a single model.
- Temporal horizon: The analysis stops at late‑2025; rapid model releases after that point could shift trends dramatically.
- Human factors: The paper does not deeply explore reviewer expertise or project maturity, which could mediate acceptance decisions.
Future research could expand to other mobile ecosystems (e.g., Flutter, React Native), incorporate sentiment analysis of reviewer comments, and experiment with hybrid human‑AI review pipelines to quantify productivity gains.
Authors
- Muhammad Ahmad Khan
- Hasnain Ali
- Muneeb Rana
- Muhammad Saqib Ilyas
- Abdul Ali Bangash
Paper Information
- arXiv ID: 2602.12144v1
- Categories: cs.SE, cs.AI
- Published: February 12, 2026