[Paper] Security in the Age of AI Teammates: An Empirical Study of Agentic Pull Requests on GitHub
Source: arXiv - 2601.00477v1
Overview
The paper Security in the Age of AI Teammates examines how autonomous coding agents—think GitHub Copilot‑style bots that open pull requests (PRs) on their own—affect software security in real‑world projects. By mining over 33k agent‑authored PRs from popular repositories, the authors quantify the prevalence of security‑related changes, how often they get merged, and what factors tip the scales toward acceptance or rejection.
Key Contributions
- Large‑scale empirical dataset: Curated 33k agent‑authored PRs from the AIDev dataset and identified 1,293 manually verified security‑related PRs.
- Taxonomy of security actions: Open‑coded a set of recurring security‑related intents (e.g., hardening tests, config tweaks, error‑handling improvements).
- Acceptance analysis: Showed security‑focused agent PRs make up ~4 % of all agent activity but have lower merge rates and longer review times than non‑security PRs.
- Signal detection for rejection: Found PR complexity (size, number of files changed) and verbosity are stronger predictors of rejection than the specific security topic.
- Cross‑ecosystem insights: Compared behavior across major languages (JavaScript, Python, Java, etc.) and highlighted ecosystem‑specific patterns.
Methodology
- Data collection – Leveraged the public AIDev dataset, which tracks PRs authored by known autonomous coding agents (e.g., OpenAI Codex, Devin, GitHub Copilot).
- Security PR identification – Applied a keyword filter (e.g., “security”, “vulnerability”, “hardening”) to PR titles, bodies, and changed files, then manually validated each candidate to eliminate false positives (a minimal filtering sketch follows this list).
- Quantitative analysis – Measured prevalence, merge ratio, and review latency, stratified by agent, programming language, and change type (test, config, code, docs); a metrics sketch also appears below.
- Qualitative coding – Performed open coding on a random sample of security PRs to build a taxonomy of security intents.
- Signal mining – Extracted PR metadata (lines added/removed, number of files, comment count, reviewer count) and used statistical tests to correlate these with merge outcomes.
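To make the identification step concrete, here is a minimal sketch of what such a keyword filter might look like. The keyword list, field names, and regex are illustrative assumptions, not the authors' exact filter; the deliberately high recall (e.g., “auth” also matching “author”) is exactly why manual validation follows.

```python
import re

# Illustrative keyword list; the paper's exact filter terms are not reproduced here.
SECURITY_KEYWORDS = [
    "security", "vulnerability", "hardening", "xss", "csrf",
    "injection", "sanitize", "cve", "auth", "crypto",
]
PATTERN = re.compile(r"\b(" + "|".join(SECURITY_KEYWORDS) + r")", re.IGNORECASE)

def is_security_candidate(pr):
    """Flag a PR as a security candidate if any keyword appears in its title,
    body, or changed-file paths. High recall by design ("auth" also matches
    "author"), so candidates still need manual validation."""
    haystacks = [pr.get("title", ""), pr.get("body", ""), *pr.get("changed_files", [])]
    return any(PATTERN.search(text) for text in haystacks)

# Example: this PR matches on "hardening" in the title.
pr = {
    "title": "Add input hardening for the upload endpoint",
    "body": "Rejects oversized payloads before parsing.",
    "changed_files": ["server/upload.py", "tests/test_upload.py"],
}
print(is_security_candidate(pr))  # True
```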
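And a small sketch of the merge‑ratio and review‑latency metrics, computed per agent from hypothetical PR records; the field names are assumptions, not the AIDev schema.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical PR records; field names are assumptions, not the AIDev schema.
prs = [
    {"agent": "agent-a", "merged": True,
     "opened": "2025-06-01T10:00:00+00:00", "closed": "2025-06-02T09:00:00+00:00"},
    {"agent": "agent-a", "merged": False,
     "opened": "2025-06-03T08:00:00+00:00", "closed": "2025-06-06T08:00:00+00:00"},
]

stats = defaultdict(lambda: {"merged": 0, "total": 0, "latency_h": []})
for pr in prs:
    s = stats[pr["agent"]]
    s["total"] += 1
    s["merged"] += pr["merged"]  # booleans count as 0/1
    opened = datetime.fromisoformat(pr["opened"])
    closed = datetime.fromisoformat(pr["closed"])
    s["latency_h"].append((closed - opened).total_seconds() / 3600)

for agent, s in stats.items():
    ratio = s["merged"] / s["total"]
    latency = sum(s["latency_h"]) / len(s["latency_h"])
    print(f"{agent}: merge ratio {ratio:.0%}, mean review latency {latency:.1f} h")
```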
Results & Findings
- Security PR share: ~4 % of all autonomous PRs target security, indicating agents are already contributing beyond trivial syntax fixes.
- Dominant actions: The most common security‑related contributions are supportive—adding tests, updating docs, tweaking configurations, and improving error handling—rather than direct vulnerability patches.
- Merge outcomes: Only 58 % of security PRs were merged versus 71 % of non‑security PRs.
- Review latency: Security PRs linger ~30 % longer in review queues, reflecting extra human scrutiny.
- Rejection predictors: Larger diff size, higher file count, and verbose commit messages correlate strongly with rejection; the specific security keyword (e.g., “XSS”) has a weaker effect (a sketch of one such comparison follows this list).
- Ecosystem variance: Python and JavaScript agents produce a higher proportion of security PRs, while Java agents see higher merge rates, likely due to stricter CI pipelines in the Java ecosystem.
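The paper does not state which statistical tests produced these correlations; as one plausible approach, the sketch below applies a Mann‑Whitney U test (a common non‑parametric choice) to hypothetical diff sizes to ask whether rejected PRs tend to be larger.

```python
from scipy.stats import mannwhitneyu

# Hypothetical diff sizes (total lines changed); not data from the paper.
merged_sizes = [12, 30, 8, 45, 22, 17, 60, 9]
rejected_sizes = [150, 420, 90, 310, 75, 500, 260, 130]

# One-sided test: do rejected PRs come from a distribution with larger diffs?
stat, p = mannwhitneyu(rejected_sizes, merged_sizes, alternative="greater")
print(f"U = {stat}, p = {p:.4f}")  # a small p is consistent with size predicting rejection
```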
Practical Implications
- Tooling teams should surface complexity metrics (diff size, file count) early in the PR UI for AI‑generated PRs, prompting agents to break large changes into smaller, review‑friendly chunks.
- CI/CD pipelines can auto‑tag agent PRs that touch security‑sensitive files (e.g., security.yml, Dockerfile) for expedited security review, balancing speed with safety (a path‑matching sketch appears after this list).
- Developers can trust agents to handle routine hardening tasks (adding tests, updating docs) but should still manually verify any code that alters authentication logic or cryptographic primitives.
- Product managers might prioritize integrating agents that excel in supportive security work, freeing human engineers to focus on high‑impact vulnerability remediation.
- Open‑source maintainers can adopt a “sandbox” branch for AI‑generated PRs, allowing automated linting and static analysis before human review, reducing latency.
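As a sketch of the auto‑tagging idea, the following checks a PR's changed files against illustrative glob patterns and returns a hypothetical label for CI to act on; both the patterns and the label name are assumptions, not recommendations from the paper.

```python
from fnmatch import fnmatch

# Illustrative glob patterns for security-sensitive paths; tune per repository.
SENSITIVE_PATTERNS = ["*Dockerfile", "*security.yml", "*auth*", "*crypto*",
                      "*.github/workflows/*"]

def security_review_label(changed_files):
    """Return a hypothetical label for agent PRs that touch security-sensitive
    paths, so CI can route them to an expedited security review queue."""
    for path in changed_files:
        if any(fnmatch(path, pattern) for pattern in SENSITIVE_PATTERNS):
            return "needs-security-review"
    return None

print(security_review_label(["src/app.py", "Dockerfile"]))  # needs-security-review
print(security_review_label(["docs/README.md"]))            # None
```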
Limitations & Future Work
- Keyword‑based filtering may miss security PRs that use unconventional terminology, potentially underestimating the true volume.
- The study focuses on popular public repositories; results might differ in enterprise or highly regulated codebases where review policies are stricter.
- Manual validation, while thorough, limits scalability—future work could explore machine‑learning classifiers to flag security‑relevant changes.
- The authors suggest extending the analysis to post‑merge security outcomes (e.g., whether agent‑added tests actually catch bugs) and to newer agents that generate multi‑file refactorings.
Authors
- Mohammed Latif Siddiq
- Xinye Zhao
- Vinicius Carvalho Lopes
- Beatrice Casey
- Joanna C. S. Santos
Paper Information
- arXiv ID: 2601.00477v1
- Categories: cs.CR, cs.SE
- Published: January 1, 2026
- PDF: https://arxiv.org/pdf/2601.00477v1