[Paper] SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration
Source: arXiv - 2603.03823v1
Overview
The paper introduces SWE‑CI, a new benchmark that evaluates how well LLM‑powered coding agents can maintain real‑world codebases over time.
Instead of measuring a single "does this patch pass?" moment, SWE‑CI forces agents to work through months‑long development histories (233 days on average per task), mimicking the continuous‑integration (CI) cycles that software teams live with every day.
Key Contributions
- First repository‑level CI benchmark – 100 realistic tasks drawn from open‑source projects, each covering an average of 233 days and 71 commits.
- Long‑term maintainability focus – Shifts evaluation from one‑shot functional correctness to sustained code quality across many iterative changes.
- Multi‑round interaction protocol – Agents must perform repeated analysis, coding, testing, and debugging steps, mirroring real CI pipelines.
- Comprehensive metrics suite – Includes build success rate, test‑suite pass ratio, code‑style compliance, and regression‑induced defect count.
- Open‑source benchmark suite and evaluation harness – Enables reproducible comparisons of existing and future LLM agents.
Methodology
- Task selection – The authors mined popular GitHub repositories, extracting natural evolution windows where a new feature or bug‑fix was introduced and later refined. Each window becomes a benchmark task.
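The summary does not give the authors' exact mining procedure, but the idea of slicing a commit history into "evolution windows" can be sketched as follows. The `Commit` record, the `min_commits`/`max_gap_days` thresholds, and the gap-based grouping rule are all illustrative assumptions, not the paper's algorithm:

```python
from dataclasses import dataclass

@dataclass
class Commit:
    sha: str
    day: int  # days since repository creation (illustrative timestamp)

def evolution_windows(commits, min_commits=5, max_gap_days=30):
    """Group a chronological commit list into 'evolution windows':
    runs of commits with no gap longer than max_gap_days, keeping
    only runs long enough to represent sustained development."""
    windows, current = [], []
    for c in commits:
        if current and c.day - current[-1].day > max_gap_days:
            if len(current) >= min_commits:
                windows.append(current)
            current = []
        current.append(c)
    if len(current) >= min_commits:
        windows.append(current)
    return windows
```

On a history with five commits in one week followed by a 96‑day pause and two stragglers, this yields a single five‑commit window; in practice one would also filter windows for a coherent feature or bug‑fix theme, as the paper describes.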
- CI simulation environment – For every task, a Docker‑based CI pipeline is constructed (checkout, dependency install, test run, lint, build). The pipeline is exposed to the agent via a simple API.
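The paper's actual pipeline API is not shown in this summary, so the following is only a toy stand‑in: the five stage names mirror the list above (checkout, install, test, lint, build), but the class, its interface, and the short‑circuit-on-failure behavior are assumptions. A real implementation would shell out to `git`, a package manager, and a test runner inside Docker; here each stage is just a callable returning `(ok, log)`:

```python
class CIPipeline:
    """Toy stand-in for a Docker-based CI pipeline, exposing one
    run() call that an agent could poll for stage-by-stage feedback."""
    STAGES = ("checkout", "install", "test", "lint", "build")

    def __init__(self, stage_impls):
        # stage_impls: dict mapping stage name -> callable returning (ok, log)
        self.stage_impls = stage_impls

    def run(self):
        """Run stages in order, stop at the first failure, and return
        a report the agent can inspect to decide its next edit."""
        report = {}
        for name in self.STAGES:
            ok, log = self.stage_impls[name]()
            report[name] = {"ok": ok, "log": log}
            if not ok:
                break  # later stages never run after a red stage
        report["green"] = all(stage["ok"] for stage in report.values())
        return report
```

Stopping at the first red stage keeps feedback focused: an agent debugging a failed `test` stage is not distracted by lint output it cannot yet reach.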
- Agent interaction loop – The agent receives the current repository state and a high‑level change request (e.g., “add pagination to the API”). It can:
  - run static analysis / tests,
  - propose code edits,
  - commit changes,
  - observe CI feedback,
  - iterate until the pipeline passes or a step limit is reached.
- Evaluation metrics – Success is measured on several axes:
  - functional correctness (test suite pass),
  - build stability (no broken builds across iterations),
  - maintainability (code churn, cyclomatic complexity, lint violations),
  - regression safety (absence of newly introduced test failures).
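One plausible way to aggregate per‑round CI results into these four axes is sketched below. The field names (`build_ok`, `tests_passed`, `new_failures`, etc.) and the exact aggregation rules are illustrative assumptions; the paper's scoring may differ:

```python
def score_task(rounds):
    """Aggregate a task's per-round CI results into summary metrics.
    Each round is a dict: build_ok (bool), tests_passed / tests_total
    (ints), lint_violations (int), new_failures (int)."""
    last = rounds[-1]  # final repository state decides correctness
    return {
        # share of the test suite passing at the end of the task
        "functional_correctness": last["tests_passed"] / last["tests_total"],
        # fraction of iterations that kept the build green
        "build_stability": sum(r["build_ok"] for r in rounds) / len(rounds),
        # style debt remaining in the final state
        "lint_violations": last["lint_violations"],
        # regressions accumulated across all iterations, not just the last
        "regression_defects": sum(r["new_failures"] for r in rounds),
    }
```

Note the asymmetry: correctness and lint are judged on the final state, while stability and regressions are judged across the whole trajectory, which is what distinguishes this benchmark from one‑shot patch evaluation.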
The whole process is fully automated, allowing large‑scale comparison of different LLM agents (e.g., GPT‑4, Claude, CodeLlama) under identical conditions.
Results & Findings
| Agent (model) | Avg. CI passes per task | Avg. test‑suite pass rate | Avg. regression defects* |
|---|---|---|---|
| GPT‑4 (code‑davinci) | 4.2 / 10 rounds | 78 % | 0.9 |
| Claude‑2 | 3.8 / 10 rounds | 73 % | 1.1 |
| CodeLlama‑34B | 2.5 / 10 rounds | 61 % | 1.8 |
| Baseline (static patch) | 1.0 / 10 rounds | 45 % | 2.4 |
*Number of new failing tests introduced during the agent’s iterations.
Key takeaways
- Modern LLM agents can eventually drive a CI pipeline to green, but they often need several back‑and‑forth cycles—far more than the single‑shot fixes measured by older benchmarks.
- Even the best agents still generate regressions in ~1 out of every 10 tasks, highlighting a gap in long‑term reasoning and dependency awareness.
- Code quality metrics (e.g., cyclomatic complexity) degrade modestly across iterations, suggesting agents prioritize getting the build green over preserving architectural hygiene.
Practical Implications
- Tooling for DevOps pipelines – SWE‑CI demonstrates that LLM agents can be integrated as “assistant bots” that automatically propose fixes when a CI job fails, reducing mean‑time‑to‑repair (MTTR).
- Continuous code review augmentation – By exposing agents to the full commit history, teams can leverage them to suggest refactorings that respect existing design patterns, not just isolated patches.
- On‑demand feature prototyping – Developers can hand a high‑level spec to an LLM agent, let it iterate through the CI loop, and obtain a CI‑green candidate branch after a few automated cycles, accelerating sprint velocity.
- Benchmark‑driven model selection – Companies can now evaluate LLM providers on maintainability metrics that matter in production, choosing models that minimize regression risk.
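The "assistant bot" idea above reduces to a small dispatch rule: only failed CI events should wake the agent, scoped to the failing job's logs. The event shape (`status`, `job`, `log`) and the `start_agent` callback are invented for illustration; a real integration would parse a GitHub Actions or GitLab webhook payload instead:

```python
def on_ci_event(event, start_agent):
    """Minimal dispatch sketch for a CI 'assistant bot': when a job
    reports failure, kick off an agent run fed with that job's log;
    ignore successful runs so the bot only spends tokens on red builds."""
    if event.get("status") != "failed":
        return None
    return start_agent(job=event["job"], log=event["log"])
```

Gating on failure status is what makes the MTTR argument work: the agent's repair loop starts the moment the pipeline goes red, without a human in the triage path.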
Limitations & Future Work
- Scope of repositories – The benchmark currently focuses on Python and JavaScript projects; language‑specific nuances in compiled languages (e.g., C++) remain untested.
- CI complexity – Real‑world pipelines often involve integration tests, performance benchmarks, and security scans that are not fully captured in the current harness.
- Human‑in‑the‑loop – The study assumes fully autonomous agents; future work should explore hybrid workflows where developers intervene selectively.
- Metric granularity – While the suite tracks build success and test pass rates, deeper architectural metrics (e.g., module coupling) could provide richer insights into long‑term maintainability.
By exposing these gaps, the authors set a clear agenda for the next generation of LLM‑powered development assistants—agents that not only write code, but also keep it healthy as software evolves.
Authors
- Jialong Chen
- Xander Xu
- Hu Wei
- Chuan Chen
- Bing Zhao
Paper Information
- arXiv ID: 2603.03823v1
- Categories: cs.SE, cs.AI, cs.CL
- Published: March 4, 2026