[Paper] SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration
Source: arXiv - 2603.03823v1
Overview
The paper introduces SWE‑CI, a new benchmark that evaluates how well LLM‑powered coding agents can maintain real‑world codebases over time.
Instead of measuring a single "does this patch pass?" moment, SWE‑CI forces agents to work through months‑long development histories (233 days on average per task), mimicking the continuous‑integration (CI) cycles that software teams live with every day.
Key Contributions
- First repository‑level CI benchmark – 100 realistic tasks drawn from open‑source projects, each covering an average of 233 days and 71 commits.
- Long‑term maintainability focus – Shifts evaluation from one‑shot functional correctness to sustained code quality across many iterative changes.
- Multi‑round interaction protocol – Agents must perform repeated analysis, coding, testing, and debugging steps, mirroring real CI pipelines.
- Comprehensive metrics suite – Includes build success rate, test‑suite pass ratio, code‑style compliance, and regression‑induced defect count.
- Open‑source benchmark suite and evaluation harness – Enables reproducible comparisons of existing and future LLM agents.
Methodology
- Task selection – The authors mined popular GitHub repositories, extracting natural evolution windows where a new feature or bug‑fix was introduced and later refined. Each window becomes a benchmark task.
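The summary does not give the authors' exact mining procedure, but the idea of slicing a commit history into "evolution windows" can be sketched as follows. The `Commit` record, the `min_commits`/`max_gap_days` thresholds, and the gap-based grouping rule are all illustrative assumptions, not the paper's algorithm:

```python
from dataclasses import dataclass

@dataclass
class Commit:
    sha: str
    day: int  # days since repository creation (illustrative timestamp)

def evolution_windows(commits, min_commits=5, max_gap_days=30):
    """Group a chronological commit list into 'evolution windows':
    runs of commits with no gap longer than max_gap_days, keeping
    only runs long enough to represent sustained development."""
    windows, current = [], []
    for c in commits:
        if current and c.day - current[-1].day > max_gap_days:
            if len(current) >= min_commits:
                windows.append(current)
            current = []
        current.append(c)
    if len(current) >= min_commits:
        windows.append(current)
    return windows
```

On a history with five commits in one week followed by a 96‑day pause and two stragglers, this yields a single five‑commit window; in practice one would also filter windows for a coherent feature or bug‑fix theme, as the paper describes.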
- CI simulation environment – For every task, a Docker‑based CI pipeline is constructed (checkout, dependency install, test run, lint, build). The pipeline is exposed to the agent via a simple API.
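The paper's actual pipeline API is not shown in this summary, so the following is only a toy stand‑in: the five stage names mirror the list above (checkout, install, test, lint, build), but the class, its interface, and the short‑circuit-on-failure behavior are assumptions. A real implementation would shell out to `git`, a package manager, and a test runner inside Docker; here each stage is just a callable returning `(ok, log)`:

```python
class CIPipeline:
    """Toy stand-in for a Docker-based CI pipeline, exposing one
    run() call that an agent could poll for stage-by-stage feedback."""
    STAGES = ("checkout", "install", "test", "lint", "build")

    def __init__(self, stage_impls):
        # stage_impls: dict mapping stage name -> callable returning (ok, log)
        self.stage_impls = stage_impls

    def run(self):
        """Run stages in order, stop at the first failure, and return
        a report the agent can inspect to decide its next edit."""
        report = {}
        for name in self.STAGES:
            ok, log = self.stage_impls[name]()
            report[name] = {"ok": ok, "log": log}
            if not ok:
                break  # later stages never run after a red stage
        report["green"] = all(stage["ok"] for stage in report.values())
        return report
```

Stopping at the first red stage keeps feedback focused: an agent debugging a failed `test` stage is not distracted by lint output it cannot yet reach.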
- Agent interaction loop – The agent receives the current repository state and a high‑level change request (e.g., “add pagination to the API”). It can:
  - run static analysis / tests,
  - propose code edits,
  - commit changes,
  - observe CI feedback,
  - iterate until the pipeline passes or a step limit is reached.
- Evaluation metrics – Success is measured on several axes:
  - functional correctness (test suite pass),
  - build stability (no broken builds across iterations),
  - maintainability (code churn, cyclomatic complexity, lint violations),
  - regression safety (absence of newly introduced test failures).
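One plausible way to aggregate per‑round CI results into these four axes is sketched below. The field names (`build_ok`, `tests_passed`, `new_failures`, etc.) and the exact aggregation rules are illustrative assumptions; the paper's scoring may differ:

```python
def score_task(rounds):
    """Aggregate a task's per-round CI results into summary metrics.
    Each round is a dict: build_ok (bool), tests_passed / tests_total
    (ints), lint_violations (int), new_failures (int)."""
    last = rounds[-1]  # final repository state decides correctness
    return {
        # share of the test suite passing at the end of the task
        "functional_correctness": last["tests_passed"] / last["tests_total"],
        # fraction of iterations that kept the build green
        "build_stability": sum(r["build_ok"] for r in rounds) / len(rounds),
        # style debt remaining in the final state
        "lint_violations": last["lint_violations"],
        # regressions accumulated across all iterations, not just the last
        "regression_defects": sum(r["new_failures"] for r in rounds),
    }
```

Note the asymmetry: correctness and lint are judged on the final state, while stability and regressions are judged across the whole trajectory, which is what distinguishes this benchmark from one‑shot patch evaluation.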
The whole process is fully automated, allowing large‑scale comparison of different LLM agents (e.g., GPT‑4, Claude, CodeLlama) under identical conditions.
Results & Findings
| Agent (model) | Avg. CI passes per task | Avg. test‑suite pass rate | Avg. regression defects* |
|---|---|---|---|
| GPT‑4 (code‑davinci) | 4.2 / 10 rounds | 78 % | 0.9 |
| Claude‑2 | 3.8 / 10 rounds | 73 % | 1.1 |
| CodeLlama‑34B | 2.5 / 10 rounds | 61 % | 1.8 |
| Baseline (static patch) | 1.0 / 10 rounds | 45 % | 2.4 |
*Number of new failing tests introduced during the agent’s iterations.
Key takeaways
- Modern LLM agents can eventually drive a CI pipeline to green, but they often need several back‑and‑forth cycles—far more than the single‑shot fixes measured by older benchmarks.
- Even the best agents still generate regressions in ~1 out of every 10 tasks, highlighting a gap in long‑term reasoning and dependency awareness.
- Code quality metrics (e.g., cyclomatic complexity) degrade modestly across iterations, suggesting agents prioritize getting the build green over preserving architectural hygiene.
Practical Implications
- Tooling for DevOps pipelines – SWE‑CI demonstrates that LLM agents can be integrated as “assistant bots” that automatically propose fixes when a CI job fails, reducing mean‑time‑to‑repair (MTTR).
- Continuous code review augmentation – By exposing agents to the full commit history, teams can leverage them to suggest refactorings that respect existing design patterns, not just isolated patches.
- On‑demand feature prototyping – Developers can hand a high‑level spec to an LLM agent, let it iterate through the CI loop, and obtain a CI‑green candidate branch after a few automated cycles, accelerating sprint velocity.
- Benchmark‑driven model selection – Companies can now evaluate LLM providers on maintainability metrics that matter in production, choosing models that minimize regression risk.
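The "assistant bot" idea above reduces to a small dispatch rule: only failed CI events should wake the agent, scoped to the failing job's logs. The event shape (`status`, `job`, `log`) and the `start_agent` callback are invented for illustration; a real integration would parse a GitHub Actions or GitLab webhook payload instead:

```python
def on_ci_event(event, start_agent):
    """Minimal dispatch sketch for a CI 'assistant bot': when a job
    reports failure, kick off an agent run fed with that job's log;
    ignore successful runs so the bot only spends tokens on red builds."""
    if event.get("status") != "failed":
        return None
    return start_agent(job=event["job"], log=event["log"])
```

Gating on failure status is what makes the MTTR argument work: the agent's repair loop starts the moment the pipeline goes red, without a human in the triage path.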
Limitations & Future Work
- Scope of repositories – The benchmark currently focuses on Python and JavaScript projects; language‑specific nuances in compiled languages (e.g., C++) remain untested.
- CI complexity – Real‑world pipelines often involve integration tests, performance benchmarks, and security scans that are not fully captured in the current harness.
- Human‑in‑the‑loop – The study assumes fully autonomous agents; future work should explore hybrid workflows where developers intervene selectively.
- Metric granularity – While the suite tracks build success and test pass rates, deeper architectural metrics (e.g., module coupling) could provide richer insights into long‑term maintainability.
By exposing these gaps, the authors set a clear agenda for the next generation of LLM‑powered development assistants—agents that not only write code, but also keep it healthy as software evolves.
Authors
- Jialong Chen
- Xander Xu
- Hu Wei
- Chuan Chen
- Bing Zhao
Paper Information
- arXiv ID: 2603.03823v1
- Categories: cs.SE, cs.AI, cs.CL
- Published: March 4, 2026