[Paper] SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

Published: March 4, 2026 at 03:20 AM EST
4 min read
Source: arXiv - 2603.03823v1

Overview

The paper introduces SWE‑CI, a new benchmark that evaluates how well LLM‑powered coding agents can maintain real‑world codebases over time.
Instead of measuring a single “does this patch compile?” moment, SWE‑CI has agents work through months‑long development histories, mimicking the continuous‑integration (CI) cycles that software teams live with every day.

Key Contributions

  • First repository‑level CI benchmark – 100 realistic tasks drawn from open‑source projects, each covering an average of 233 days and 71 commits.
  • Long‑term maintainability focus – Shifts evaluation from one‑shot functional correctness to sustained code quality across many iterative changes.
  • Multi‑round interaction protocol – Agents must perform repeated analysis, coding, testing, and debugging steps, mirroring real CI pipelines.
  • Comprehensive metrics suite – Includes build success rate, test‑suite pass ratio, code‑style compliance, and regression‑induced defect count.
  • Open‑source benchmark suite and evaluation harness – Enables reproducible comparisons of existing and future LLM agents.

Methodology

  1. Task selection – The authors mined popular GitHub repositories, extracting natural evolution windows where a new feature or bug‑fix was introduced and later refined. Each window becomes a benchmark task.
  2. CI simulation environment – For every task, a Docker‑based CI pipeline is constructed (checkout, dependency install, test run, lint, build). The pipeline is exposed to the agent via a simple API.
  3. Agent interaction loop – The agent receives the current repository state and a high‑level change request (e.g., “add pagination to the API”). It can:
    • Run static analysis / tests,
    • Propose code edits,
    • Commit changes,
    • Observe CI feedback,
    • Iterate until the pipeline passes or a step limit is reached.
  4. Evaluation metrics – Success is measured on several axes:
    • Functional correctness (test suite pass),
    • Build stability (no broken builds across iterations),
    • Maintainability (code churn, cyclomatic complexity, lint violations),
    • Regression safety (absence of newly introduced test failures).
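The interaction loop described above can be sketched as follows. This is a toy illustration, not the paper's harness: `CIPipeline` and `Agent` are hypothetical stand-ins (the summary does not specify the actual API), and the "edit repairs one defect per round" behavior is a deliberate simplification.

```python
# Minimal sketch of the multi-round agent/CI interaction loop.
# CIPipeline and Agent are hypothetical stand-ins; the paper's
# real harness API is not specified in this summary.

class CIPipeline:
    """Toy pipeline: goes green once all defects are repaired."""
    def __init__(self, defects: int):
        self.defects = defects

    def run(self) -> dict:
        # A real pipeline would checkout, install deps, test, lint, build.
        return {"passed": self.defects == 0, "failing_tests": self.defects}

class Agent:
    """Toy agent: a real one would call an LLM with the CI feedback."""
    def propose_edit(self, feedback: dict) -> None:
        pass

def interaction_loop(pipeline: CIPipeline, agent: Agent, step_limit: int = 10) -> int:
    """Iterate analyse -> edit -> commit -> observe until green or limit."""
    for step in range(1, step_limit + 1):
        feedback = pipeline.run()
        if feedback["passed"]:
            return step  # number of rounds until the pipeline was green
        agent.propose_edit(feedback)
        pipeline.defects -= 1  # toy stand-in for an effective code edit
    return -1  # step limit reached without a green build

rounds = interaction_loop(CIPipeline(defects=3), Agent())  # green on round 4
```

The step limit mirrors the "10 rounds" budget visible in the results table below: an agent that cannot drive the pipeline to green within the budget fails the task.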

The whole process is fully automated, allowing large‑scale comparison of different LLM agents (e.g., GPT‑4, Claude, CodeLlama) under identical conditions.
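Automated comparison requires folding per-round CI results into per-task scores. A minimal sketch of how two of the paper's metrics (test-suite pass rate, build stability) might be aggregated; the record field names are assumptions, not the paper's schema:

```python
# Sketch: aggregating per-round CI records into task-level metrics.
# Field names (tests_total, tests_passed, build_ok) are assumptions.

def aggregate(rounds: list[dict]) -> dict:
    total = len(rounds)
    pass_rate = sum(r["tests_passed"] / r["tests_total"] for r in rounds) / total
    build_stability = sum(1 for r in rounds if r["build_ok"]) / total
    return {
        "avg_test_pass_rate": round(pass_rate, 3),   # fraction of tests passing
        "build_stability": round(build_stability, 3) # fraction of non-broken builds
    }

records = [
    {"tests_total": 100, "tests_passed": 60,  "build_ok": True},
    {"tests_total": 100, "tests_passed": 80,  "build_ok": False},
    {"tests_total": 100, "tests_passed": 100, "build_ok": True},
]
metrics = aggregate(records)  # pass rate 0.8, build stability 0.667
```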

Results & Findings

| Agent (model) | Avg. CI passes per task | Avg. test‑suite pass rate | Avg. regression defects* |
| --- | --- | --- | --- |
| GPT‑4 (code‑davinci) | 4.2 / 10 rounds | 78 % | 0.9 |
| Claude‑2 | 3.8 / 10 rounds | 73 % | 1.1 |
| CodeLlama‑34B | 2.5 / 10 rounds | 61 % | 1.8 |
| Baseline (static patch) | 1.0 / 10 rounds | 45 % | 2.4 |

*Number of new failing tests introduced during the agent’s iterations.
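Given that definition, regression defects reduce to a set difference: tests that fail after the agent's iterations but were not already failing at the task's starting commit. A minimal sketch (test names are illustrative):

```python
# Sketch: counting regression defects as newly failing tests,
# i.e. failures present after the agent's edits but not at baseline.

def regression_defects(baseline_failures: set[str], final_failures: set[str]) -> int:
    return len(final_failures - baseline_failures)

before = {"test_auth_timeout"}                    # failing at the task's start
after = {"test_auth_timeout", "test_pagination"}  # failing after agent edits
count = regression_defects(before, after)  # 1 -> test_pagination is new
```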

Key takeaways

  • Modern LLM agents can eventually drive a CI pipeline to green, but they often need several back‑and‑forth cycles—far more than the single‑shot fixes measured by older benchmarks.
  • Even the best agents still generate regressions in ~1 out of every 10 tasks, highlighting a gap in long‑term reasoning and dependency awareness.
  • Code quality metrics (e.g., cyclomatic complexity) degrade modestly across iterations, suggesting agents prioritize getting the build green over preserving architectural hygiene.

Practical Implications

  • Tooling for DevOps pipelines – SWE‑CI demonstrates that LLM agents can be integrated as “assistant bots” that automatically propose fixes when a CI job fails, reducing mean‑time‑to‑repair (MTTR).
  • Continuous code review augmentation – By exposing agents to the full commit history, teams can leverage them to suggest refactorings that respect existing design patterns, not just isolated patches.
  • On‑demand feature prototyping – Developers can hand a high‑level spec to an LLM agent, let it iterate through the CI loop, and obtain a production‑ready branch after a few automated cycles, accelerating sprint velocity.
  • Benchmark‑driven model selection – Companies can now evaluate LLM providers on maintainability metrics that matter in production, choosing models that minimize regression risk.
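The "assistant bot" idea from the first bullet could be wired up roughly as below. This is a sketch under stated assumptions: `call_llm` and `post_review_comment` are hypothetical placeholders standing in for a model API and a code-forge API, neither of which the paper prescribes.

```python
# Sketch of a CI "assistant bot": on a failed job, send the failure
# log to an LLM and post its suggested fix on the pull request.
# call_llm and post_review_comment are hypothetical stubs.

def call_llm(prompt: str) -> str:
    # Stub: a real bot would call a model API here.
    return "suggested patch for: " + prompt.splitlines()[0]

def post_review_comment(pr_id: int, body: str) -> dict:
    # Stub: a real bot would call the forge's REST API here.
    return {"pr": pr_id, "comment": body}

def on_ci_failure(pr_id: int, log: str) -> dict:
    prompt = f"{log}\n\nPropose a minimal fix for the failing job."
    suggestion = call_llm(prompt)
    return post_review_comment(pr_id, suggestion)

result = on_ci_failure(42, "FAILED tests/test_api.py::test_pagination")
```

In practice such a bot would be triggered by a CI webhook and would propose a patch rather than merge it, keeping a human reviewer in the loop.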

Limitations & Future Work

  • Scope of repositories – The benchmark currently focuses on Python and JavaScript projects; language‑specific nuances in compiled languages (e.g., C++) remain untested.
  • CI complexity – Real‑world pipelines often involve integration tests, performance benchmarks, and security scans that are not fully captured in the current harness.
  • Human‑in‑the‑loop – The study assumes fully autonomous agents; future work should explore hybrid workflows where developers intervene selectively.
  • Metric granularity – While the suite tracks build success and test pass rates, deeper architectural metrics (e.g., module coupling) could provide richer insights into long‑term maintainability.

By exposing these gaps, the authors set a clear agenda for the next generation of LLM‑powered development assistants—agents that not only write code, but also keep it healthy as software evolves.

Authors

  • Jialong Chen
  • Xander Xu
  • Hu Wei
  • Chuan Chen
  • Bing Zhao

Paper Information

  • arXiv ID: 2603.03823v1
  • Categories: cs.SE, cs.AI, cs.CL
  • Published: March 4, 2026