[Paper] What's in a Benchmark? The Case of SWE-Bench in Automated Program Repair

Published: February 4, 2026, 06:19 AM EST
4 min read
Source: arXiv


Overview

The paper presents the first systematic audit of SWE‑Bench, the de facto benchmark for Automated Program Repair (APR), built from real‑world Python bugs in popular open‑source projects. By dissecting the two public leaderboards (SWE‑Bench Lite and SWE‑Bench Verified), the authors reveal who is building the top‑performing repair tools, which language models those tools rely on, and how open or proprietary the resulting solutions are. The findings expose a strong industry tilt, a near‑monopoly of Claude‑family LLMs, and a surprisingly competitive academic presence.

Key Contributions

  • Comprehensive leaderboard analysis – examined 79 entries on Lite and 133 on Verified, covering submitter identity, company size, and model usage.
  • Industry dominance quantified – showed that small startups and large publicly traded firms together contribute the majority of high‑scoring entries.
  • LLM landscape mapping – identified Claude 4 Sonnet as the current state‑of‑the‑art model for APR on SWE‑Bench, with proprietary models vastly out‑performing open‑source alternatives.
  • Open‑source vs. proprietary trade‑off – highlighted that while academic, open‑source submissions remain competitive, they rarely top the leaderboards.
  • Transparency recommendations – offered concrete suggestions for benchmark designers and the APR community to encourage more diverse and reproducible research.

Methodology

  1. Data collection – scraped all public submissions from the two SWE‑Bench leaderboards, extracting metadata such as submitter name, affiliated organization, reported LLM, and whether the code was released under an open‑source license.
  2. Categorization – classified submitters into “industry” (further split into small companies, large public firms, and startups) and “academia”. LLMs were grouped as proprietary (e.g., Claude, GPT‑4) or open‑source (e.g., LLaMA, StarCoder).
  3. Statistical analysis – computed frequency distributions, median scores, and rank‑based performance gaps across categories.
  4. Qualitative review – inspected README files and accompanying papers to assess the openness of the approach (e.g., availability of model weights, inference pipelines).

All steps were performed using Python notebooks and visualized with seaborn/matplotlib, keeping the pipeline reproducible for future audits.
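The categorization and statistical steps above can be sketched in plain Python. The entries, field names, and scores below are illustrative placeholders, not data from the paper; the paper's actual pipeline used notebooks with seaborn/matplotlib.

```python
from collections import Counter
from statistics import median

# Hypothetical leaderboard entries: (submitter_type, model, score, open_source).
# Values are illustrative only, not the paper's real data.
entries = [
    ("industry", "Claude 4 Sonnet", 0.73, False),
    ("industry", "Claude 4 Sonnet", 0.70, False),
    ("academia", "GPT-4", 0.66, True),
    ("academia", "StarCoder", 0.55, True),
]

# Step 2: frequency distribution of submitter categories
category_counts = Counter(kind for kind, _, _, _ in entries)

# Step 3: median score per category
def median_score(kind: str) -> float:
    return median(score for k, _, score, _ in entries if k == kind)

print(category_counts)
print(median_score("industry"))  # median of 0.73 and 0.70 -> 0.715
```

The same `Counter`/`median` pattern extends to rank-based gap analysis by sorting entries on score before grouping.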

Results & Findings

  • Industry leads the pack – 68 % of Lite and 71 % of Verified submissions come from companies; within that, small firms (≤ 200 employees) account for ~45 % of top‑10 entries.
  • Claude 4 Sonnet dominates – it appears in 57 % of all submissions and holds the highest average repair score (0.73 on Lite, 0.68 on Verified).
  • Open‑source LLMs lag – the best open‑source model (StarCoder) achieves roughly 0.55 average score, a gap of 15‑20 percentage points behind Claude.
  • Academic entries are still viable – the highest‑ranking academic submission (using GPT‑4 with a custom prompt) placed 4th on Verified, showing that clever engineering can offset resource gaps.
  • Transparency is mixed – only 22 % of all entries provide full reproducible pipelines; the rest rely on proprietary APIs or undisclosed prompts.
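To make the Claude-vs-StarCoder gap concrete, a quick calculation from the reported averages (0.73 on Lite for Claude 4 Sonnet, ~0.55 for StarCoder) shows the absolute and relative views of the same gap:

```python
# Reported average repair scores (SWE-Bench Lite)
claude_lite = 0.73   # Claude 4 Sonnet
starcoder = 0.55     # best open-source model

abs_gap = claude_lite - starcoder   # absolute gap in score units
rel_gap = abs_gap / claude_lite     # relative gap vs. the leader

print(f"{abs_gap:.2f} absolute ({rel_gap:.1%} relative)")  # 0.18 absolute (24.7% relative)
```

So the gap is about 18 percentage points in absolute terms, or roughly a quarter of Claude's score in relative terms, depending on which framing one prefers.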

Practical Implications

  • Tool builders should consider integrating Claude‑family APIs if they need state‑of‑the‑art repair performance, but must weigh cost and vendor lock‑in.
  • Open‑source advocates can focus on improving prompt engineering, retrieval‑augmented generation, or hybrid pipelines to close the performance gap without paying for proprietary models.
  • Benchmark designers might add “openness” as a secondary metric, encouraging submissions that publish prompts, model checkpoints, and evaluation scripts.
  • Product teams can use the paper’s taxonomy to benchmark their own APR pipelines against industry baselines, identifying whether they’re competing on raw performance or on transparency/reproducibility.
  • Investors and hiring managers get a data‑driven view of where APR talent is concentrated—primarily in small to mid‑size AI startups and large tech firms—informing recruitment strategies.

Limitations & Future Work

  • The analysis is limited to publicly visible leaderboard entries; private or internal APR experiments remain unaccounted for.
  • Performance metrics are tied to SWE‑Bench’s specific scoring function, which may not capture all dimensions of repair quality (e.g., runtime, maintainability).
  • The study does not evaluate the impact of prompt engineering depth, which could be a confounding factor behind the success of proprietary LLMs.
  • Future work could extend the audit to other languages (e.g., Java, JavaScript), incorporate longitudinal trends, and propose a standardized “openness score” for APR benchmarks.
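The "openness score" floated above could be as simple as a weighted checklist over an entry's released artifacts. The criteria names and equal weights below are an assumption for illustration; the paper does not define a concrete formula.

```python
# Hypothetical openness score for an APR benchmark entry.
# Criteria and weights are illustrative, not from the paper.
CRITERIA = {
    "prompts_published": 0.25,
    "weights_available": 0.25,
    "pipeline_released": 0.25,
    "eval_scripts_released": 0.25,
}

def openness_score(entry: dict) -> float:
    """Sum the weights of every openness criterion the entry satisfies."""
    return sum(w for key, w in CRITERIA.items() if entry.get(key, False))

# An entry that publishes prompts and its pipeline, but not weights or scripts
example = {"prompts_published": True, "pipeline_released": True}
print(openness_score(example))  # 0.5
```

A benchmark could report this alongside the repair score, letting submitters trade off raw performance against reproducibility in a visible way.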

Authors

  • Matias Martinez
  • Xavier Franch

Paper Information

  • arXiv ID: 2602.04449v1
  • Categories: cs.SE
  • Published: February 4, 2026
