[Paper] What's in a Benchmark? The Case of SWE-Bench in Automated Program Repair

Published: February 4, 2026, 06:19 AM EST
4 min read
Source: arXiv


Overview

The paper presents the first systematic audit of SWE‑Bench, the de facto benchmark for Automated Program Repair (APR), built from real‑world Python bugs in popular open‑source projects. By dissecting the two public leaderboards (SWE‑Bench Lite and SWE‑Bench Verified), the authors reveal who is building the top‑performing repair tools, which language models those tools rely on, and how open or proprietary the resulting solutions are. The findings expose a strong industry tilt, a near‑monopoly of Claude‑family LLMs, and a surprisingly competitive academic presence.

Key Contributions

  • Comprehensive leaderboard analysis – examined 79 entries on Lite and 133 on Verified, covering submitter identity, company size, and model usage.
  • Industry dominance quantified – showed that small startups and large publicly traded firms together contribute the majority of high‑scoring entries.
  • LLM landscape mapping – identified Claude 4 Sonnet as the current state‑of‑the‑art model for APR on SWE‑Bench, with proprietary models vastly out‑performing open‑source alternatives.
  • Open‑source vs. proprietary trade‑off – highlighted that while academic, open‑source submissions remain competitive, they rarely top the leaderboards.
  • Transparency recommendations – offered concrete suggestions for benchmark designers and the APR community to encourage more diverse and reproducible research.

Methodology

  1. Data collection – scraped all public submissions from the two SWE‑Bench leaderboards, extracting metadata such as submitter name, affiliated organization, reported LLM, and whether the code was released under an open‑source license.
  2. Categorization – classified submitters into “industry” (further split into small companies, large public firms, and startups) and “academia”. LLMs were grouped as proprietary (e.g., Claude, GPT‑4) or open‑source (e.g., LLaMA, StarCoder).
  3. Statistical analysis – computed frequency distributions, median scores, and rank‑based performance gaps across categories.
  4. Qualitative review – inspected README files and accompanying papers to assess the openness of the approach (e.g., availability of model weights, inference pipelines).

All steps were performed using Python notebooks and visualized with seaborn/matplotlib, keeping the pipeline reproducible for future audits.
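The categorization and statistical steps above can be sketched in plain Python. The entries, field names, and scores below are illustrative placeholders, not data from the paper; the paper's actual pipeline used notebooks with seaborn/matplotlib.

```python
from collections import Counter
from statistics import median

# Hypothetical leaderboard entries: (submitter_type, model, score, open_source).
# Values are illustrative only, not the paper's real data.
entries = [
    ("industry", "Claude 4 Sonnet", 0.73, False),
    ("industry", "Claude 4 Sonnet", 0.70, False),
    ("academia", "GPT-4", 0.66, True),
    ("academia", "StarCoder", 0.55, True),
]

# Step 2: frequency distribution of submitter categories
category_counts = Counter(kind for kind, _, _, _ in entries)

# Step 3: median score per category
def median_score(kind: str) -> float:
    return median(score for k, _, score, _ in entries if k == kind)

print(category_counts)
print(median_score("industry"))  # median of 0.73 and 0.70 -> 0.715
```

The same `Counter`/`median` pattern extends to rank-based gap analysis by sorting entries on score before grouping.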

Results & Findings

  • Industry leads the pack – 68 % of Lite and 71 % of Verified submissions come from companies; within that, small firms (≤ 200 employees) account for ~45 % of top‑10 entries.
  • Claude 4 Sonnet dominates – it appears in 57 % of all submissions and holds the highest average repair score (0.73 on Lite, 0.68 on Verified).
  • Open‑source LLMs lag – the best open‑source model (StarCoder) achieves roughly 0.55 average score, a gap of 15‑20 percentage points behind Claude.
  • Academic entries are still viable – the highest‑ranking academic submission (using GPT‑4 with a custom prompt) placed 4th on Verified, showing that clever engineering can offset resource gaps.
  • Transparency is mixed – only 22 % of all entries provide full reproducible pipelines; the rest rely on proprietary APIs or undisclosed prompts.
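To make the Claude-vs-StarCoder gap concrete, a quick calculation from the reported averages (0.73 on Lite for Claude 4 Sonnet, ~0.55 for StarCoder) shows the absolute and relative views of the same gap:

```python
# Reported average repair scores (SWE-Bench Lite)
claude_lite = 0.73   # Claude 4 Sonnet
starcoder = 0.55     # best open-source model

abs_gap = claude_lite - starcoder   # absolute gap in score units
rel_gap = abs_gap / claude_lite     # relative gap vs. the leader

print(f"{abs_gap:.2f} absolute ({rel_gap:.1%} relative)")  # 0.18 absolute (24.7% relative)
```

So the gap is about 18 percentage points in absolute terms, or roughly a quarter of Claude's score in relative terms, depending on which framing one prefers.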

Practical Implications

  • Tool builders should consider integrating Claude‑family APIs if they need state‑of‑the‑art repair performance, but must weigh cost and vendor lock‑in.
  • Open‑source advocates can focus on improving prompt engineering, retrieval‑augmented generation, or hybrid pipelines to close the performance gap without paying for proprietary models.
  • Benchmark designers might add “openness” as a secondary metric, encouraging submissions that publish prompts, model checkpoints, and evaluation scripts.
  • Product teams can use the paper’s taxonomy to benchmark their own APR pipelines against industry baselines, identifying whether they’re competing on raw performance or on transparency/reproducibility.
  • Investors and hiring managers get a data‑driven view of where APR talent is concentrated—primarily in small to mid‑size AI startups and large tech firms—informing recruitment strategies.

Limitations & Future Work

  • The analysis is limited to publicly visible leaderboard entries; private or internal APR experiments remain unaccounted for.
  • Performance metrics are tied to SWE‑Bench’s specific scoring function, which may not capture all dimensions of repair quality (e.g., runtime, maintainability).
  • The study does not evaluate the impact of prompt engineering depth, which could be a confounding factor behind the success of proprietary LLMs.
  • Future work could extend the audit to other languages (e.g., Java, JavaScript), incorporate longitudinal trends, and propose a standardized “openness score” for APR benchmarks.
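The "openness score" floated above could be as simple as a weighted checklist over an entry's released artifacts. The criteria names and equal weights below are an assumption for illustration; the paper does not define a concrete formula.

```python
# Hypothetical openness score for an APR benchmark entry.
# Criteria and weights are illustrative, not from the paper.
CRITERIA = {
    "prompts_published": 0.25,
    "weights_available": 0.25,
    "pipeline_released": 0.25,
    "eval_scripts_released": 0.25,
}

def openness_score(entry: dict) -> float:
    """Sum the weights of every openness criterion the entry satisfies."""
    return sum(w for key, w in CRITERIA.items() if entry.get(key, False))

# An entry that publishes prompts and its pipeline, but not weights or scripts
example = {"prompts_published": True, "pipeline_released": True}
print(openness_score(example))  # 0.5
```

A benchmark could report this alongside the repair score, letting submitters trade off raw performance against reproducibility in a visible way.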

Authors

  • Matias Martinez
  • Xavier Franch

Paper Information

  • arXiv ID: 2602.04449v1
  • Categories: cs.SE
  • Published: February 4, 2026
