[Paper] SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories

Published: December 19, 2025 at 05:16 AM EST
4 min read

Source: arXiv - 2512.17419v1

Overview

SWE‑Bench++ is a new framework that automatically turns real pull requests from open‑source GitHub projects into reproducible, repository‑level coding tasks. By harvesting live bug‑fixes and feature requests across 11 programming languages, it creates a scalable, multilingual benchmark for evaluating large language models (LLMs) on realistic software‑engineering problems.

Key Contributions

  • Automated benchmark pipeline: End‑to‑end system that extracts, builds, and validates tasks from live GitHub PRs without manual curation.
  • Multilingual coverage: Generates tasks for 11 languages (Python, JavaScript, Java, Go, Rust, etc.), expanding beyond the Python‑centric focus of prior benchmarks.
  • Execution‑based evaluation: Each task includes an automatically synthesized environment and test oracle, enabling pass@k‑style metrics that reflect real build‑and‑test outcomes (a pass@k sketch follows this list).
  • Hint‑guided trajectory synthesis: Converts hard instances (where strong models fail) into step‑by‑step “hints” that can be used for fine‑tuning or curriculum learning.
  • Large‑scale dataset: 11,133 instances from 3,971 repositories, with a curated subset of 1,782 instances used for thorough evaluation.
  • Demonstrated impact: Fine‑tuning on SWE‑Bench++ improves performance on the existing SWE‑Bench Multilingual benchmark, showing the data’s utility for model adaptation.
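
For readers unfamiliar with pass@k: the summary does not say which estimator the authors use, so the sketch below shows the standard unbiased estimator from the Codex paper as a reference point, assuming n sampled patches per task of which c pass the test oracle.

```python
# Minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021).
# Assumption: SWE-Bench++ scores tasks by execution outcome; whether the paper
# uses this exact estimator is not stated in this summary.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled patches passes,
    given n total samples of which c passed the test oracle."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 passing patches out of 20 samples, evaluated at k=10.
print(round(pass_at_k(n=20, c=3, k=10), 4))
```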

Methodology

  1. Programmatic Sourcing – The pipeline queries the GitHub API to collect recently merged PRs that contain either bug‑fix or feature‑addition changes (see the sourcing sketch after this list).
  2. Environment Synthesis – For each repository, the system automatically resolves dependencies (e.g., requirements.txt, package.json, Cargo.toml) and builds a Docker container that mirrors the original development environment.
  3. Test Oracle Extraction – Existing CI test suites (e.g., pytest, JUnit, go test) are harvested and repurposed as ground‑truth oracles. The framework records the pre‑PR test results (failing) and the post‑PR results (passing); see the fail‑to‑pass sketch below.
  4. Quality Assurance – Automated checks verify that the PR can be cleanly applied, that tests are deterministic, and that the task is self‑contained (no external secrets).
  5. Hint‑Guided Trajectory Synthesis – For instances where top‑tier models (Claude‑Sonnet‑4.5, GPT‑5, Gemini‑2.5‑Pro) fail to pass, the system extracts intermediate commit diffs and comments to create a step‑wise “hint” sequence that can be used for curriculum‑style fine‑tuning.
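
As a concrete illustration of step 1, the sketch below queries the GitHub Search API for recently merged, bug‑labelled pull requests in a given language. The endpoint and search qualifiers are real GitHub API features, but the specific filters and helper names are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the "programmatic sourcing" step: search GitHub for
# recently merged PRs that look like bug fixes. Requires a GITHUB_TOKEN
# environment variable; query filters are illustrative only.
import os
import requests

GITHUB_SEARCH = "https://api.github.com/search/issues"

def fetch_merged_prs(language: str, since: str, per_page: int = 30) -> list[dict]:
    """Return recently merged, bug-labelled PRs for one language."""
    query = f"is:pr is:merged label:bug language:{language} merged:>={since}"
    resp = requests.get(
        GITHUB_SEARCH,
        params={"q": query, "sort": "updated", "per_page": per_page},
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["items"]

# Example: harvest candidate bug-fix PRs merged this year in Rust projects.
for pr in fetch_merged_prs("rust", "2025-01-01"):
    print(pr["repository_url"], pr["number"], pr["title"])
```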

The entire pipeline runs continuously, allowing the benchmark to stay up‑to‑date with the evolving open‑source ecosystem.
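
The fail‑to‑pass check at the heart of steps 2-4 can be sketched as follows: run the harvested test command inside the task's container before and after applying the PR patch, and keep the instance only if the suite flips from failing to passing. Image names, mount paths, and the test command below are hypothetical placeholders, not the paper's harness.

```python
# Minimal sketch of a fail-to-pass oracle check, assuming each task ships a
# Docker image with the build toolchain and a host checkout of the repository.
import subprocess

def run_tests(image: str, repo_dir: str, test_cmd: str, patch: str | None = None) -> bool:
    """Return True if the test command exits 0 inside the task container."""
    script = f"cd /repo && {test_cmd}"
    if patch is not None:
        # Apply the gold (PR) patch before running the suite.
        script = f"cd /repo && git apply /patch.diff && {test_cmd}"
    cmd = ["docker", "run", "--rm", "-v", f"{repo_dir}:/repo"]
    if patch is not None:
        cmd += ["-v", f"{patch}:/patch.diff:ro"]
    cmd += [image, "bash", "-lc", script]
    return subprocess.run(cmd, capture_output=True).returncode == 0

def is_valid_task(image: str, repo_dir: str, test_cmd: str, gold_patch: str) -> bool:
    """Keep the instance only if tests fail pre-patch and pass post-patch."""
    failed_before = not run_tests(image, repo_dir, test_cmd)
    passed_after = run_tests(image, repo_dir, test_cmd, patch=gold_patch)
    return failed_before and passed_after
```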

Results & Findings

  • Model performance (pass@10 on the 1,782‑instance subset):
    • Claude‑Sonnet‑4.5: 36.20 %
    • GPT‑5 (2025‑08‑07): 34.57 %
    • Gemini‑2.5‑Pro: 24.92 %
    • GPT‑4o: 16.89 %
  • Multilingual gap: All models perform noticeably worse on non‑Python languages, highlighting a current limitation in LLM code capabilities.
  • Fine‑tuning benefit: Models fine‑tuned on SWE‑Bench++ instances achieve measurable gains (up to ~5 % absolute improvement) on the separate SWE‑Bench Multilingual benchmark, confirming that the generated data is effective for adaptation.
  • Scalability proof: The pipeline produced >11k high‑quality tasks with minimal human oversight, demonstrating that benchmark growth can keep pace with the rapid expansion of open‑source codebases.

Practical Implications

  • More realistic evaluation: Developers can now benchmark LLMs against tasks that mirror actual pull‑request workflows—complete with build steps, dependency resolution, and test suites—rather than synthetic “toy” problems.
  • Multilingual tooling: Companies with polyglot codebases (e.g., micro‑services written in Go, Rust, and JavaScript) gain a benchmark that reflects their stack, enabling better model selection and targeted fine‑tuning.
  • Curriculum learning pipelines: The hint‑guided trajectories can be fed into continuous‑learning pipelines, allowing teams to iteratively improve model performance on the hardest real‑world bugs.
  • Open‑source contribution loop: Since the benchmark is derived from live PRs, improvements in model‑generated patches can be fed back to the originating repositories, potentially automating low‑risk contributions.
  • Tooling integration: The Docker‑based environment synthesis aligns with CI/CD pipelines, making it straightforward to plug the benchmark into existing testing infrastructure for automated regression testing of new LLM releases (a minimal CI sketch follows this list).
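
As a rough illustration of that last point, a nightly CI job could evaluate a new model release's patches against each task's containerized oracle and gate on the aggregate pass rate. The JSONL schema, the assumption that each task image bundles its repository at /repo, and the threshold below are illustrative, not part of the released framework.

```python
# Hypothetical CI regression gate: apply each model-generated patch inside its
# task image, run the task's tests, and fail the job if the pass rate drops
# below a chosen bar. Assumes the image contains the repo checkout at /repo.
import json
import subprocess
import sys

PASS_RATE_THRESHOLD = 0.30  # illustrative regression bar

def evaluate_patch(image: str, patch_file: str, test_cmd: str) -> bool:
    """Apply a candidate patch inside the task image and run its test suite."""
    script = f"cd /repo && git apply /patch.diff && {test_cmd}"
    result = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{patch_file}:/patch.diff:ro",
         image, "bash", "-lc", script],
        capture_output=True,
    )
    return result.returncode == 0

def main(tasks_path: str) -> None:
    passed = total = 0
    with open(tasks_path) as fh:
        for line in fh:
            # Expected fields (assumed): {"image": ..., "patch": ..., "test_cmd": ...}
            task = json.loads(line)
            total += 1
            passed += evaluate_patch(task["image"], task["patch"], task["test_cmd"])
    rate = passed / max(total, 1)
    print(f"pass rate: {rate:.2%} ({passed}/{total})")
    sys.exit(0 if rate >= PASS_RATE_THRESHOLD else 1)

if __name__ == "__main__":
    main(sys.argv[1])
```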

Limitations & Future Work

  • Quality variance: Not all PRs have comprehensive test suites; some tasks rely on minimal or flaky tests, which can affect evaluation reliability.
  • Language bias: Although 11 languages are covered, the distribution is still skewed toward popular ecosystems (Python, JavaScript). Rare languages may remain under‑represented.
  • Static analysis missing: The current pipeline does not incorporate static analysis or security checks, which are important for production‑grade code generation.
  • Future directions: The authors plan to (1) integrate richer static‑analysis oracles, (2) expand to additional languages and frameworks, (3) add a “human‑in‑the‑loop” verification stage for edge‑case PRs, and (4) open‑source the pipeline itself to enable community‑driven benchmark extensions.

Authors

  • Lilin Wang
  • Lucas Ramalho
  • Alan Celestino
  • Phuc Anthony Pham
  • Yu Liu
  • Umang Kumar Sinha
  • Andres Portillo
  • Onassis Osunwa
  • Gabriel Maduekwe

Paper Information

  • arXiv ID: 2512.17419v1
  • Categories: cs.SE, cs.AI, cs.CL, cs.LG
  • Published: December 19, 2025