[Paper] SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories
Source: arXiv - 2512.17419v1
Overview
SWE‑Bench++ is a new framework that automatically turns real pull requests from open‑source GitHub projects into reproducible, repository‑level coding tasks. By harvesting live bug fixes and feature requests across 11 programming languages, it creates a scalable, multilingual benchmark for evaluating large language models (LLMs) on realistic software‑engineering problems.
Key Contributions
- Automated benchmark pipeline: End‑to‑end system that extracts, builds, and validates tasks from live GitHub PRs without manual curation.
- Multilingual coverage: Generates tasks for 11 languages (Python, JavaScript, Java, Go, Rust, etc.), expanding beyond the Python‑centric focus of prior benchmarks.
- Execution‑based evaluation: Each task includes an automatically synthesized environment and test oracle, enabling pass@k‑style metrics that reflect real build‑and‑test outcomes (a minimal pass@k sketch follows this list).
- Hint‑guided trajectory synthesis: Converts hard instances (where strong models fail) into step‑by‑step “hints” that can be used for fine‑tuning or curriculum learning.
- Large‑scale dataset: 11,133 instances from 3,971 repositories, with a curated subset of 1,782 instances used for thorough evaluation.
- Demonstrated impact: Fine‑tuning on SWE‑Bench++ improves performance on the existing SWE‑Bench Multilingual benchmark, showing the data’s utility for model adaptation.
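This summary does not spell out how the pass@k numbers reported later are computed; the sketch below assumes the standard unbiased estimator (Chen et al., 2021) over n samples per task, of which c pass the task's test oracle. The per-task counts are illustrative, not taken from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a task
    c: samples that pass the task's test oracle
    k: attempt budget being scored
    """
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: (n, c) per task, averaged over the benchmark.
results = [(10, 4), (10, 0), (10, 1)]
score = sum(pass_at_k(n, c, k=10) for n, c in results) / len(results)
print(f"pass@10 = {score:.2%}")
```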
Methodology
- Programmatic Sourcing – The pipeline queries the GitHub API to collect recent merged PRs that contain either bug‑fix or feature‑addition changes (a hedged sourcing sketch follows this list).
- Environment Synthesis – For each repository, the system automatically resolves dependencies (e.g., requirements.txt, package.json, Cargo.toml) and builds a Docker container that mirrors the original development environment.
- Test Oracle Extraction – Existing CI test suites (e.g., pytest, JUnit, go test) are harvested and repurposed as ground‑truth oracles. The framework records the pre‑PR test results (failing) and the post‑PR results (passing); a sketch of this fail‑to‑pass check also follows the list.
- Quality Assurance – Automated checks verify that the PR can be cleanly applied, that tests are deterministic, and that the task is self‑contained (no external secrets).
- Hint‑Guided Trajectory Synthesis – For instances where top‑tier models (Claude‑Sonnet‑4.5, GPT‑5, Gemini‑2.5‑Pro) fail to pass, the system extracts intermediate commit diffs and comments to create a step‑wise “hint” sequence that can be used for curriculum‑style fine‑tuning.
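The paper's sourcing code is not reproduced here, so the following is only a minimal sketch of how the Programmatic Sourcing step could query GitHub's REST search API for recently merged bug‑fix PRs. The query qualifiers, pagination, and result handling are assumptions, not the authors' exact filters.

```python
import os
import requests

# Sketch: search GitHub for recently merged PRs labelled as bug fixes.
GITHUB_API = "https://api.github.com/search/issues"
QUERY = "is:pr is:merged label:bug language:python merged:>=2025-01-01"  # illustrative filters

def fetch_merged_prs(page: int = 1, per_page: int = 50) -> list[dict]:
    resp = requests.get(
        GITHUB_API,
        params={"q": QUERY, "page": page, "per_page": per_page, "sort": "updated"},
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["items"]

for pr in fetch_merged_prs():
    # repository_url and number are enough to later fetch the diff and linked issue.
    print(pr["repository_url"], pr["number"], pr["title"])
```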
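The Test Oracle Extraction and Quality Assurance bullets boil down to a fail‑to‑pass check: the harvested test command must fail at the pre‑PR commit and pass deterministically at the post‑PR commit. The sketch below illustrates that idea under assumed repo paths, commits, and test command; it is not the authors' pipeline.

```python
import subprocess

def run_tests(repo_dir: str, commit: str, test_cmd: list[str]) -> bool:
    """Check out a commit and report whether the harvested test command passes."""
    subprocess.run(["git", "-C", repo_dir, "checkout", "--force", commit], check=True)
    return subprocess.run(test_cmd, cwd=repo_dir).returncode == 0

def validate_instance(repo_dir: str, base_commit: str, merge_commit: str,
                      test_cmd: list[str]) -> bool:
    """Fail-to-pass oracle: tests fail before the PR and pass (twice, for determinism) after it."""
    fails_before = not run_tests(repo_dir, base_commit, test_cmd)
    passes_after = run_tests(repo_dir, merge_commit, test_cmd)
    deterministic = passes_after and run_tests(repo_dir, merge_commit, test_cmd)
    return fails_before and passes_after and deterministic

# Hypothetical usage; the test command would normally come from the repo's CI config.
ok = validate_instance("/tmp/some-repo", "abc123", "def456", ["pytest", "-x", "tests/"])
print("keep instance" if ok else "discard instance")
```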
The entire pipeline runs continuously, allowing the benchmark to stay up‑to‑date with the evolving open‑source ecosystem.
Results & Findings
- Model performance (pass@10 on the 1,782‑instance subset):
- Claude‑Sonnet‑4.5: 36.20 %
- GPT‑5 (2025‑08‑07): 34.57 %
- Gemini‑2.5‑Pro: 24.92 %
- GPT‑4o: 16.89 %
- Multilingual gap: All models perform noticeably worse on non‑Python languages, highlighting a current limitation in LLM code capabilities.
- Fine‑tuning benefit: Models fine‑tuned on SWE‑Bench++ instances achieve measurable gains (up to ~5 % absolute improvement) on the separate SWE‑Bench Multilingual benchmark, confirming that the generated data is effective for adaptation.
- Scalability proof: The pipeline produced >11k high‑quality tasks with minimal human oversight, demonstrating that benchmark growth can keep pace with the rapid expansion of open‑source codebases.
Practical Implications
- More realistic evaluation: Developers can now benchmark LLMs against tasks that mirror actual pull‑request workflows—complete with build steps, dependency resolution, and test suites—rather than synthetic “toy” problems.
- Multilingual tooling: Companies with polyglot codebases (e.g., micro‑services written in Go, Rust, and JavaScript) gain a benchmark that reflects their stack, enabling better model selection and targeted fine‑tuning.
- Curriculum learning pipelines: The hint‑guided trajectories can be fed into continuous‑learning pipelines, allowing teams to iteratively improve model performance on the hardest real‑world bugs.
- Open‑source contribution loop: Since the benchmark is derived from live PRs, improvements in model‑generated patches can be fed back to the originating repositories, potentially automating low‑risk contributions.
- Tooling integration: The Docker‑based environment synthesis aligns with CI/CD pipelines, making it straightforward to plug the benchmark into existing testing infrastructure for automated regression testing of new LLM releases (see the sketch below).
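As a concrete illustration of the tooling‑integration point, one way a CI job could score a model‑generated patch is to apply it inside the task's container and run the bundled test oracle. The image name, mount path, and test command below are assumptions for illustration, not part of the released benchmark.

```python
import subprocess

def evaluate_patch(image: str, patch_path: str, test_cmd: str) -> bool:
    """Apply a candidate patch inside the task's container and run its test oracle."""
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{patch_path}:/tmp/candidate.patch:ro",  # mount the model's patch read-only
        image,
        "bash", "-lc",
        f"git apply /tmp/candidate.patch && {test_cmd}",
    ]
    return subprocess.run(cmd).returncode == 0

# Hypothetical image tag and test command for a single benchmark instance.
passed = evaluate_patch(
    image="swe-bench-pp/task-1234:latest",
    patch_path="/artifacts/model_patch.diff",
    test_cmd="pytest -q tests/",
)
print("PASS" if passed else "FAIL")
```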
Limitations & Future Work
- Quality variance: Not all PRs have comprehensive test suites; some tasks rely on minimal or flaky tests, which can affect evaluation reliability.
- Language bias: Although 11 languages are covered, the distribution is still skewed toward popular ecosystems (Python, JavaScript). Rare languages may remain under‑represented.
- Static analysis missing: The current pipeline does not incorporate static analysis or security checks, which are important for production‑grade code generation.
- Future directions: The authors plan to (1) integrate richer static‑analysis oracles, (2) expand to additional languages and frameworks, (3) add a “human‑in‑the‑loop” verification stage for edge‑case PRs, and (4) open‑source the pipeline itself to enable community‑driven benchmark extensions.
Authors
- Lilin Wang
- Lucas Ramalho
- Alan Celestino
- Phuc Anthony Pham
- Yu Liu
- Umang Kumar Sinha
- Andres Portillo
- Onassis Osunwa
- Gabriel Maduekwe
Paper Information
- arXiv ID: 2512.17419v1
- Categories: cs.SE, cs.AI, cs.CL, cs.LG
- Published: December 19, 2025