[Paper] SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories

Published: December 19, 2025 at 05:16 AM EST
4 min read

Source: arXiv - 2512.17419v1

Overview

SWE‑Bench++ is a new framework that automatically turns real pull requests from open‑source GitHub projects into reproducible, repository‑level coding tasks. By harvesting live bug‑fixes and feature requests across 11 programming languages, it creates a scalable, multilingual benchmark for evaluating large language models (LLMs) on realistic software‑engineering problems.

Key Contributions

  • Automated benchmark pipeline: End‑to‑end system that extracts, builds, and validates tasks from live GitHub PRs without manual curation.
  • Multilingual coverage: Generates tasks for 11 languages (Python, JavaScript, Java, Go, Rust, etc.), expanding beyond the Python‑centric focus of prior benchmarks.
  • Execution‑based evaluation: Each task includes an automatically synthesized environment and test oracle, enabling pass@k‑style metrics that reflect real build‑and‑test outcomes (a pass@k sketch follows this list).
  • Hint‑guided trajectory synthesis: Converts hard instances (where strong models fail) into step‑by‑step “hints” that can be used for fine‑tuning or curriculum learning.
  • Large‑scale dataset: 11,133 instances from 3,971 repositories, with a curated subset of 1,782 instances used for thorough evaluation.
  • Demonstrated impact: Fine‑tuning on SWE‑Bench++ improves performance on the existing SWE‑Bench Multilingual benchmark, showing the data’s utility for model adaptation.
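
For readers unfamiliar with pass@k: the summary does not say which estimator the authors use, so the sketch below shows the standard unbiased estimator from the Codex paper as a reference point, assuming n sampled patches per task of which c pass the test oracle.

```python
# Minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021).
# Assumption: SWE-Bench++ scores tasks by execution outcome; whether the paper
# uses this exact estimator is not stated in this summary.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled patches passes,
    given n total samples of which c passed the test oracle."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 passing patches out of 20 samples, evaluated at k=10.
print(round(pass_at_k(n=20, c=3, k=10), 4))
```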

Methodology

  1. Programmatic Sourcing – The pipeline queries the GitHub API to collect recently merged PRs that contain either bug‑fix or feature‑addition changes (see the sourcing sketch after this list).
  2. Environment Synthesis – For each repository, the system automatically resolves dependencies (e.g., requirements.txt, package.json, Cargo.toml) and builds a Docker container that mirrors the original development environment.
  3. Test Oracle Extraction – Existing CI test suites (e.g., pytest, JUnit, go test) are harvested and repurposed as ground‑truth oracles. The framework records the pre‑PR test results (failing) and the post‑PR results (passing); see the fail‑to‑pass sketch below.
  4. Quality Assurance – Automated checks verify that the PR can be cleanly applied, that tests are deterministic, and that the task is self‑contained (no external secrets).
  5. Hint‑Guided Trajectory Synthesis – For instances where top‑tier models (Claude‑Sonnet‑4.5, GPT‑5, Gemini‑2.5‑Pro) fail to pass, the system extracts intermediate commit diffs and comments to create a step‑wise “hint” sequence that can be used for curriculum‑style fine‑tuning.
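
As a concrete illustration of step 1, the sketch below queries the GitHub Search API for recently merged, bug‑labelled pull requests in a given language. The endpoint and search qualifiers are real GitHub API features, but the specific filters and helper names are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the "programmatic sourcing" step: search GitHub for
# recently merged PRs that look like bug fixes. Requires a GITHUB_TOKEN
# environment variable; query filters are illustrative only.
import os
import requests

GITHUB_SEARCH = "https://api.github.com/search/issues"

def fetch_merged_prs(language: str, since: str, per_page: int = 30) -> list[dict]:
    """Return recently merged, bug-labelled PRs for one language."""
    query = f"is:pr is:merged label:bug language:{language} merged:>={since}"
    resp = requests.get(
        GITHUB_SEARCH,
        params={"q": query, "sort": "updated", "per_page": per_page},
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["items"]

# Example: harvest candidate bug-fix PRs merged this year in Rust projects.
for pr in fetch_merged_prs("rust", "2025-01-01"):
    print(pr["repository_url"], pr["number"], pr["title"])
```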

The entire pipeline runs continuously, allowing the benchmark to stay up‑to‑date with the evolving open‑source ecosystem.
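
The fail‑to‑pass check at the heart of steps 2-4 can be sketched as follows: run the harvested test command inside the task's container before and after applying the PR patch, and keep the instance only if the suite flips from failing to passing. Image names, mount paths, and the test command below are hypothetical placeholders, not the paper's harness.

```python
# Minimal sketch of a fail-to-pass oracle check, assuming each task ships a
# Docker image with the build toolchain and a host checkout of the repository.
import subprocess

def run_tests(image: str, repo_dir: str, test_cmd: str, patch: str | None = None) -> bool:
    """Return True if the test command exits 0 inside the task container."""
    script = f"cd /repo && {test_cmd}"
    if patch is not None:
        # Apply the gold (PR) patch before running the suite.
        script = f"cd /repo && git apply /patch.diff && {test_cmd}"
    cmd = ["docker", "run", "--rm", "-v", f"{repo_dir}:/repo"]
    if patch is not None:
        cmd += ["-v", f"{patch}:/patch.diff:ro"]
    cmd += [image, "bash", "-lc", script]
    return subprocess.run(cmd, capture_output=True).returncode == 0

def is_valid_task(image: str, repo_dir: str, test_cmd: str, gold_patch: str) -> bool:
    """Keep the instance only if tests fail pre-patch and pass post-patch."""
    failed_before = not run_tests(image, repo_dir, test_cmd)
    passed_after = run_tests(image, repo_dir, test_cmd, patch=gold_patch)
    return failed_before and passed_after
```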

Results & Findings

  • Model performance (pass@10 on the 1,782‑instance subset):
    • Claude‑Sonnet‑4.5: 36.20 %
    • GPT‑5 (2025‑08‑07): 34.57 %
    • Gemini‑2.5‑Pro: 24.92 %
    • GPT‑4o: 16.89 %
  • Multilingual gap: All models perform noticeably worse on non‑Python languages, highlighting a current limitation in LLM code capabilities.
  • Fine‑tuning benefit: Models fine‑tuned on SWE‑Bench++ instances achieve measurable gains (up to ~5 % absolute improvement) on the separate SWE‑Bench Multilingual benchmark, confirming that the generated data is effective for adaptation.
  • Scalability proof: The pipeline produced >11k high‑quality tasks with minimal human oversight, demonstrating that benchmark growth can keep pace with the rapid expansion of open‑source codebases.

Practical Implications

  • More realistic evaluation: Developers can now benchmark LLMs against tasks that mirror actual pull‑request workflows—complete with build steps, dependency resolution, and test suites—rather than synthetic “toy” problems.
  • Multilingual tooling: Companies with polyglot codebases (e.g., micro‑services written in Go, Rust, and JavaScript) gain a benchmark that reflects their stack, enabling better model selection and targeted fine‑tuning.
  • Curriculum learning pipelines: The hint‑guided trajectories can be fed into continuous‑learning pipelines, allowing teams to iteratively improve model performance on the hardest real‑world bugs.
  • Open‑source contribution loop: Since the benchmark is derived from live PRs, improvements in model‑generated patches can be fed back to the originating repositories, potentially automating low‑risk contributions.
  • Tooling integration: The Docker‑based environment synthesis aligns with CI/CD pipelines, making it straightforward to plug the benchmark into existing testing infrastructure for automated regression testing of new LLM releases (a minimal CI sketch follows this list).
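
As a rough illustration of that last point, a nightly CI job could evaluate a new model release's patches against each task's containerized oracle and gate on the aggregate pass rate. The JSONL schema, the assumption that each task image bundles its repository at /repo, and the threshold below are illustrative, not part of the released framework.

```python
# Hypothetical CI regression gate: apply each model-generated patch inside its
# task image, run the task's tests, and fail the job if the pass rate drops
# below a chosen bar. Assumes the image contains the repo checkout at /repo.
import json
import subprocess
import sys

PASS_RATE_THRESHOLD = 0.30  # illustrative regression bar

def evaluate_patch(image: str, patch_file: str, test_cmd: str) -> bool:
    """Apply a candidate patch inside the task image and run its test suite."""
    script = f"cd /repo && git apply /patch.diff && {test_cmd}"
    result = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{patch_file}:/patch.diff:ro",
         image, "bash", "-lc", script],
        capture_output=True,
    )
    return result.returncode == 0

def main(tasks_path: str) -> None:
    passed = total = 0
    with open(tasks_path) as fh:
        for line in fh:
            # Expected fields (assumed): {"image": ..., "patch": ..., "test_cmd": ...}
            task = json.loads(line)
            total += 1
            passed += evaluate_patch(task["image"], task["patch"], task["test_cmd"])
    rate = passed / max(total, 1)
    print(f"pass rate: {rate:.2%} ({passed}/{total})")
    sys.exit(0 if rate >= PASS_RATE_THRESHOLD else 1)

if __name__ == "__main__":
    main(sys.argv[1])
```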

Limitations & Future Work

  • Quality variance: Not all PRs have comprehensive test suites; some tasks rely on minimal or flaky tests, which can affect evaluation reliability.
  • Language bias: Although 11 languages are covered, the distribution is still skewed toward popular ecosystems (Python, JavaScript). Rare languages may remain under‑represented.
  • Static analysis missing: The current pipeline does not incorporate static analysis or security checks, which are important for production‑grade code generation.
  • Future directions: The authors plan to (1) integrate richer static‑analysis oracles, (2) expand to additional languages and frameworks, (3) add a “human‑in‑the‑loop” verification stage for edge‑case PRs, and (4) open‑source the pipeline itself to enable community‑driven benchmark extensions.

Authors

  • Lilin Wang
  • Lucas Ramalho
  • Alan Celestino
  • Phuc Anthony Pham
  • Yu Liu
  • Umang Kumar Sinha
  • Andres Portillo
  • Onassis Osunwa
  • Gabriel Maduekwe

Paper Information

  • arXiv ID: 2512.17419v1
  • Categories: cs.SE, cs.AI, cs.CL, cs.LG
  • Published: December 19, 2025