[Paper] SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

Published: February 27, 2026, 05:06 AM EST
4 min read
Source: arXiv - 2602.23866v1

Overview

The paper presents SWE‑rebench V2, a new, language‑agnostic pipeline that automatically harvests real‑world software‑engineering (SWE) tasks from open‑source repositories and turns them into ready‑to‑run reinforcement‑learning (RL) environments. By scaling up both the number of tasks (tens of thousands) and the variety of programming languages (20+), the authors aim to give developers of AI‑powered coding assistants a richer training ground than ever before.

Key Contributions

  • Automated, language‑agnostic collection pipeline that extracts install scripts, test suites, and problem statements from any GitHub repository.
  • LLM‑based filtering using an ensemble of judges to discard noisy or unsolvable instances, validated against human‑annotated benchmarks.
  • Large‑scale dataset:
    • 32k+ high‑quality tasks with reproducible Docker images covering 20 languages and 3.6k repositories.
    • 120k+ additional tasks with installation instructions and metadata (no pre‑built images).
  • Open‑source release of the datasets, the harvesting code, and the execution infrastructure, enabling anyone to reproduce or extend the benchmark.
  • Diagnostic evaluation across five languages and seven popular LLMs, exposing common confounders like overly strict tests or vague descriptions.
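
The two dataset splits above differ mainly in whether a pre‑built Docker image ships with the task. A minimal sketch of how a consumer might select between them, where the record fields (`repo`, `language`, `has_image`) are illustrative assumptions rather than the dataset's actual schema:

```python
# Hypothetical task records mirroring the two released splits; field names
# are illustrative, not taken from the dataset's real schema.
tasks = [
    {"repo": "org/app", "language": "Python", "has_image": True},
    {"repo": "org/cli", "language": "Rust", "has_image": True},
    {"repo": "org/web", "language": "Elixir", "has_image": False},
]

def select(tasks, language=None, require_image=False):
    """Filter tasks by language and by whether a pre-built image ships."""
    return [
        t for t in tasks
        if (language is None or t["language"] == language)
        and (not require_image or t["has_image"])
    ]

ready_now = select(tasks, require_image=True)  # runnable without building images
rust_only = select(tasks, language="Rust")
```

Tasks from the lighter 120k+ split would fail the `require_image` filter and need an on‑the‑fly image build from their install scripts first.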

Methodology

  1. Repository Mining – The pipeline crawls GitHub for recent pull‑request (PR) merges that contain a clear description and an associated test suite.
  2. Interactive Setup Agent – For each candidate repo, a lightweight agent attempts to install the project and run its tests, automatically generating Dockerfiles and scripts that capture the exact environment (OS, dependencies, build tools).
  3. LLM Judging Ensemble – Multiple large language models are prompted to assess whether the extracted task is well‑posed (e.g., does the test actually verify the PR change?). Their votes are aggregated; only tasks with consensus pass to the final set.
  4. Human Validation – A subset of the filtered tasks is cross‑checked against the existing SWE‑bench annotations to ensure the LLM judges are not drifting.
  5. Metadata Enrichment – Each task is annotated with language, repository, test pass/fail status, and flags for known issues (e.g., flaky tests, ambiguous problem statements).
  6. Dataset Packaging – High‑quality tasks are shipped with pre‑built Docker images; the larger, lighter set includes just the install scripts and metadata for on‑the‑fly image construction.
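
The judging stage in step 3 can be sketched as a simple vote aggregator over an ensemble of judges. The judge names, the `quorum` parameter, and the default of strict consensus are illustrative assumptions, not the paper's exact configuration:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    """One LLM judge's assessment of a candidate task (hypothetical schema)."""
    judge: str        # e.g. "judge-a" -- illustrative name, not from the paper
    well_posed: bool  # does the test suite actually verify the PR change?

def aggregate(verdicts: list[Verdict], quorum: float = 1.0) -> bool:
    """Accept a task only if at least `quorum` fraction of judges say yes.

    The paper requires consensus across the ensemble, which quorum=1.0
    models; lower thresholds are an assumption for experimentation.
    """
    if not verdicts:
        return False
    yes = sum(v.well_posed for v in verdicts)
    return yes / len(verdicts) >= quorum

# Three judges, one dissent: rejected under strict consensus,
# accepted under a simple majority.
votes = [
    Verdict("judge-a", True),
    Verdict("judge-b", True),
    Verdict("judge-c", False),
]
accepted = aggregate(votes)
majority = aggregate(votes, quorum=0.5)
```

Requiring unanimity trades recall for precision: a few noisy tasks slip through less often, at the cost of discarding some genuinely solvable ones, which is why step 4's human validation matters.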

Results & Findings

  • Scale: The pipeline produced 32k+ fully reproducible tasks (≈10× the size of prior SWE‑bench releases) and 120k+ “light” tasks.
  • Language Diversity: Tasks span 20 programming languages, from mainstream (Python, JavaScript, Java) to niche (Rust, Haskell, Elixir).
  • Quality: Human‑verified sampling shows ≈92% of the filtered tasks are solvable and have meaningful test suites, matching or exceeding the quality of manually curated benchmarks.
  • Model Diagnostics: When evaluated on a representative slice, state‑of‑the‑art LLMs (e.g., GPT‑4, Claude‑2) still struggle with many tasks, especially those with overly restrictive tests or underspecified PR descriptions, highlighting gaps that larger, more diverse training data could help close.

Practical Implications

  • Richer Training Data for Code‑Gen Agents – Developers building RL‑based code assistants can now train on a dataset that mirrors the heterogeneity of real‑world projects, potentially improving cross‑language generalization.
  • Benchmarking Across Ecosystems – The language‑agnostic nature lets teams evaluate their models on languages they previously ignored, uncovering hidden weaknesses.
  • Faster Prototyping – Pre‑built Docker images mean you can spin up a task in seconds, dramatically reducing the engineering overhead of creating custom RL environments.
  • Better Test‑Driven Evaluation – With detailed metadata flagging flaky or overly strict tests, researchers can design more robust evaluation protocols that focus on genuine problem‑solving ability rather than test‑gaming.
  • Open‑source Ecosystem – By releasing the pipeline, other groups can extend the collection to private codebases or emerging languages, fostering a community‑driven benchmark ecosystem.
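
To make the "spin up a task in seconds" point concrete, here is a minimal sketch of wrapping a pre‑built task image in a sandboxed `docker run` invocation. The image tag, test command, timeout, and network‑isolation flag are all illustrative assumptions; the released infrastructure defines its own naming and entrypoints:

```python
import shlex

def docker_run_command(image: str, test_cmd: str, timeout_s: int = 1800) -> list[str]:
    """Build a `docker run` invocation for a pre-built task image.

    The flags here are a common sandboxing choice for RL environments
    (ephemeral container, no network), not the paper's actual harness.
    """
    return [
        "docker", "run", "--rm",      # remove the container after the run
        "--network", "none",          # isolate the task from the network
        image,
        "timeout", str(timeout_s),    # bound runaway test suites
        "bash", "-lc", test_cmd,
    ]

cmd = docker_run_command(
    image="swe-rebench/example-repo:task-001",  # hypothetical image tag
    test_cmd="pytest -q tests/",                # hypothetical test entrypoint
)
print(shlex.join(cmd))
```

Handing the resulting argument list to `subprocess.run(cmd)` (with the image pulled locally) would execute the task's tests in isolation; the exit code then serves as the RL reward signal.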

Limitations & Future Work

  • Reliance on Existing Test Suites – Projects without comprehensive tests are under‑represented, which may bias models toward well‑tested code patterns.
  • LLM Filtering Bias – The ensemble judges inherit the biases of the underlying LLMs; rare or unconventional tasks might be incorrectly discarded.
  • Flaky Tests Remain – Despite metadata flags, some tasks still contain nondeterministic test behavior that can confuse RL training.
  • Future Directions proposed by the authors include:
    1. Integrating static analysis to supplement missing tests.
    2. Expanding the pipeline to private enterprise repositories under controlled licensing.
    3. Exploring semi‑supervised methods to recover high‑quality tasks from noisy candidates.

Authors

  • Ibragim Badertdinov
  • Maksim Nekrashevich
  • Anton Shevtsov
  • Alexander Golubev

Paper Information

  • arXiv ID: 2602.23866v1
  • Categories: cs.SE, cs.CL
  • Published: February 27, 2026