[Paper] SWE-RM: Execution-free Feedback For Software Engineering Agents

Published: December 26, 2025 at 03:26 AM EST
4 min read
Source: arXiv - 2512.21919v1

Overview

The paper introduces SWE‑RM, a large‑scale “reward model” that can judge the quality of code generated by AI agents without actually running the code. By providing fine‑grained, execution‑free feedback, SWE‑RM makes it easier to improve coding assistants through test‑time scaling (TTS) and reinforcement learning (RL), while sidestepping the brittleness and sparsity of traditional unit‑test based signals.
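To make the test‑time‑scaling use concrete, here is a minimal best‑of‑N selection sketch in Python. The paper summary does not specify SWE‑RM's API, so `score_patch` is a hypothetical stand‑in for a reward‑model call; the point it illustrates is that candidates are ranked without executing any tests.

```python
# Minimal sketch of execution-free best-of-N selection (test-time scaling).
# `score_patch` stands in for a reward-model call such as SWE-RM; the exact
# interface is not given in the paper summary, so this stub is illustrative.

from typing import Callable, List

def select_best_patch(
    issue: str,
    candidate_patches: List[str],
    score_patch: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest reward-model score.

    No unit tests are executed; the reward model alone ranks candidates.
    """
    scores = [score_patch(issue, patch) for patch in candidate_patches]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return candidate_patches[best_index]

if __name__ == "__main__":
    # Placeholder scorer: a real deployment would run a forward pass of the reward model.
    dummy_scorer = lambda issue, patch: float(len(patch))  # hypothetical
    patches = ["--- fix A ---", "--- longer fix B ---"]
    print(select_best_patch("Fix off-by-one in parser", patches, dummy_scorer))
```

In practice the scorer would be a forward pass of SWE‑RM over the issue description and the candidate trajectory, returning a continuous quality score.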

Key Contributions

  • Execution‑free feedback: Demonstrates that a reward model can replace unit‑test feedback for software‑engineering (SWE) agents, delivering richer learning signals.
  • Comprehensive evaluation metrics: Shows that good TTS performance alone is insufficient for RL; introduces classification accuracy and calibration as essential RL‑ready metrics.
  • Controlled experiments: Systematically studies how data scale, policy mixtures, and source composition affect reward‑model robustness across the three metrics.
  • Mixture‑of‑Experts architecture: Proposes SWE‑RM, a 30 B‑parameter Mixture‑of‑Experts model with only about 3 B parameters active at inference time, achieving strong accuracy while remaining compute‑efficient (a generic routing sketch follows this list).
  • State‑of‑the‑art results: Boosts open‑source coding agents (e.g., Qwen3‑Coder‑Flash, Qwen3‑Coder‑Max) by 7–10 percentage points on the SWE‑Bench Verified benchmark, surpassing all prior open models.
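The summary gives no architectural details beyond the 30 B total / 3 B active parameter counts, so the following PyTorch layer is a generic top‑k token‑routing sketch, not SWE‑RM's actual design. It only illustrates why a Mixture‑of‑Experts model keeps per‑token compute low.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Generic top-k mixture-of-experts layer (illustrative, not SWE-RM's exact design).

    Each token is routed to `k` of `num_experts` feed-forward experts, so only a
    small fraction of the parameters is active per token. This is the property
    that lets a ~30B-parameter model run with ~3B active parameters.
    """

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        gate_logits = self.router(x)                         # (tokens, num_experts)
        weights, indices = gate_logits.topk(self.k, dim=-1)  # each token picks k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(1) * expert(x[mask])
        return out

# Usage: route 16 tokens of width 64 through 8 experts, 2 active per token.
layer = TopKMoE(d_model=64, d_ff=128, num_experts=8, k=2)
print(layer(torch.randn(16, 64)).shape)  # torch.Size([16, 64])
```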

Methodology

  1. Problem framing – Treat code generation as a sequential decision process. Instead of relying on binary pass/fail from unit tests, train a reward model that predicts a continuous quality score for any generated program.
  2. Data collection – Gather a large corpus of (prompt, generated code, human‑rated quality) triples from multiple existing coding agents. The dataset mixes high‑quality and low‑quality samples to teach the model discrimination.
  3. Model design – Use a Mixture‑of‑Experts (MoE) transformer: a shared backbone routes each token through a small subset of expert sub‑networks. At inference only 3 B parameters are active, keeping latency low.
  4. Training objectives – Optimize three complementary losses (sketched in code after this list):
    • Ranking loss for TTS (select the best trajectory).
    • Classification loss to improve binary correctness prediction (important for RL).
    • Calibration loss (e.g., temperature scaling) to ensure the predicted scores reflect true probabilities.
  5. Evaluation pipeline
    • TTS: Use the reward model to re‑rank multiple candidate programs and measure the top‑1 success rate on SWE‑Bench Verified.
    • RL: Fine‑tune coding agents with PPO using the reward model as the critic, then assess the same benchmark.
    • Metrics: Report accuracy, Expected Calibration Error (ECE), and RL‑specific reward curves.
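The three training objectives can be sketched as follows. These are illustrative formulations only: a Bradley‑Terry‑style pairwise ranking loss, binary cross‑entropy for correctness classification, and temperature scaling for calibration. The paper's exact losses may differ.

```python
import torch
import torch.nn.functional as F

def reward_model_losses(
    scores_pos: torch.Tensor,   # scalar scores for higher-quality trajectories
    scores_neg: torch.Tensor,   # scalar scores for lower-quality trajectories
    temperature: torch.Tensor,  # learnable scalar used for temperature scaling
):
    """Three complementary objectives (illustrative, not the paper's exact formulations)."""
    # 1. Ranking loss: prefer the better trajectory in each pair (used for TTS re-ranking).
    ranking = -F.logsigmoid(scores_pos - scores_neg).mean()

    # 2. Classification loss: binary correctness prediction over all trajectories (for RL).
    scores = torch.cat([scores_pos, scores_neg])
    labels = torch.cat([torch.ones_like(scores_pos), torch.zeros_like(scores_neg)])
    classification = F.binary_cross_entropy_with_logits(scores, labels)

    # 3. Calibration: temperature-scaled logits so predicted probabilities track accuracy.
    calibration = F.binary_cross_entropy_with_logits(scores / temperature, labels)

    return ranking, classification, calibration

# Usage with dummy scores.
pos, neg = torch.randn(4, requires_grad=True), torch.randn(4, requires_grad=True)
temp = torch.ones(1, requires_grad=True)
print([loss.item() for loss in reward_model_losses(pos, neg, temp)])
```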

Results & Findings

| Model (baseline) | TTS Accuracy ↑ | RL Accuracy ↑ | Calibration (ECE ↓) |
| --- | --- | --- | --- |
| Qwen3‑Coder‑Flash | 51.6 % → 62.0 % | +9.4 pp | 0.18 → 0.11 |
| Qwen3‑Coder‑Max | 67.0 % → 74.6 % | +7.6 pp | 0.14 → 0.07 |

  • TTS gains stem from better trajectory ranking; the model reliably picks the highest‑quality candidate among many.
  • RL improvements are larger than TTS gains, confirming that classification accuracy and calibration are critical for stable policy updates (a standard ECE computation is sketched after this list).
  • Ablation studies reveal:
    • Scaling the reward‑model training data from 100 K to 1 M examples yields diminishing returns after ~500 K.
    • Mixing policies (e.g., combining outputs from both high‑capacity and lightweight coders) improves robustness.
    • Adding a small proportion of synthetic buggy code helps the model learn to penalize subtle errors.
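The evaluation reports Expected Calibration Error (ECE). For reference, the standard binned ECE computation that such calibration metrics typically use looks like this; it is the generic definition, not code from the paper.

```python
import numpy as np

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray, n_bins: int = 10) -> float:
    """Binned ECE: mean |accuracy - confidence| per bin, weighted by bin size.

    probs:  predicted probability that a trajectory is correct, shape (N,)
    labels: 1 if the trajectory actually resolves the issue, else 0, shape (N,)
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (probs > lo) & (probs <= hi)
        if in_bin.any():
            confidence = probs[in_bin].mean()   # average predicted probability in the bin
            accuracy = labels[in_bin].mean()    # empirical correctness rate in the bin
            ece += in_bin.mean() * abs(accuracy - confidence)
    return float(ece)

# Usage with dummy predictions that are roughly calibrated by construction.
rng = np.random.default_rng(0)
p = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < p).astype(float)
print(round(expected_calibration_error(p, y), 3))
```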

Practical Implications

  • Faster iteration cycles: Developers can evaluate code suggestions instantly without spinning up test environments, dramatically reducing CI latency for AI‑assisted coding tools.
  • Better RL‑based fine‑tuning: Companies building proprietary coding assistants can use SWE‑RM as a plug‑and‑play reward signal (see the sketch after this list), achieving higher correctness with fewer RL steps.
  • Resource‑efficient deployment: The MoE design means the model fits on a single GPU (3 B active params) while retaining the knowledge of a 30 B model—ideal for SaaS platforms that need low‑latency inference.
  • Cross‑task generality: Because SWE‑RM learns from human quality judgments rather than specific test suites, it can be repurposed for related tasks such as code review, bug‑fix suggestion, or even documentation generation.
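The "plug‑and‑play reward signal" idea amounts to swapping the reward function inside an existing RL fine‑tuning loop. A minimal sketch, where `reward_model_score` and `run_unit_tests` are hypothetical stand‑ins and any PPO/GRPO trainer would call `compute_reward` once per generated trajectory:

```python
# Sketch of replacing an execution-based reward with an execution-free one.
# `reward_model_score` is a hypothetical stand-in for a SWE-RM call; no unit
# tests or sandboxed environments are needed to score a trajectory.

from typing import Callable

def make_execution_free_reward(
    reward_model_score: Callable[[str, str], float],
) -> Callable[[str, str], float]:
    """Return a reward function that never executes the candidate patch."""
    def compute_reward(issue: str, trajectory: str) -> float:
        # Dense, continuous signal instead of sparse pass/fail from unit tests.
        return reward_model_score(issue, trajectory)
    return compute_reward

# Usage with a placeholder scorer; a real setup would query SWE-RM here.
stub_scorer = lambda issue, traj: 0.5  # hypothetical
reward_fn = make_execution_free_reward(stub_scorer)
print(reward_fn("Fix failing import", "diff --git a/mod.py b/mod.py ..."))
```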

Limitations & Future Work

  • Domain coverage: The reward model is trained on Python‑centric benchmarks; performance on other languages (e.g., Rust, Go) remains untested.
  • Reliance on human labels: Quality judgments are subjective; bias in the annotation process could affect the model’s fairness across coding styles.
  • Calibration drift: Over long RL training runs, the reward model’s calibration may degrade, requiring periodic re‑calibration or online fine‑tuning.
  • Future directions suggested by the authors include: expanding the training corpus to multi‑language code, integrating static analysis tools as auxiliary signals, and exploring hierarchical MoE structures to further reduce inference cost.

Authors

  • KaShun Shum
  • Binyuan Hui
  • Jiawei Chen
  • Lei Zhang
  • X. W.
  • Jiaxi Yang
  • Yuzhen Huang
  • Junyang Lin
  • Junxian He

Paper Information

  • arXiv ID: 2512.21919v1
  • Categories: cs.CL
  • Published: December 26, 2025