[Paper] SWE-RM: Execution-free Feedback For Software Engineering Agents

Published: December 26, 2025 at 03:26 AM EST
4 min read
Source: arXiv - 2512.21919v1

Overview

The paper introduces SWE‑RM, a large‑scale “reward model” that can judge the quality of code generated by AI agents without actually running the code. By providing fine‑grained, execution‑free feedback, SWE‑RM makes it easier to improve coding assistants through test‑time scaling (TTS) and reinforcement learning (RL), while sidestepping the brittleness and sparsity of traditional unit‑test based signals.
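To make the test‑time‑scaling use concrete, here is a minimal best‑of‑N selection sketch in Python. The paper summary does not specify SWE‑RM's API, so `score_patch` is a hypothetical stand‑in for a reward‑model call; the point it illustrates is that candidates are ranked without executing any tests.

```python
# Minimal sketch of execution-free best-of-N selection (test-time scaling).
# `score_patch` stands in for a reward-model call such as SWE-RM; the exact
# interface is not given in the paper summary, so this stub is illustrative.

from typing import Callable, List

def select_best_patch(
    issue: str,
    candidate_patches: List[str],
    score_patch: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest reward-model score.

    No unit tests are executed; the reward model alone ranks candidates.
    """
    scores = [score_patch(issue, patch) for patch in candidate_patches]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return candidate_patches[best_index]

if __name__ == "__main__":
    # Placeholder scorer: a real deployment would run a forward pass of the reward model.
    dummy_scorer = lambda issue, patch: float(len(patch))  # hypothetical
    patches = ["--- fix A ---", "--- longer fix B ---"]
    print(select_best_patch("Fix off-by-one in parser", patches, dummy_scorer))
```

In practice the scorer would be a forward pass of SWE‑RM over the issue description and the candidate trajectory, returning a continuous quality score.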

Key Contributions

  • Execution‑free feedback: Demonstrates that a reward model can replace unit‑test feedback for software‑engineering (SWE) agents, delivering richer learning signals.
  • Comprehensive evaluation metrics: Shows that good TTS performance alone is insufficient for RL; introduces classification accuracy and calibration as essential RL‑ready metrics.
  • Controlled experiments: Systematically studies how data scale, policy mixtures, and source composition affect reward‑model robustness across the three metrics.
  • Mixture‑of‑Experts architecture: Proposes SWE‑RM, a 30 B‑parameter Mixture‑of‑Experts model with only about 3 B parameters active at inference time, achieving strong accuracy while remaining compute‑efficient (a generic routing sketch follows this list).
  • State‑of‑the‑art results: Boosts open‑source coding agents (e.g., Qwen3‑Coder‑Flash, Qwen3‑Coder‑Max) by 7–10 percentage points on the SWE‑Bench Verified benchmark, surpassing all prior open models.
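The summary gives no architectural details beyond the 30 B total / 3 B active parameter counts, so the following PyTorch layer is a generic top‑k token‑routing sketch, not SWE‑RM's actual design. It only illustrates why a Mixture‑of‑Experts model keeps per‑token compute low.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Generic top-k mixture-of-experts layer (illustrative, not SWE-RM's exact design).

    Each token is routed to `k` of `num_experts` feed-forward experts, so only a
    small fraction of the parameters is active per token. This is the property
    that lets a ~30B-parameter model run with ~3B active parameters.
    """

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        gate_logits = self.router(x)                         # (tokens, num_experts)
        weights, indices = gate_logits.topk(self.k, dim=-1)  # each token picks k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(1) * expert(x[mask])
        return out

# Usage: route 16 tokens of width 64 through 8 experts, 2 active per token.
layer = TopKMoE(d_model=64, d_ff=128, num_experts=8, k=2)
print(layer(torch.randn(16, 64)).shape)  # torch.Size([16, 64])
```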

Methodology

  1. Problem framing – Treat code generation as a sequential decision process. Instead of relying on binary pass/fail from unit tests, train a reward model that predicts a continuous quality score for any generated program.
  2. Data collection – Gather a large corpus of (prompt, generated code, human‑rated quality) triples from multiple existing coding agents. The dataset mixes high‑quality and low‑quality samples to teach the model discrimination.
  3. Model design – Use a Mixture‑of‑Experts (MoE) transformer: a shared backbone routes each token through a small subset of expert sub‑networks. At inference only 3 B parameters are active, keeping latency low.
  4. Training objectives – Optimize three complementary losses (sketched in code after this list):
    • Ranking loss for TTS (select the best trajectory).
    • Classification loss to improve binary correctness prediction (important for RL).
    • Calibration loss (e.g., temperature scaling) to ensure the predicted scores reflect true probabilities.
  5. Evaluation pipeline
    • TTS: Use the reward model to re‑rank multiple candidate programs and measure the top‑1 success rate on SWE‑Bench Verified.
    • RL: Fine‑tune coding agents with PPO using the reward model as the critic, then assess the same benchmark.
    • Metrics: Report accuracy, Expected Calibration Error (ECE), and RL‑specific reward curves.
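The three training objectives can be sketched as follows. These are illustrative formulations only: a Bradley‑Terry‑style pairwise ranking loss, binary cross‑entropy for correctness classification, and temperature scaling for calibration. The paper's exact losses may differ.

```python
import torch
import torch.nn.functional as F

def reward_model_losses(
    scores_pos: torch.Tensor,   # scalar scores for higher-quality trajectories
    scores_neg: torch.Tensor,   # scalar scores for lower-quality trajectories
    temperature: torch.Tensor,  # learnable scalar used for temperature scaling
):
    """Three complementary objectives (illustrative, not the paper's exact formulations)."""
    # 1. Ranking loss: prefer the better trajectory in each pair (used for TTS re-ranking).
    ranking = -F.logsigmoid(scores_pos - scores_neg).mean()

    # 2. Classification loss: binary correctness prediction over all trajectories (for RL).
    scores = torch.cat([scores_pos, scores_neg])
    labels = torch.cat([torch.ones_like(scores_pos), torch.zeros_like(scores_neg)])
    classification = F.binary_cross_entropy_with_logits(scores, labels)

    # 3. Calibration: temperature-scaled logits so predicted probabilities track accuracy.
    calibration = F.binary_cross_entropy_with_logits(scores / temperature, labels)

    return ranking, classification, calibration

# Usage with dummy scores.
pos, neg = torch.randn(4, requires_grad=True), torch.randn(4, requires_grad=True)
temp = torch.ones(1, requires_grad=True)
print([loss.item() for loss in reward_model_losses(pos, neg, temp)])
```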

Results & Findings

| Model (baseline) | TTS Accuracy ↑ | RL Accuracy ↑ | Calibration (ECE ↓) |
| --- | --- | --- | --- |
| Qwen3‑Coder‑Flash | 51.6 % → 62.0 % | +9.4 pp | 0.18 → 0.11 |
| Qwen3‑Coder‑Max | 67.0 % → 74.6 % | +7.6 pp | 0.14 → 0.07 |

  • TTS gains stem from better trajectory ranking; the model reliably picks the highest‑quality candidate among many.
  • RL improvements are larger than TTS gains, confirming that classification accuracy and calibration are critical for stable policy updates (a standard ECE computation is sketched after this list).
  • Ablation studies reveal:
    • Scaling the reward‑model training data from 100 K to 1 M examples yields diminishing returns after ~500 K.
    • Mixing policies (e.g., combining outputs from both high‑capacity and lightweight coders) improves robustness.
    • Adding a small proportion of synthetic buggy code helps the model learn to penalize subtle errors.
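The evaluation reports Expected Calibration Error (ECE). For reference, the standard binned ECE computation that such calibration metrics typically use looks like this; it is the generic definition, not code from the paper.

```python
import numpy as np

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray, n_bins: int = 10) -> float:
    """Binned ECE: mean |accuracy - confidence| per bin, weighted by bin size.

    probs:  predicted probability that a trajectory is correct, shape (N,)
    labels: 1 if the trajectory actually resolves the issue, else 0, shape (N,)
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (probs > lo) & (probs <= hi)
        if in_bin.any():
            confidence = probs[in_bin].mean()   # average predicted probability in the bin
            accuracy = labels[in_bin].mean()    # empirical correctness rate in the bin
            ece += in_bin.mean() * abs(accuracy - confidence)
    return float(ece)

# Usage with dummy predictions that are roughly calibrated by construction.
rng = np.random.default_rng(0)
p = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < p).astype(float)
print(round(expected_calibration_error(p, y), 3))
```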

Practical Implications

  • Faster iteration cycles: Developers can evaluate code suggestions instantly without spinning up test environments, dramatically reducing CI latency for AI‑assisted coding tools.
  • Better RL‑based fine‑tuning: Companies building proprietary coding assistants can use SWE‑RM as a plug‑and‑play reward signal (see the sketch after this list), achieving higher correctness with fewer RL steps.
  • Resource‑efficient deployment: The MoE design means the model fits on a single GPU (3 B active params) while retaining the knowledge of a 30 B model—ideal for SaaS platforms that need low‑latency inference.
  • Cross‑task generality: Because SWE‑RM learns from human quality judgments rather than specific test suites, it can be repurposed for related tasks such as code review, bug‑fix suggestion, or even documentation generation.
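The "plug‑and‑play reward signal" idea amounts to swapping the reward function inside an existing RL fine‑tuning loop. A minimal sketch, where `reward_model_score` and `run_unit_tests` are hypothetical stand‑ins and any PPO/GRPO trainer would call `compute_reward` once per generated trajectory:

```python
# Sketch of replacing an execution-based reward with an execution-free one.
# `reward_model_score` is a hypothetical stand-in for a SWE-RM call; no unit
# tests or sandboxed environments are needed to score a trajectory.

from typing import Callable

def make_execution_free_reward(
    reward_model_score: Callable[[str, str], float],
) -> Callable[[str, str], float]:
    """Return a reward function that never executes the candidate patch."""
    def compute_reward(issue: str, trajectory: str) -> float:
        # Dense, continuous signal instead of sparse pass/fail from unit tests.
        return reward_model_score(issue, trajectory)
    return compute_reward

# Usage with a placeholder scorer; a real setup would query SWE-RM here.
stub_scorer = lambda issue, traj: 0.5  # hypothetical
reward_fn = make_execution_free_reward(stub_scorer)
print(reward_fn("Fix failing import", "diff --git a/mod.py b/mod.py ..."))
```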

Limitations & Future Work

  • Domain coverage: The reward model is trained on Python‑centric benchmarks; performance on other languages (e.g., Rust, Go) remains untested.
  • Reliance on human labels: Quality judgments are subjective; bias in the annotation process could affect the model’s fairness across coding styles.
  • Calibration drift: Over long RL training runs, the reward model’s calibration may degrade, requiring periodic re‑calibration or online fine‑tuning.
  • Future directions suggested by the authors include: expanding the training corpus to multi‑language code, integrating static analysis tools as auxiliary signals, and exploring hierarchical MoE structures to further reduce inference cost.

Authors

  • KaShun Shum
  • Binyuan Hui
  • Jiawei Chen
  • Lei Zhang
  • X. W.
  • Jiaxi Yang
  • Yuzhen Huang
  • Junyang Lin
  • Junxian He

Paper Information

  • arXiv ID: 2512.21919v1
  • Categories: cs.CL
  • Published: December 26, 2025