[Paper] Improving Code Generation via Small Language Model-as-a-judge
Source: arXiv - 2602.11911v1
Overview
The paper investigates whether small language models (SLMs) can reliably act as “judges” that rank multiple generated code snippets, a role traditionally reserved for massive LLMs with tens of billions of parameters. By training modern SLMs to distinguish correct from incorrect implementations, the authors show that these lightweight models can match, or even surpass, larger and costlier systems, opening a path for companies to build high‑quality code generators without the heavyweight hardware bill.
Key Contributions
- Benchmarking SLM judges: Trains several state‑of‑the‑art SLMs (e.g., CodeT5, LLaMA‑7B) to classify generated code as correct or buggy.
- Classification accuracy analysis: Provides the first systematic measurement of how often a model mis‑judges a solution, filling a gap left by prior work (RankEF).
- Execution‑free ranking: Demonstrates that pure language‑model signals (syntax, semantics, token patterns) are enough to outperform the earlier RankEF approach, which relied on both execution‑based and static cues.
- Cost‑performance trade‑off: Shows that SLM‑based ranking achieves comparable results to LLMs that are 5–25× larger, while requiring a fraction of the compute and memory budget.
- Open‑source reproducibility: Releases training scripts, datasets, and model checkpoints, enabling practitioners to plug the judge into their own code‑generation pipelines.
Methodology
- Data collection – The authors gathered a large corpus of code‑generation tasks spanning multiple programming languages, including both mainstream languages and niche DSLs. For each task, they generated N = 10 candidate solutions using a baseline generator.
- Labeling – Each candidate was automatically labeled as correct or incorrect by running unit tests (when available) or by applying static‑analysis heuristics (a minimal test‑harness sketch follows this list).
- Judge training – Several SLMs were fine‑tuned on this labeled pool with a binary classification head. The training objective was simply to predict “correct vs. wrong” given the prompt + candidate code (a fine‑tuning sketch also follows this list).
- Evaluation metrics –
  - Classification accuracy (how often the judge gets the label right).
  - Ranking quality, measured by Mean Reciprocal Rank (MRR) and Top‑1 success rate when the judge is used to pick the best candidate among the N generated snippets (both metrics are sketched after this list).
- Baselines – Compared against the previously published RankEF (a T5‑based ranker that mixes execution info) and against large commercial LLMs (e.g., GPT‑4‑style models) used as black‑box rankers.
- Ablation – Tested the impact of adding execution‑based features to the SLM judge and of varying model size.
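The labeling step can be sketched with an ordinary test harness. The pytest invocation, file names, and timeout below are illustrative assumptions (the paper does not specify its harness, and the static‑analysis fallback is omitted):
```python
# Test-based labeling sketch (assumption: each task ships a pytest file that
# imports the candidate from solution.py; not the authors' exact harness).
import subprocess
import tempfile
from pathlib import Path

def label_candidate(candidate_code: str, test_code: str, timeout: int = 30) -> int:
    """Return 1 if the candidate passes its unit tests, 0 otherwise."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "solution.py").write_text(candidate_code)
        (workdir / "test_solution.py").write_text(test_code)
        try:
            result = subprocess.run(
                ["python", "-m", "pytest", "-q", "test_solution.py"],
                cwd=workdir, capture_output=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return 0  # a hanging candidate counts as incorrect
        return 1 if result.returncode == 0 else 0
```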
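The judge fine‑tuning itself can be sketched as standard sequence classification. The checkpoint (microsoft/codebert-base as a stand‑in for the SLMs evaluated in the paper), input format, and hyperparameters are assumptions rather than the authors' exact setup:
```python
# Fine-tuning a small judge with a binary classification head.
# Checkpoint, column names, and hyperparameters are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "microsoft/codebert-base"  # stand-in for the SLMs used in the paper
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# One example = (task prompt, one candidate solution, pass/fail label).
train_data = Dataset.from_dict({
    "prompt": ["Write a function add(a, b) that returns the sum of two numbers."],
    "candidate": ["def add(a, b):\n    return a + b"],
    "label": [1],  # 1 = passes its tests, 0 = fails
})

def tokenize(batch):
    # Encode prompt and candidate as a sentence pair; a simple input format.
    return tokenizer(batch["prompt"], batch["candidate"],
                     truncation=True, max_length=512)

train_data = train_data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="slm-judge", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_data,
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```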
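Finally, the two ranking metrics can be computed as below, given per‑prompt lists of judge scores and ground‑truth labels; these helpers are illustrative, not the paper's evaluation code:
```python
# Ranking metrics over per-prompt candidate lists: scores[i] are the judge's
# scores for prompt i's candidates, labels[i] the ground truth (1 = correct).
from typing import List

def mean_reciprocal_rank(scores: List[List[float]], labels: List[List[int]]) -> float:
    """Average of 1/rank of the first correct candidate under the judge's ordering."""
    total = 0.0
    for s, l in zip(scores, labels):
        order = sorted(range(len(s)), key=lambda i: s[i], reverse=True)
        rank = next((pos + 1 for pos, i in enumerate(order) if l[i] == 1), None)
        total += 0.0 if rank is None else 1.0 / rank
    return total / len(scores)

def top1_success(scores: List[List[float]], labels: List[List[int]]) -> float:
    """Fraction of prompts whose highest-scoring candidate is actually correct."""
    hits = sum(l[max(range(len(s)), key=lambda i: s[i])] for s, l in zip(scores, labels))
    return hits / len(scores)
```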
Results & Findings
| Model (size) | Classification Acc. | MRR (ranking) | Top‑1 Success |
|---|---|---|---|
| RankEF (T5‑base) | 71.3 % | 0.42 | 38 % |
| CodeT5‑small (220 M) | 78.9 % | 0.51 | 45 % |
| LLaMA‑7B | 84.2 % | 0.58 | 52 % |
| GPT‑4 (≈175 B) (black‑box) | 86.5 % | 0.61 | 55 % |
Key takeaways
- Modern SLMs outperform RankEF on both classification and ranking, even without any execution feedback.
- Adding execution information to the SLM judge yields only marginal gains (<2 % absolute), suggesting the language model already captures most of the needed correctness cues.
- The gap between a 7‑billion‑parameter SLM and a 175‑billion‑parameter commercial LLM is less than 5 % on the ranking metric, while the SLM runs on a single GPU and costs < $0.10 per inference batch.
- Across languages, the SLM judges maintain consistent performance, indicating robustness to domain‑specific syntax.
Practical Implications
- Cheaper in‑house code generators: Companies can train a modest‑size SLM as a judge and pair it with any existing generator (e.g., Codex, open‑source CodeGen). The combined system rivals the output quality of far larger, proprietary models.
- Fast feedback loops: Because the judge runs inference‑only and does not need to compile or execute code, it can be integrated into IDE extensions, CI pipelines, or pull‑request bots for real‑time ranking of generated snippets.
- Support for niche languages: Organizations that rely on DSLs or legacy languages can fine‑tune a small judge on their own test suites, achieving high correctness discrimination without the massive data requirements of a full LLM.
- Resource‑constrained environments: Edge devices, CI runners, or low‑budget startups can now afford a “generate‑and‑rank” workflow on a single GPU or even CPU‑only inference, dramatically lowering the barrier to adopt AI‑assisted coding.
- Foundation for automated code review: The binary correctness signal can be repurposed as a lightweight static‑analysis‑style check, flagging potentially buggy suggestions before they reach a human reviewer (see the ranking‑and‑flagging sketch after this list).
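As a concrete illustration of the generate‑and‑rank and flagging workflow described above, here is a sketch that scores candidates with the fine‑tuned judge and applies a confidence gate; the softmax‑over‑two‑classes scoring, the 0.5 threshold, and the helper names are assumptions, not the paper's implementation:
```python
# Execution-free ranking of candidates with the fine-tuned judge, plus a simple
# confidence gate. Threshold and helper names are hypothetical.
import torch

def judge_scores(prompt, candidates, model, tokenizer):
    """Probability that each candidate is correct, according to the judge."""
    model.eval()  # disable dropout for inference
    enc = tokenizer([prompt] * len(candidates), candidates, padding=True,
                    truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1)[:, 1].tolist()  # index 1 = "correct" class

def pick_and_flag(prompt, candidates, model, tokenizer, threshold=0.5):
    """Return the top-ranked candidate and whether it should go to a human reviewer."""
    scores = judge_scores(prompt, candidates, model, tokenizer)
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    needs_review = scores[best_idx] < threshold  # low judge confidence -> flag it
    return candidates[best_idx], scores[best_idx], needs_review
```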
Limitations & Future Work
- Reliance on test‑derived labels: The ground‑truth correctness comes from unit tests or heuristics, which may not capture subtle logical bugs; the judge inherits this bias.
- Scalability to extremely large candidate sets: The study evaluated up to 10 candidates per prompt; performance when ranking dozens or hundreds of snippets remains unexplored.
- Domain shift: While the models generalize across several languages, extreme domain‑specific vocabularies (e.g., hardware description languages) could still challenge a small judge.
- Future directions suggested by the authors include:
  - Incorporating self‑supervised contrastive learning to better separate correct/incorrect code.
  - Evaluating the judge in interactive coding assistants where the user can provide feedback.
  - Extending the approach to multi‑modal inputs (e.g., natural‑language specifications plus diagrams).
Authors
- Giuseppe Crupi
- Rosalia Tufano
- Gabriele Bavota
Paper Information
- arXiv ID: 2602.11911v1
- Categories: cs.SE
- Published: February 12, 2026