[Paper] SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Source: arXiv - 2602.22124v1
Overview
The paper presents SWE‑Protégé, a lightweight post‑training framework that lets a small language model (SLM) act as a software‑engineering agent by learning when and how to ask for help from a much stronger “expert” model. By treating software repair as a collaborative mentorship process, the authors dramatically boost the performance of a 7‑billion‑parameter model on the challenging SWE‑bench benchmark while keeping the cost and latency advantages of small models.
Key Contributions
- Mentor‑Protégé Paradigm: Reframes software‑repair tasks as a selective collaboration between an SLM (protégé) and a powerful expert model, rather than a pure monolithic generation.
- Sparse Expert Querying: Introduces a mechanism for the protégé to decide when to call the expert, achieving only ~4 expert calls per task (≈11 % of total tokens).
- Dual‑Phase Training: Combines supervised fine‑tuning on expert‑augmented trajectories with a reinforcement‑learning (RL) stage that penalizes looping and unnecessary expert reliance.
- State‑of‑the‑Art for SLMs: After light post‑training of Qwen2.5‑Coder‑7B‑Instruct, the model reaches 42.4 % Pass@1 on SWE‑bench Verified—a 25.4 % absolute gain over the previous best small‑model baseline.
- Generalizable Framework: The approach is model‑agnostic and can be applied to any SLM that can be fine‑tuned, opening a path toward cost‑effective AI‑assisted development tools.
Methodology
- Problem Reframing – The authors view each software‑repair episode as a sequence of states (code snapshots, test results, etc.). The SLM decides at each step whether to continue on its own or to request a suggestion from an expert LLM.
- Data Generation – Expert‑augmented trajectories are created by running the expert model on a variety of repair tasks and recording the points where its intervention leads to progress. These trajectories serve as supervised targets.
- Supervised Fine‑Tuning (SFT) – The SLM is first fine‑tuned on the expert‑augmented data, learning to imitate the expert’s advice while also learning to recognize “stalled” states that need help.
- Agentic Reinforcement Learning – A reward model is built to encourage three behaviours: (a) task completion, (b) minimal expert calls, and (c) avoidance of action loops (repeating the same unproductive edit). The SLM is then trained with PPO‑style RL to maximize this reward.
- Inference Policy – During deployment, the protégé runs a lightweight classifier at each step to decide: continue alone vs. query expert. If it queries, it appends the expert’s suggestion to its context and proceeds.
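The reward described in the RL stage can be sketched as a simple scoring function. This is an illustrative assumption, not the paper's actual implementation: the function name, weights, and loop-detection rule here are hypothetical, chosen only to show how the three behaviours (completion, few expert calls, no loops) could be combined into one scalar reward for PPO-style training.

```python
# Hypothetical sketch of the trajectory reward from the RL stage: reward task
# completion, penalize expert calls, and penalize degenerate action loops.
# All weights and names are illustrative assumptions, not paper values.

def trajectory_reward(resolved, expert_calls, actions,
                      completion_bonus=1.0, expert_cost=0.05, loop_penalty=0.2):
    """Score one repair trajectory for PPO-style policy optimization."""
    reward = completion_bonus if resolved else 0.0
    reward -= expert_cost * expert_calls  # discourage over-reliance on the expert
    # Count degenerate loops: identical consecutive actions (unproductive repeats).
    loops = sum(1 for a, b in zip(actions, actions[1:]) if a == b)
    reward -= loop_penalty * loops
    return reward
```

A trajectory that resolves the task with ~4 expert calls and no loops scores close to the full completion bonus, while a looping trajectory is pushed toward zero even if it eventually succeeds.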
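The inference-time policy above can be sketched as a short control loop. The helper names (`protege_step`, `needs_help`, `ask_expert`) are hypothetical stand-ins for the paper's learned components; this is a minimal sketch of the control flow, not the authors' code.

```python
# Minimal sketch of the deployment loop: at each step the protégé checks a
# learned "stalled-state" classifier; if help is needed, the expert's
# suggestion is appended to the context before the protégé acts.
# `protege_step`, `needs_help`, and `ask_expert` are hypothetical names.

def run_episode(task, protege_step, needs_help, ask_expert, max_steps=30):
    context, expert_calls = [task], 0
    for _ in range(max_steps):
        if needs_help(context):               # learned stalled-state check
            context.append(ask_expert(context))
            expert_calls += 1
        action = protege_step(context)        # protégé acts on its own
        context.append(action)
        if action == "submit":                # episode ends on final patch
            break
    return context, expert_calls
```

Because `needs_help` fires rarely (~4 times per task in the paper's results), most steps cost only a small-model forward pass.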
Results & Findings
| Metric | Baseline | SWE‑Protégé (7B) |
|---|---|---|
| Pass@1 on SWE‑bench Verified | ~17 % (prior best SLM) | 42.4 % |
| Expert calls per task | every step (expert run end‑to‑end) | ~4 |
| Expert token share | 100 % (expert run end‑to‑end) | ≈11 % |
| Looping incidents (degenerate repeats) | Frequent in prior SLMs | Rare (explicitly penalized) |
- Performance Jump: The 25.4 % absolute improvement demonstrates that selective expert guidance can close much of the gap between small and large models.
- Efficiency: Even with the extra expert calls, overall latency and cost remain far lower than running a giant model end‑to‑end.
- Robustness: The RL stage successfully suppresses the notorious “action looping” problem that has plagued prior SLM attempts on long‑horizon coding tasks.
Practical Implications
- Cost‑Effective AI Pair‑Programming: Development teams can deploy a modest‑size model locally (or on inexpensive cloud VMs) and still reap near‑state‑of‑the‑art repair capabilities, only invoking a heavyweight model when truly needed.
- Low‑Latency IDE Assistants: Because the SLM does most of the heavy lifting, response times stay within interactive bounds, making the system suitable for real‑time code suggestions in editors.
- Customizable Expertise: Organizations can swap in a domain‑specific expert (e.g., a security‑focused LLM) while keeping the lightweight protégé, enabling tailored assistance without re‑training the whole stack.
- Scalable CI/CD Integration: Automated code‑review bots could run the protégé on every PR; only the few cases that stall would trigger an expensive expert call, dramatically reducing CI costs.
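The CI/CD scenario above amounts to a simple escalation gate. The sketch below is a hypothetical illustration of that idea, not part of the paper: `protege_review` and `expert_review` are assumed stand-ins for a cheap local model and an expensive hosted one.

```python
# Illustrative escalation gate for a CI review bot (not from the paper):
# the cheap protégé reviews every PR, and only cases where it stalls
# (returns None) are escalated to the expensive expert model.
# `protege_review` and `expert_review` are hypothetical callables.

def review_pr(diff, protege_review, expert_review, max_attempts=2):
    """Return (verdict, used_expert) for one pull request."""
    for _ in range(max_attempts):
        verdict = protege_review(diff)
        if verdict is not None:               # protégé succeeded on its own
            return verdict, False
    return expert_review(diff), True          # escalate only when stalled
```

Under this design, expert cost scales with the number of hard PRs rather than the total number of PRs.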
Limitations & Future Work
- Dependency on a Strong Expert: The framework still requires access to a high‑quality, often proprietary, large model for the mentorship phase, which may limit fully open‑source deployments.
- Sparse Expert Signals: While the system learns when to ask, the learned decision policy may still miss subtle bugs that need expert insight early on, before any obvious stall signal appears.
- Generalization Beyond Repair: The study focuses on bug‑fixing (SWE‑bench). Extending the mentor‑protégé paradigm to tasks like feature implementation, refactoring, or documentation generation remains an open question.
- RL Stability: The reinforcement‑learning stage can be sensitive to reward shaping; future work could explore more robust, automated reward design or curriculum learning strategies.
Overall, SWE‑Protégé shows that small models don’t have to stay in the shadows of their giant counterparts—by learning to ask the right questions at the right time, they can become practical, affordable software‑engineering assistants.
Authors
- Patrick Tser Jern Kon
- Archana Pradeep
- Ang Chen
- Alexander P. Ellis
- Warren Hunt
- Zijian Wang
- John Yang
- Samuel Thompson
Paper Information
- arXiv ID: 2602.22124v1
- Categories: cs.SE, cs.AI, cs.CL, cs.LG
- Published: February 25, 2026