[Paper] SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents
Source: arXiv - 2601.22129v1
Overview
The paper SWE‑Replay tackles a pain point for anyone using large‑language‑model (LLM) agents to automate software‑engineering (SWE) tasks: test‑time scaling. Running an agent many times (i.e., sampling multiple trajectories) can boost success rates, but it also inflates compute costs. SWE‑Replay proposes a way to reuse work from previous runs, cutting expense without sacrificing performance, and sometimes even improving it.
Key Contributions
- Replay‑based scaling: Introduces the first test‑time scaling method that re‑uses whole or partial execution traces instead of starting from scratch each time.
- Dynamic explore‑or‑exploit decision: A lightweight heuristic picks “branch points” where the agent can either continue a fresh search or fork from a previously successful intermediate state.
- Tool‑agnostic design: Works with modern SWE agents that generate custom bash scripts or other external tools, avoiding reliance on noisy value‑function estimates.
- Empirical gains: Demonstrates up to 17.4 % cost reduction and up to 3.8 % accuracy improvement on the SWE‑Bench Verified benchmark.
- Broad validation: Shows consistent benefits on SWE‑Bench Pro and multilingual variants, indicating that the approach generalizes across datasets and languages.
Methodology
- Collect initial trajectories: Run the target SWE agent a few times on a given task, storing the full sequence of actions (e.g., repository queries, bash script generations, code edits).
- Identify “critical” steps: Instead of using a separate LLM to rank quality, SWE‑Replay evaluates each step’s potential (how much new repository information it unlocks) and reasoning significance (how central the step is to the overall solution).
- Branching logic:
  - Explore: For steps deemed low‑potential, the system discards the old trace and samples a fresh trajectory.
  - Exploit (Replay): For high‑potential steps, it forks a new run from the stored intermediate state, re‑using the already‑executed actions.
- Iterative scaling: The process repeats, gradually building a pool of reusable sub‑trajectories. The final answer is selected from the best‑scoring completed runs (e.g., using the standard SWE‑Bench verification metric).
The whole pipeline adds only a small overhead (metadata bookkeeping and heuristic scoring) compared to the heavy cost of re‑executing large language models and external tools.
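The explore-or-exploit decision above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `Step` fields, the weighted-sum `score`, and the threshold are all assumptions standing in for the paper's potential/significance heuristic.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One agent action plus the two signals the branching heuristic uses."""
    action: str          # e.g. a bash command or a code edit
    new_info: float      # "potential": how much new repository info it unlocked (0-1)
    significance: float  # "reasoning significance": centrality to the solution (0-1)

@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)

def score(step: Step, w_info: float = 0.5) -> float:
    """Combine potential and significance into one branch-point score.
    A weighted sum is an illustrative choice, not the paper's exact formula."""
    return w_info * step.new_info + (1 - w_info) * step.significance

def choose_branch_points(traj: Trajectory, threshold: float = 0.6) -> list[int]:
    """Indices of steps worth replaying (exploit); everything else triggers explore."""
    return [i for i, s in enumerate(traj.steps) if score(s) >= threshold]

def next_run(traj: Trajectory, threshold: float = 0.6):
    """Decide whether the next scaled run replays a stored prefix or starts fresh."""
    points = choose_branch_points(traj, threshold)
    if points:
        fork = max(points)  # exploit: fork from the deepest high-potential state
        return ("replay", traj.steps[:fork + 1])
    return ("explore", [])  # explore: sample a brand-new trajectory
```

In this sketch, replaying reuses every stored step up to and including the chosen branch point, so the forked run only pays for the LLM calls after that prefix.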
Results & Findings
| Benchmark | Naïve scaling (pass@1) | SWE‑Replay (pass@1) | Cost change | Accuracy change |
|---|---|---|---|---|
| SWE‑Bench Verified | 71.2 % | 74.0 % | −17.4 % | +3.8 % |
| SWE‑Bench Pro | 65.5 % | 66.9 % | −12.1 % | +1.4 % |
| Multilingual (Java, Python, …) | 58.3 % | 60.1 % | −15.8 % | +1.8 % |
Key Takeaways
- Efficiency: By re‑using work, the average number of LLM calls per task drops noticeably, directly translating to lower GPU time and API spend.
- Robustness: The heuristic for picking branch points works across languages and task complexities, indicating the method isn’t over‑fitted to a single dataset.
- No quality loss: Even when the cost is cut, the success rate stays flat or improves, suggesting that many “fresh” runs were redundant in the first place.
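The efficiency takeaway can be made concrete with a back-of-the-envelope cost model. The function names and all numbers below are illustrative assumptions, not figures from the paper:

```python
def naive_cost(k: int, steps_per_run: int) -> int:
    """LLM calls for k fully independent sampled runs."""
    return k * steps_per_run

def replay_cost(k: int, steps_per_run: int, replayed_runs: int, avg_prefix: int) -> int:
    """LLM calls when `replayed_runs` of the k runs fork from a stored prefix
    of `avg_prefix` steps, skipping those calls entirely."""
    fresh_runs = k - replayed_runs
    return fresh_runs * steps_per_run + replayed_runs * (steps_per_run - avg_prefix)

# Hypothetical workload: 10-way sampling, 30 agent steps per run,
# half the runs replaying a 10-step prefix.
baseline = naive_cost(10, 30)          # 300 calls
with_replay = replay_cost(10, 30, 5, 10)  # 250 calls, roughly a 17 % reduction
```

Under these made-up numbers the saving happens to land near the paper's reported ~17 % figure, but the point is only the mechanism: every replayed prefix step is an LLM call (and tool execution) that never has to happen.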
Practical Implications
- Cheaper CI/CD pipelines: Teams that embed LLM‑based code reviewers or automated bug‑fix generators can run more aggressive scaling (e.g., 10‑way sampling) without blowing up their cloud bill.
- Faster prototyping: Developers experimenting with new prompts or tool‑integration strategies can get higher‑quality results in fewer iterations.
- Tool‑chain compatibility: Since SWE‑Replay doesn’t depend on a separate value model, it can be dropped into existing agents that already call out to shells, Docker, or custom scripts.
- Scalable SaaS offerings: Companies offering “AI‑assisted coding” as a service can improve SLA metrics (latency, success rate) while keeping operational costs predictable.
Limitations & Future Work
- Heuristic sensitivity: The current potential/significance scoring is hand‑crafted; edge cases (e.g., highly nondeterministic tools) may misclassify a step, leading to sub‑optimal branching.
- Memory overhead: Storing full trajectories, especially large bash scripts or container snapshots, can increase disk usage for massive workloads.
- Generalization beyond SWE: The paper focuses on software‑engineering agents; applying the replay idea to other LLM‑driven domains (e.g., data‑analysis notebooks) remains an open question.
- Future directions: The authors suggest learning the branch‑point policy from data, integrating lightweight value estimators to complement the heuristic, and exploring hierarchical replay (re‑using sub‑tasks across different problems).
Authors
- Yifeng Ding
- Lingming Zhang
Paper Information
- arXiv ID: 2601.22129v1
- Categories: cs.SE, cs.AI, cs.LG
- Published: January 29, 2026