[Paper] SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents

Published: January 29, 2026 at 01:50 PM EST
3 min read
Source: arXiv - 2601.22129v1

Overview

The paper SWE‑Replay tackles a pain point for anyone using large‑language‑model (LLM) agents to automate software‑engineering (SWE) tasks: test‑time scaling. While running an agent many times (or “sampling trajectories”) can boost success rates, it also blows up compute costs. SWE‑Replay proposes a clever way to reuse work from previous runs, cutting expense without sacrificing – and sometimes even improving – performance.

Key Contributions

  • Replay‑based scaling: Introduces the first test‑time scaling method that re‑uses whole or partial execution traces instead of starting from scratch each time.
  • Dynamic explore‑or‑exploit decision: A lightweight heuristic picks “branch points” where the agent can either continue a fresh search or fork from a previously successful intermediate state.
  • Tool‑agnostic design: Works with modern SWE agents that generate custom bash scripts or other external tools, avoiding reliance on noisy value‑function estimates.
  • Empirical gains: Demonstrates up to 17.4 % cost reduction and up to 3.8 % accuracy improvement on the SWE‑Bench Verified benchmark.
  • Broad validation: Shows consistent benefits on SWE‑Bench Pro and multilingual variants, indicating the approach generalizes across datasets and languages.

Methodology

  1. Collect initial trajectories: Run the target SWE agent a few times on a given task, storing the full sequence of actions (e.g., repository queries, bash script generations, code edits).
  2. Identify “critical” steps: Instead of using a separate LLM to rank quality, SWE‑Replay evaluates each step’s potential (how much new repository information it unlocks) and reasoning significance (how central the step is to the overall solution).
  3. Branching logic:
    • Explore: For steps deemed low‑potential, the system discards the old trace and samples a fresh trajectory.
    • Exploit (Replay): For high‑potential steps, it forks a new run from the stored intermediate state, re‑using the already‑executed actions.
  4. Iterative scaling: The process repeats, gradually building a pool of reusable sub‑trajectories. The final answer is selected from the best‑scoring completed runs (e.g., using the standard SWE‑Bench verification metric).
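The step-scoring idea in step 2 can be sketched in a few lines of Python. This is an illustrative toy, not the paper's actual heuristic: the `Step` schema, the 0.5 weight, and the threshold are our own assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One recorded agent action (hypothetical schema)."""
    action: str                                        # e.g. "grep -r 'Foo' src/"
    new_files_seen: set = field(default_factory=set)   # repo info this step unlocked
    referenced_later: int = 0                          # how often later steps build on it


def score_step(step: Step, repo_seen_so_far: set) -> float:
    """Combine 'potential' (new repository information) with 'reasoning
    significance' (centrality to the solution). Weights are illustrative."""
    potential = len(step.new_files_seen - repo_seen_so_far)
    significance = step.referenced_later
    return potential + 0.5 * significance


def pick_branch_points(trajectory: list[Step], threshold: float = 2.0) -> list[int]:
    """Return indices of high-scoring steps worth replaying (exploiting) from;
    everything below the threshold is left to fresh exploration."""
    seen: set = set()
    branch_points = []
    for i, step in enumerate(trajectory):
        if score_step(step, seen) >= threshold:
            branch_points.append(i)
        seen |= step.new_files_seen
    return branch_points
```

Note that `seen` accumulates as the trajectory is scanned, so a step that re-reads files an earlier step already surfaced gets zero potential and is not replayed.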

The whole pipeline adds only a small overhead (metadata bookkeeping and heuristic scoring) compared to the heavy cost of re‑executing large language models and external tools.

Results & Findings

| Benchmark | Naïve scaling (baseline) | SWE‑Replay | Cost reduction | Accuracy change |
| --- | --- | --- | --- | --- |
| SWE‑Bench Verified | 71.2 % pass@1 | 74.0 % | ‑17.4 % | +3.8 % |
| SWE‑Bench Pro | 65.5 % | 66.9 % | ‑12.1 % | — |
| Multilingual (Java, Python…) | 58.3 % | 60.1 % | ‑15.8 % | — |

Key Takeaways

  • Efficiency: By re‑using work, the average number of LLM calls per task drops noticeably, directly translating to lower GPU time and API spend.
  • Robustness: The heuristic for picking branch points works across languages and task complexities, indicating the method isn’t over‑fitted to a single dataset.
  • No quality loss: Even when the cost is cut, the success rate stays flat or improves, suggesting that many “fresh” runs were redundant in the first place.

Practical Implications

  • Cheaper CI/CD pipelines: Teams that embed LLM‑based code reviewers or automated bug‑fix generators can run more aggressive scaling (e.g., 10‑way sampling) without blowing up their cloud bill.
  • Faster prototyping: Developers experimenting with new prompts or tool‑integration strategies can get higher‑quality results in fewer iterations.
  • Tool‑chain compatibility: Since SWE‑Replay doesn’t depend on a separate value model, it can be dropped into existing agents that already call out to shells, Docker, or custom scripts.
  • Scalable SaaS offerings: Companies offering “AI‑assisted coding” as a service can improve SLA metrics (latency, success rate) while keeping operational costs predictable.

Limitations & Future Work

  • Heuristic sensitivity: The current potential/significance scoring is hand‑crafted; edge cases (e.g., highly nondeterministic tools) may misclassify a step, leading to sub‑optimal branching.
  • Memory overhead: Storing full trajectories, especially large bash scripts or container snapshots, can increase disk usage for massive workloads.
  • Generalization beyond SWE: The paper focuses on software‑engineering agents; applying the replay idea to other LLM‑driven domains (e.g., data‑analysis notebooks) remains an open question.
  • Future directions: The authors suggest learning the branch‑point policy from data, integrating lightweight value estimators to complement the heuristic, and exploring hierarchical replay (re‑using sub‑tasks across different problems).

Authors

  • Yifeng Ding
  • Lingming Zhang

Paper Information

  • arXiv ID: 2601.22129v1
  • Categories: cs.SE, cs.AI, cs.LG
  • Published: January 29, 2026