[Paper] SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?
Source: arXiv - 2604.25737v1
Overview
The paper introduces SAFEdit, a multi‑agent system designed to make large language models (LLMs) more reliable at instructed code editing: the task of taking existing source code and a natural‑language edit request, then producing a revised version that still passes the original test suite. By decomposing the problem into specialized agents (Planner, Editor, Verifier) and adding a structured feedback loop, SAFEdit reaches a 68.6 % task success rate on the EditBench benchmark, ahead of the strongest single‑model baselines.
Key Contributions
- Multi‑agent decomposition: Introduces three cooperating agents (Planner, Editor, Verifier) that each focus on a narrow sub‑task, reducing the chance of “hallucinated” edits.
- Visibility‑aware edit plans: The Planner generates explicit, step‑by‑step edit instructions that respect the code’s existing scope and dependencies.
- Failure Abstraction Layer (FAL): Transforms raw test‑run logs into structured diagnostics, feeding concrete error signals back to the Editor for iterative refinement (a minimal data‑structure sketch follows this list).
- Empirical validation: Evaluates SAFEdit on 445 editing instances with instructions in five languages (English, Polish, Spanish, Chinese, Russian) and demonstrates a 68.6 % task success rate, 3.8 percentage points higher than the best prior single‑model result and 8.6 points higher than a ReAct baseline.
- Iterative refinement impact analysis: Shows that the loop of verification → FAL → re‑editing contributes a 17.4 % lift in overall success.
- Reduced instruction‑level hallucinations: Automated error analysis indicates fewer spurious code changes compared with single‑agent pipelines.
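The failure abstraction idea is easiest to see as a data structure. Below is a minimal sketch in Python of what a structured diagnostic emitted by the FAL might look like; the class and field names (Diagnostic, failing_test, exception, file, line, message) are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Diagnostic:
    """Structured failure record distilled from a raw test-run log.

    The paper describes the FAL's output only as "structured diagnostics";
    this concrete schema is an assumption for illustration.
    """
    failing_test: str     # name of the test that failed
    exception: str        # e.g. "NullPointerException"
    file: str             # source file where the failure surfaced
    line: Optional[int]   # line number, when the trace provides one
    message: str          # human-readable summary fed back to the Editor

# Hypothetical example matching the diagnostic quoted in the methodology:
diag = Diagnostic(
    failing_test="test_process_data",       # assumed test name
    exception="NullPointerException",
    file="Processor.java",                  # assumed file name
    line=23,
    message="NullPointerException at line 23 in processData",
)
```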
Methodology
- Planner Agent reads the original code, the edit instruction, and the test suite, then produces a visibility‑aware edit plan: a concise list of concrete modifications (e.g., "replace line 12 with if (x > 0)").
- Editor Agent receives the plan and the code, applying only the literal changes specified and avoiding broader rewrites that could introduce bugs.
- Verifier Agent runs the edited code against the provided tests. If all tests pass, the process ends.
- Failure Abstraction Layer (FAL) kicks in when tests fail: it parses stack traces, error messages, and assertion failures into a structured diagnostic (e.g., "NullPointerException at line 23 in processData").
- The diagnostic is fed back to the Editor, which attempts a targeted correction. The verify → abstract → re‑edit loop repeats until the test suite passes or a preset iteration limit is reached.
All agents are implemented as separate LLM calls (e.g., GPT‑4 or comparable models) with prompts tailored to their role, allowing the system to reuse the same underlying model while enforcing role‑specific behavior.
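To make the role separation concrete, here is a minimal orchestration sketch in Python. The helper callables (call_planner, call_editor, run_tests, abstract_failures) are hypothetical stand‑ins for the role‑specific LLM prompts and test harness the paper describes, and MAX_ITERATIONS is an arbitrary stand‑in for the paper's preset iteration limit; this is not the authors' implementation.

```python
from typing import Callable, List

MAX_ITERATIONS = 5  # preset refinement budget; the actual limit used in the paper is not assumed here


def safedit_loop(
    code: str,
    instruction: str,
    call_planner: Callable[[str, str], List[str]],       # (code, instruction) -> edit plan steps
    call_editor: Callable[[str, List[str], str], str],    # (code, plan, feedback) -> edited code
    run_tests: Callable[[str], List[str]],                # edited code -> raw failure logs (empty if green)
    abstract_failures: Callable[[List[str]], str],        # raw logs -> structured diagnostic text (the FAL)
) -> str:
    """One pass of the Planner -> Editor -> Verifier -> FAL refinement loop."""
    plan = call_planner(code, instruction)        # visibility-aware, step-by-step edit plan
    edited = call_editor(code, plan, "")          # apply only the literal changes in the plan

    for _ in range(MAX_ITERATIONS):
        failures = run_tests(edited)              # Verifier: run the original test suite
        if not failures:                          # all tests pass -> done
            return edited
        diagnostic = abstract_failures(failures)  # FAL: distill logs into concrete error signals
        edited = call_editor(edited, plan, diagnostic)  # targeted re-edit guided by the diagnostic

    return edited  # best effort once the iteration budget is exhausted
```

In this sketch the same underlying LLM can back all three roles simply by swapping prompts inside the helper callables, mirroring the role‑specific prompting described above.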
Results & Findings
| System | Task Success Rate (TSR) |
|---|---|
| Best prior single‑model (EditBench) | ~64.8 % |
| ReAct single‑agent baseline (same evaluation) | ~60.0 % |
| SAFEdit (full multi‑agent pipeline) | 68.6 % |
- The iterative refinement loop (Verifier → FAL → Editor) alone accounts for a +17.4 % boost, confirming that concrete failure feedback is far more useful than a single “pass/fail” signal.
- Across the five natural‑language locales, performance differences were modest, indicating the approach generalizes well to multilingual instructions.
- Error‑analysis logs reveal a significant drop in instruction‑level hallucinations (e.g., adding unrelated functions) compared with single‑agent runs, suggesting the planner’s explicit plan curtails over‑generation.
Practical Implications
- Developer tooling: SAFEdit’s architecture can be embedded in IDE extensions (VS Code, JetBrains) to provide reliable “fix‑my‑code” suggestions that respect existing test suites, reducing the need for manual debugging.
- Continuous integration (CI): Automated code‑review bots could use the Planner + Editor + Verifier loop to propose patches that have already been verified against the project’s unit tests before they are merged.
- Multilingual support: Because the system works with edit instructions in several languages, it can serve globally distributed teams without requiring English‑only prompts.
- Safety and compliance: The explicit plan and verification steps make the editing process auditable, helping organizations meet regulatory requirements for code provenance and change tracking.
- Cost efficiency: By reusing a single underlying LLM across roles rather than training multiple specialized models, SAFEdit avoids the cost of building and serving separate models while delivering higher reliability.
Limitations & Future Work
- Model dependence: SAFEdit’s performance hinges on the underlying LLM’s ability to follow role‑specific prompts; weaker models may degrade the overall TSR.
- Iteration budget: The current system caps the number of refinement cycles, which can limit success on particularly stubborn bugs. Adaptive stopping criteria could improve efficiency.
- Scalability to large codebases: The planner currently operates on a bounded “spatial context” window; handling whole‑project edits may require hierarchical planning or chunking strategies.
- Broader test semantics: EditBench uses unit tests; extending SAFEdit to integration or property‑based tests (e.g., fuzzing) remains an open challenge.
- Human‑in‑the‑loop evaluation: While automated metrics show gains, user studies are needed to assess developer trust, usability, and the impact on real‑world development cycles.
By addressing these points, future versions of SAFEdit could become a cornerstone of trustworthy, LLM‑assisted software maintenance.
Authors
- Noam Tarshish
- Nofar Selouk
- Daniel Hodisan
- Bar Ezra Gafniel
- Yuval Elovici
- Asaf Shabtai
- Eliya Nachmani
Paper Information
- arXiv ID: 2604.25737v1
- Categories: cs.SE, cs.AI
- Published: April 28, 2026