Coding Agent Teams Outperform Solo Agents: 72.2% on SWE-bench Verified

Published: March 2, 2026 at 07:10 AM EST
5 min read
Source: Dev.to

Most AI coding agents work alone. You give them an issue, they figure it out, and they hand you a fix. It’s the AI equivalent of a lone‑wolf developer—capable, but not how real software teams actually operate.

A team of researchers at Agyn asked a different question: what if, instead of a single agent, you used a coding‑agent team—with real roles, real review loops, and real coordination?

The results are hard to ignore.


The Idea: Stop Treating Issue Resolution as a Solo Task

Real software development involves coordination. A problem lands, someone researches it, someone else implements a fix, a reviewer pushes back, and things iterate. The system that emerges from that process is more robust than anything one person (or one agent) would ship alone.

The Agyn system—described in a paper published on arXiv—encodes this directly. Rather than routing a GitHub issue through a single agent with a huge context window, it spins up a team:

| Role | Responsibility |
| --- | --- |
| Manager | Coordinates execution and communication; knows when to stop |
| Researcher | Explores the repository, gathers context, writes the specification |
| Engineer | Implements the fix, debugs failures |
| Reviewer | Evaluates the PR and enforces acceptance criteria |

Each agent has a clearly scoped role, runs in its own isolated sandbox, and communicates through standard GitHub artifacts—commits, PR descriptions, and review comments. Just like a real team would.
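The role table above can be sketched as data. This is a hypothetical illustration, not the Agyn implementation: the `Role` fields and the model names assigned per role are assumptions based on the article's description.

```python
from dataclasses import dataclass, field

@dataclass
class Role:
    """A scoped team role: model, reasoning level, tools, and duties fixed up front."""
    name: str
    model: str
    reasoning: str                              # e.g. "medium" or "high"
    tools: list[str] = field(default_factory=list)
    responsibility: str = ""

# Hypothetical team mirroring the four roles described in the article.
TEAM = [
    Role("Manager", "gpt-5", "medium", ["messaging"],
         "coordinate execution and communication; know when to stop"),
    Role("Researcher", "gpt-5", "medium", ["shell", "search"],
         "explore the repository; write the specification"),
    Role("Engineer", "gpt-5-codex", "medium", ["shell", "editor"],
         "implement the fix; debug failures"),
    Role("Reviewer", "gpt-5", "medium", ["diff", "comments"],
         "evaluate the PR; enforce acceptance criteria"),
]
```

Pinning model and tools per role is what makes the later point about cost allocation possible: an expensive high-reasoning model can be given to one role without paying for it everywhere.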


Why Coding‑Agent Teams Work Better Than Solo Agents

A few design decisions make this more than just “more agents”:

  1. Isolated execution environments – Every agent gets its own sandbox with shell access and no shared filesystem. Agents can install dependencies, run tests, and configure their environment without stepping on each other, making failures easy to attribute.

  2. Explicit role enforcement – Each role specifies which model to use, what reasoning level, what tools, and what responsibilities. This prevents the “do everything” trap where a single agent accumulates too much context and starts hallucinating. It also lets you allocate expensive, high‑reasoning models only where they’re needed.

  3. Structured communication, not a fixed pipeline – The Manager dynamically coordinates execution rather than following a static script. If the Reviewer rejects the PR, the Engineer iterates. The system adapts.

  4. Context management for long tasks – Large artifacts are persisted to the filesystem rather than stuffed into the model context. Accumulated context is summarized automatically, allowing the system to run end‑to‑end on complex issues without falling apart.
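Design decisions 2 and 3 can be combined into a minimal coordination sketch. Everything here is a stand-in: the stub agents, the `resolve_issue` loop, and the `Verdict` type are assumptions used to illustrate the Researcher → Engineer → Reviewer iteration, not Agyn's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    approved: bool
    comments: str = ""

# Stub agents standing in for LLM-backed roles (hypothetical interfaces).
class Researcher:
    def write_spec(self, issue):
        return f"Spec for: {issue}"

class Engineer:
    def implement(self, spec):
        # A real Engineer would edit code and run tests in its own sandbox.
        return f"patch({spec!r})"

class Reviewer:
    def __init__(self):
        self.rounds = 0
    def review(self, patch):
        # Reject the first attempt so the feedback loop is exercised.
        self.rounds += 1
        if self.rounds < 2:
            return Verdict(False, "missing regression test")
        return Verdict(True)

def resolve_issue(issue, researcher, engineer, reviewer, max_rounds=3):
    """Manager loop: iterate Engineer <-> Reviewer until approval or give up."""
    spec = researcher.write_spec(issue)
    for _ in range(max_rounds):
        patch = engineer.implement(spec)
        verdict = reviewer.review(patch)
        if verdict.approved:
            return patch
        # Feed the review comments back into the next implementation round.
        spec += f"\n\nReview feedback: {verdict.comments}"
    return None  # the Manager knows when to stop

patch = resolve_issue("fix flaky date parsing", Researcher(), Engineer(), Reviewer())
```

The key property is that the loop is dynamic, not a fixed pipeline: a rejection routes work back to the Engineer with the Reviewer's comments attached, and the Manager bounds how long that can go on.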


Benchmark Results

The team evaluated the system on SWE‑bench Verified, a widely used benchmark where models must resolve real GitHub issues by modifying codebases and producing PRs that pass the project’s test suite.

| System | Model(s) | Resolved |
| --- | --- | --- |
| agyn | GPT‑5 / GPT‑5‑Codex (medium reasoning) | 72.2% |
| OpenHands | GPT‑5 (high reasoning) | 71.8% |
| mini‑SWE‑agent | GPT‑5.2 (high reasoning) | 71.8% |
| mini‑SWE‑agent | GPT‑5 (medium reasoning) | 65.0% |

Key detail: this system wasn’t tuned for the benchmark. The same prompts, role definitions, tools, and execution model used in production were applied directly. It outperformed competitors using higher‑reasoning model variants—without needing them.

The 7.2‑percentage‑point gain over the single‑agent baseline using the same model class (mini‑SWE‑agent with GPT‑5 at medium reasoning, 65.0%) comes purely from the team structure.


What This Means for Agent Design

The paper makes an argument that’s easy to overlook in the current race to improve models: organizational design matters as much as model quality.

  • We’ve spent a lot of energy making individual models smarter. But real‑world software development scaled because of how teams work—division of labor, code review, shared artifacts, iteration. Replicating that structure in an agent system produces measurable gains without touching the underlying model.

Take‑aways

  • Role separation reduces errors. When each agent has a narrow job, there’s less opportunity for confusion and accumulated mistakes.
  • Review loops improve output quality. A dedicated Reviewer can send work back to the Engineer, catching problems before they become permanent.
  • You don’t always need the biggest model. Allocating medium‑reasoning models across a well‑structured team can beat a single high‑reasoning agent doing everything.

What’s Next

The Agyn platform is open source on GitHub.

We believe the future is not a single general‑purpose “super agent,” but teams of specialized agents, organized the way real organizations operate: different roles, different responsibilities, clear coordination, explicit review, shared context. And we’re building toward that vision.

Coming Next

  1. Flexible, Modular Agent Organizations

    • Define custom roles
    • Assign different models per role
    • Configure tools and permissions
    • Isolate execution environments
    • Design explicit coordination flows

    Not a monolith—an organization.

  2. New Agent Communication Paradigms
    Real teams do not operate in a single synchronous loop. They:

    • Open threads
    • Leave structured comments
    • Request reviews
    • Resume work later
    • Escalate decisions

    We are introducing structured communication protocols between agents, including asynchronous collaboration, so coordination can happen across time, not just across steps.
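The asynchronous pattern above can be sketched with a thread-and-mailbox model. This is a minimal illustration of the idea, assuming hypothetical `Thread` and `Mailbox` types; it is not the protocol Agyn plans to ship.

```python
import queue
from dataclasses import dataclass, field

@dataclass
class Message:
    author: str
    body: str

@dataclass
class Thread:
    """A persistent discussion thread an agent can open now and resume later."""
    topic: str
    messages: list[Message] = field(default_factory=list)

    def post(self, author, body):
        self.messages.append(Message(author, body))

class Mailbox:
    """Asynchronous inbox: coordination happens across time, not in one loop."""
    def __init__(self):
        self._q = queue.Queue()
    def send(self, thread):
        self._q.put(thread)
    def poll(self):
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None

# The Reviewer opens a thread and requests changes; the Engineer is not blocked
# waiting for it, and picks the thread up whenever it next polls its mailbox.
inbox = Mailbox()
t = Thread("review requested")
t.post("Reviewer", "Please add a test for the empty-input case.")
inbox.send(t)

resumed = inbox.poll()
resumed.post("Engineer", "Added the test; pushed a new commit.")
```

The point of the design is that the thread, not a synchronous call stack, carries the conversation state, so escalation and resumption fall out naturally.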

The lone‑wolf agent had a good run. The team might take it from here.

Paper: Agyn: A Multi‑Agent System for Team‑Based Software Engineering (arXiv:2602.01465).
