Coding Agent Teams Outperform Solo Agents: 72.2% on SWE-bench Verified

Published: March 2, 2026 at 07:10 AM EST
5 min read
Source: Dev.to

Most AI coding agents work alone. You give them an issue, they figure it out, and they hand you a fix. It’s the AI equivalent of a lone‑wolf developer—capable, but not how real software teams actually operate.

A team of researchers at Agyn asked a different question: what if, instead of a single agent, you used a coding‑agent team—with real roles, real review loops, and real coordination?

The results are hard to ignore.


The Idea: Stop Treating Issue Resolution as a Solo Task

Real software development involves coordination. A problem lands, someone researches it, someone else implements a fix, a reviewer pushes back, and things iterate. The system that emerges from that process is more robust than anything one person (or one agent) would ship alone.

The Agyn system—described in a paper published on arXiv—encodes this directly. Rather than routing a GitHub issue through a single agent with a huge context window, it spins up a team:

| Role | Responsibility |
| --- | --- |
| Manager | Coordinates execution and communication; knows when to stop |
| Researcher | Explores the repository, gathers context, writes the specification |
| Engineer | Implements the fix, debugs failures |
| Reviewer | Evaluates the PR and enforces acceptance criteria |

Each agent has a clearly scoped role, runs in its own isolated sandbox, and communicates through standard GitHub artifacts—commits, PR descriptions, and review comments. Just like a real team would.
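The role table above can be sketched as data. This is a hypothetical illustration, not the Agyn implementation: the `Role` fields and the model names assigned per role are assumptions based on the article's description.

```python
from dataclasses import dataclass, field

@dataclass
class Role:
    """A scoped team role: model, reasoning level, tools, and duties fixed up front."""
    name: str
    model: str
    reasoning: str                              # e.g. "medium" or "high"
    tools: list[str] = field(default_factory=list)
    responsibility: str = ""

# Hypothetical team mirroring the four roles described in the article.
TEAM = [
    Role("Manager", "gpt-5", "medium", ["messaging"],
         "coordinate execution and communication; know when to stop"),
    Role("Researcher", "gpt-5", "medium", ["shell", "search"],
         "explore the repository; write the specification"),
    Role("Engineer", "gpt-5-codex", "medium", ["shell", "editor"],
         "implement the fix; debug failures"),
    Role("Reviewer", "gpt-5", "medium", ["diff", "comments"],
         "evaluate the PR; enforce acceptance criteria"),
]
```

Pinning model and tools per role is what makes the later point about cost allocation possible: an expensive high-reasoning model can be given to one role without paying for it everywhere.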


Why Coding‑Agent Teams Work Better Than Solo Agents

A few design decisions make this more than just “more agents”:

  1. Isolated execution environments – Every agent gets its own sandbox with shell access and no shared filesystem. Agents can install dependencies, run tests, and configure their environment without stepping on each other, making failures easy to attribute.

  2. Explicit role enforcement – Each role specifies which model to use, what reasoning level, what tools, and what responsibilities. This prevents the “do everything” trap where a single agent accumulates too much context and starts hallucinating. It also lets you allocate expensive, high‑reasoning models only where they’re needed.

  3. Structured communication, not a fixed pipeline – The Manager dynamically coordinates execution rather than following a static script. If the Reviewer rejects the PR, the Engineer iterates. The system adapts.

  4. Context management for long tasks – Large artifacts are persisted to the filesystem rather than stuffed into the model context. Accumulated context is summarized automatically, allowing the system to run end‑to‑end on complex issues without falling apart.
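Design decisions 2 and 3 can be combined into a minimal coordination sketch. Everything here is a stand-in: the stub agents, the `resolve_issue` loop, and the `Verdict` type are assumptions used to illustrate the Researcher → Engineer → Reviewer iteration, not Agyn's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    approved: bool
    comments: str = ""

# Stub agents standing in for LLM-backed roles (hypothetical interfaces).
class Researcher:
    def write_spec(self, issue):
        return f"Spec for: {issue}"

class Engineer:
    def implement(self, spec):
        # A real Engineer would edit code and run tests in its own sandbox.
        return f"patch({spec!r})"

class Reviewer:
    def __init__(self):
        self.rounds = 0
    def review(self, patch):
        # Reject the first attempt so the feedback loop is exercised.
        self.rounds += 1
        if self.rounds < 2:
            return Verdict(False, "missing regression test")
        return Verdict(True)

def resolve_issue(issue, researcher, engineer, reviewer, max_rounds=3):
    """Manager loop: iterate Engineer <-> Reviewer until approval or give up."""
    spec = researcher.write_spec(issue)
    for _ in range(max_rounds):
        patch = engineer.implement(spec)
        verdict = reviewer.review(patch)
        if verdict.approved:
            return patch
        # Feed the review comments back into the next implementation round.
        spec += f"\n\nReview feedback: {verdict.comments}"
    return None  # the Manager knows when to stop

patch = resolve_issue("fix flaky date parsing", Researcher(), Engineer(), Reviewer())
```

The key property is that the loop is dynamic, not a fixed pipeline: a rejection routes work back to the Engineer with the Reviewer's comments attached, and the Manager bounds how long that can go on.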


Benchmark Results

The team evaluated the system on SWE‑bench Verified, a widely used benchmark where models must resolve real GitHub issues by modifying codebases and producing PRs that pass the project’s test suite.

| System | Model(s) | Resolved |
| --- | --- | --- |
| agyn | GPT‑5 / GPT‑5‑Codex (medium reasoning) | 72.2% |
| OpenHands | GPT‑5 (high reasoning) | 71.8% |
| mini‑SWE‑agent | GPT‑5.2 (high reasoning) | 71.8% |
| mini‑SWE‑agent | GPT‑5 (medium reasoning) | 65.0% |

Key detail: this system wasn’t tuned for the benchmark. The same prompts, role definitions, tools, and execution model used in production were applied directly. It outperformed competitors using higher‑reasoning model variants—without needing them.

The 7.2‑percentage‑point gain over the single‑agent baseline using the same model class (mini‑SWE‑agent with GPT‑5 at medium reasoning, 65.0%) comes purely from the team structure.


What This Means for Agent Design

The paper makes an argument that’s easy to overlook in the current race to improve models: organizational design matters as much as model quality.

  • We’ve spent a lot of energy making individual models smarter. But real‑world software development scaled because of how teams work—division of labor, code review, shared artifacts, iteration. Replicating that structure in an agent system produces measurable gains without touching the underlying model.

Take‑aways

  • Role separation reduces errors. When each agent has a narrow job, there’s less opportunity for confusion and accumulated mistakes.
  • Review loops improve output quality. A dedicated Reviewer can send work back to the Engineer, catching problems before they become permanent.
  • You don’t always need the biggest model. Allocating medium‑reasoning models across a well‑structured team can beat a single high‑reasoning agent doing everything.

What’s Next

The Agyn platform is open source on GitHub.

We believe the future is not a single general‑purpose “super agent,” but teams of specialized agents, organized the way real organizations operate: different roles, different responsibilities, clear coordination, explicit review, shared context. And we’re building toward that vision.

Coming Next

  1. Flexible, Modular Agent Organizations

    • Define custom roles
    • Assign different models per role
    • Configure tools and permissions
    • Isolate execution environments
    • Design explicit coordination flows

    Not a monolith—an organization.

  2. New Agent Communication Paradigms
    Real teams do not operate in a single synchronous loop. They:

    • Open threads
    • Leave structured comments
    • Request reviews
    • Resume work later
    • Escalate decisions

    We are introducing structured communication protocols between agents, including asynchronous collaboration, so coordination can happen across time, not just across steps.
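The asynchronous pattern above can be sketched with a thread-and-mailbox model. This is a minimal illustration of the idea, assuming hypothetical `Thread` and `Mailbox` types; it is not the protocol Agyn plans to ship.

```python
import queue
from dataclasses import dataclass, field

@dataclass
class Message:
    author: str
    body: str

@dataclass
class Thread:
    """A persistent discussion thread an agent can open now and resume later."""
    topic: str
    messages: list[Message] = field(default_factory=list)

    def post(self, author, body):
        self.messages.append(Message(author, body))

class Mailbox:
    """Asynchronous inbox: coordination happens across time, not in one loop."""
    def __init__(self):
        self._q = queue.Queue()
    def send(self, thread):
        self._q.put(thread)
    def poll(self):
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None

# The Reviewer opens a thread and requests changes; the Engineer is not blocked
# waiting for it, and picks the thread up whenever it next polls its mailbox.
inbox = Mailbox()
t = Thread("review requested")
t.post("Reviewer", "Please add a test for the empty-input case.")
inbox.send(t)

resumed = inbox.poll()
resumed.post("Engineer", "Added the test; pushed a new commit.")
```

The point of the design is that the thread, not a synchronous call stack, carries the conversation state, so escalation and resumption fall out naturally.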

The lone‑wolf agent had a good run. The team might take it from here.

Paper: Agyn: A Multi‑Agent System for Team‑Based Software Engineering (arXiv:2602.01465).
