AWS re:Invent 2025 - Code completion to agents: The evolution of development (DVT405)

Published: December 6, 2025 at 07:49 AM EST
3 min read
Source: Dev.to

Introduction

Giovanni Zappella and Laurent Callot (Principal Scientists, Amazon Q Developer) give a brief overview of the evolution of coding agents, from simple code completion to autonomous systems such as the Taxy Code Agent (38% on SWE-bench Verified) and the Logos Agent (51% with sandboxed execution). They also present Huang, a supervisor/sub-agent architecture for handling complex tasks.

Key take‑aways

  • Optimizing agents for specific use cases
  • Defining relevant metrics (e.g., the PolyBench benchmark)
  • Building reliable systems with the Strands Agents SDK and Amazon Bedrock AgentCore
  • Maintaining flexibility as models and customer needs evolve

From Code Completion to Autonomous Agents

Early IDEs

  • Basic syntax highlighting and simple autocompletion.
  • Developers still needed to manage details like semicolons and manual compilation.

Modern IDEs

  • Advanced suggestions for variable names, method signatures, and larger code snippets.
  • Still primarily a productivity aid—speeding up typing rather than writing code autonomously.

First Autonomous Agents

  • Example: Amazon Q Developer CLI.
  • Takes a natural‑language problem description, interacts with the file system, identifies files to modify, and applies changes to achieve a goal.
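
To make that flow concrete, here is a minimal, illustrative agent loop in Python. It is not the Amazon Q Developer CLI's actual implementation; `call_model` is a placeholder for whatever LLM client is used.

```python
import pathlib

def list_files(root: str = ".") -> list[str]:
    """Enumerate source files the agent is allowed to inspect."""
    return [str(p) for p in pathlib.Path(root).rglob("*.py")]

def read_file(path: str) -> str:
    """Return file contents for the model to reason over."""
    return pathlib.Path(path).read_text()

def apply_edit(path: str, new_content: str) -> None:
    """Write the model-proposed revision back to disk."""
    pathlib.Path(path).write_text(new_content)

def run_agent(task: str, call_model) -> str:
    """Toy loop: describe the repo, let the model pick a file to change,
    then let it rewrite that file to satisfy the task."""
    files = list_files()
    target = call_model(
        f"Task: {task}\nFiles: {files}\nReply with the single file to modify."
    ).strip()
    revised = call_model(
        f"Task: {task}\nCurrent content of {target}:\n{read_file(target)}\n"
        "Return the complete revised file."
    )
    apply_edit(target, revised)
    return target
```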

Two Families of Coding Agents

1. Synchronous (Companion) Agents

  • Operate interactively, acting as a developer’s assistant.
  • Help accelerate task completion while the developer remains in control.

2. Asynchronous (Autonomous) Agents

  • Developers delegate tasks; agents work independently, possibly in parallel.
  • Can handle long‑running or short tasks without continuous human supervision.

Typical touch points

  1. Task definition – the developer specifies the goal.
  2. Autonomous execution – the agent performs work (e.g., creates a pull request).
  3. Human review – the developer reviews and merges the changes.
  4. Iterative refinement – the agent may iterate until the code is ready for shipping.
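
A rough sketch of that loop is below; the names (`PullRequest`, `review`) are hypothetical placeholders, not an AWS API. The point is only that the human touches the process at task definition and review time, while the agent iterates in between.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class PullRequest:
    description: str
    diff: str

def delegate(task: str,
             agent: Callable[[str], str],
             review: Callable[[PullRequest], str]) -> PullRequest:
    """Execute -> human review -> refine, until the reviewer approves."""
    feedback: Optional[str] = None
    while True:
        prompt = task if feedback is None else f"{task}\nReviewer feedback: {feedback}"
        pr = PullRequest(description=task, diff=agent(prompt))  # autonomous execution
        verdict = review(pr)                                    # human review
        if verdict.lower() == "approve":
            return pr                                           # ready to merge
        feedback = verdict                                      # iterative refinement
```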

Benchmarking with SWE‑bench

  • SWE‑bench is a benchmark derived from real GitHub issues.

Process

  1. Withhold the human‑written fix; the agent sees only the issue description and the repository as it was before the fix.
  2. Let an agent generate the missing code.
  3. Run the original unit tests; passing tests indicate a correct solution.
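
A stripped-down version of that evaluation loop might look like the following; `checkout_issue_repo` and `generate_patch` are hypothetical stand-ins, since the real SWE-bench harness ships its own tooling.

```python
import subprocess

def evaluate(instances, checkout_issue_repo, generate_patch) -> float:
    """Fraction of issues whose original unit tests pass after the agent's patch."""
    resolved = 0
    for inst in instances:                    # each inst: problem text + test command
        repo_dir = checkout_issue_repo(inst)  # repo state *before* the human fix
        patch = generate_patch(inst["problem_statement"], repo_dir)
        applied = subprocess.run(["git", "apply", "-"], input=patch,
                                 text=True, cwd=repo_dir)
        if applied.returncode != 0:
            continue                          # an unapplicable patch counts as a failure
        tests = subprocess.run(inst["test_command"], shell=True, cwd=repo_dir)
        resolved += int(tests.returncode == 0)
    return resolved / len(instances)
```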

Limitations

  • Unit tests may not cover every edge case.
  • Human reviewers may miss defects.

Despite these limits, the benchmark correlates well with code quality. The authors used SWE‑bench to track progress over time, retroactively computing results for earlier agent versions to illustrate evolution.

Architectural Evolution of Coding Agents

Early Attempts

  • RAG‑based (Retrieval‑Augmented Generation) solutions with modest performance.

Taxy Code Agent

  • Achieved 38% on SWE-bench Verified.
  • Introduced more sophisticated retrieval and reasoning components.

Logos Agent

  • Reached 51% on SWE-bench Verified with sandboxed execution (see the sketch below).
  • Added safety checks and tighter integration with execution environments.
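
The sandbox idea can be illustrated generically (this is not the Logos Agent's actual sandbox): copy the project into a throwaway directory, apply the candidate patch there, and only accept it if the tests pass.

```python
import shutil
import subprocess
import tempfile

def passes_in_sandbox(project_dir: str, patch: str,
                      test_cmd: str = "python -m pytest -q") -> bool:
    """Run a candidate patch's tests in an isolated copy of the project,
    so a bad patch never touches the real checkout."""
    with tempfile.TemporaryDirectory() as sandbox:
        work = shutil.copytree(project_dir, f"{sandbox}/repo")
        applied = subprocess.run(["git", "apply", "-"], input=patch,
                                 text=True, cwd=work)
        if applied.returncode != 0:
            return False
        try:
            tests = subprocess.run(test_cmd, shell=True, cwd=work, timeout=600)
        except subprocess.TimeoutExpired:
            return False
        return tests.returncode == 0
```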

Huang (Supervisor‑Sub‑Agent Architecture)

  • Hierarchical design where a supervisor delegates subtasks to specialized sub‑agents.
  • Enables handling of complex, multi‑step development tasks.
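
As a minimal illustration of the pattern (the talk's architecture details are not reproduced here), a supervisor can be just a planner that routes subtasks to named specialists:

```python
from typing import Callable

Plan = list[tuple[str, str]]  # (specialist name, subtask)

def make_supervisor(plan: Callable[[str], Plan],
                    sub_agents: dict[str, Callable[[str], str]]) -> Callable[[str], list[str]]:
    """Supervisor: decompose the task, delegate each piece, collect results."""
    def run(task: str) -> list[str]:
        return [sub_agents[name](subtask) for name, subtask in plan(task)]
    return run

# Usage sketch (all components hypothetical):
# supervisor = make_supervisor(plan=planner,
#                              sub_agents={"code": code_editor, "tests": test_writer})
# supervisor("Add pagination to the /orders endpoint")
```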

Building Reliable Agents

  • Strands Agents SDK: Provides abstractions for building, testing, and deploying agents.
  • Amazon Bedrock AgentCore: Offers managed infrastructure, model access, and security controls.
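
For flavor, here is a quickstart-style snippet assuming the open-source Strands Agents SDK (`pip install strands-agents`) and its `Agent` and `@tool` primitives; treat it as an assumption about the SDK referenced in the talk, not a transcript of the demo.

```python
from strands import Agent, tool  # assumes the open-source Strands Agents SDK

@tool
def count_todos(source: str) -> int:
    """Count TODO markers in a block of source code."""
    return source.count("TODO")

# The SDK runs the model's tool-use loop; by default it targets a model on
# Amazon Bedrock, so AWS credentials are assumed to be configured.
agent = Agent(tools=[count_todos])
agent("How many TODO markers are in this snippet?\n# TODO: refactor\nprint('hi')")
```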

Best practices

  1. Use case‑driven optimization – tailor the agent’s capabilities to the target workflow.
  2. Create custom metrics – such as PolyBench – to measure performance beyond generic benchmarks.
  3. Maintain modularity – allow swapping models or components as technology evolves.
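
The custom-metric and modularity points can be sketched together: keep the agent behind a narrow model interface so backends can be swapped, and score it with a small domain-specific check rather than only a generic benchmark. All names below are illustrative.

```python
from typing import Callable, Iterable, Protocol

class ModelClient(Protocol):
    """Narrow interface: any backend that can complete a prompt plugs in here."""
    def complete(self, prompt: str) -> str: ...

def build_agent(model: ModelClient) -> Callable[[str], str]:
    """Swapping models means passing a different ModelClient; nothing else changes."""
    return lambda task: model.complete(f"Solve this development task:\n{task}")

def custom_metric(tasks: Iterable[str],
                  agent: Callable[[str], str],
                  check: Callable[[str, str], bool]) -> float:
    """Domain-specific score: fraction of tasks whose output passes `check`
    (compiles, lints, matches a spec) instead of a one-size-fits-all benchmark."""
    tasks = list(tasks)
    return sum(check(t, agent(t)) for t in tasks) / len(tasks)
```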

Lessons Learned

  • Flexibility is essential: Models, APIs, and developer expectations change rapidly.
  • Metrics matter: Standard benchmarks help, but domain‑specific metrics give clearer insight.
  • Iterative development: Continuous evaluation against real‑world tasks drives meaningful improvements.

This article is auto‑generated from the original presentation content and may contain minor typographical errors.
