How we built AEO tracking for coding agents

Published: February 9, 2026, 8:00 AM EST
9 min read

Source: Vercel Blog

AI‑Driven Search & Summarization – AEO Overview

AI has changed the way people find information. For businesses, this makes it critical to understand how large language models (LLMs) search for and summarize their web content.

We’re building an AI Engine Optimization (AEO) system to track how models discover, interpret, and reference Vercel and our sites.


From Prototype to Full‑Stack Visibility

  • Initial focus: Standard chat models (e.g., GPT, Gemini, Claude).
  • Realisation: To obtain a complete picture of visibility, we also need to track coding agents that developers use directly from their terminals or IDEs.

Tracking Standard Models

For standard models, tracking is relatively straightforward. We use AI Gateway to send prompts to dozens of popular models and analyze:

  • Responses
  • Search behaviour
  • Cited sources

# Example (pseudo-code)
gateway.sendPrompt(model="gpt-4", prompt="...")

Challenges with Coding Agents

Coding agents behave very differently:

  • Invocation method – they are invoked via CLIs rather than pure API calls.
  • Environment requirements – they need a full development environment (filesystem, shell, package managers).
  • Prompt characteristics – in early sampling, ~20% of prompts triggered a web search, which mirrors real development workflows and makes source‑accuracy evaluation essential.

New Requirements

  1. Ephemeral execution environments – each run must be isolated.
  2. Uniform lifecycle – the process should be consistent regardless of the CLI used.

Solution: Vercel Sandbox provides Linux MicroVMs that spin up in seconds. Each agent run gets its own sandbox and follows a six‑step lifecycle.

Agent Lifecycle (Code View)

// Pseudo‑type definition for an agent config
interface AgentConfig {
  name: string;                     // Human‑readable name
  baseImage: string;                // Docker image / runtime (Node, Python, …)
  setupCommands?: string[];        // Extra install steps (e.g., TOML config)
  buildCommand: (prompt: string) => string; // Returns the CLI command to run
}
  • baseImage – Determines the MicroVM image. Most agents run on Node, but Python runtimes are also supported.
  • setupCommands – An array because some agents need more than a global install (e.g., Codex also needs a TOML file written to ~/.codex/config.toml).
  • buildCommand – A function that takes the user prompt and returns the exact shell command to execute. Each agent’s CLI has its own flags and invocation style.
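For concreteness, here is a hypothetical AgentConfig instance for Claude Code. The npm package name is real; the baseImage label and the CLI flag are illustrative assumptions, not the production values.

```typescript
// Hypothetical AgentConfig instance for Claude Code.
// The npm package name is real; baseImage and the CLI flag are illustrative.
interface AgentConfig {
  name: string;
  baseImage: string;
  setupCommands?: string[];
  buildCommand: (prompt: string) => string;
}

const claudeCode: AgentConfig = {
  name: "Claude Code",
  baseImage: "node24",
  setupCommands: ["npm install -g @anthropic-ai/claude-code"],
  // JSON.stringify quotes and escapes the prompt before shell interpolation
  buildCommand: (prompt) => `claude -p ${JSON.stringify(prompt)}`,
};
```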

Centralising Cost & Logging with AI Gateway

We override each provider’s base URL via environment variables inside the sandbox. This makes the agents think they are talking directly to their native endpoints, while all traffic is actually proxied through AI Gateway.

Example: Claude Code

| Variable | Value (inside sandbox) | Purpose |
| --- | --- | --- |
| ANTHROPIC_BASE_URL | http://gateway.internal/anthropic | Points to AI Gateway instead of api.anthropic.com. |
| ANTHROPIC_API_KEY | "" (empty) | The gateway authenticates with its own token; the agent needs no direct provider key. |

The same pattern works for other agents (e.g., override OPENAI_BASE_URL for Codex). Any provider that respects a base‑URL environment variable can be routed in this way.
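As a sketch, the per-agent overrides can live in a single lookup. The gateway URL and the /openai path are placeholders (not real endpoints), and emptying OPENAI_API_KEY for Codex is an assumption mirroring the Claude Code pattern above.

```typescript
// Sketch: environment overrides injected into each sandbox so that
// provider traffic is proxied through AI Gateway.
// The URL and paths are placeholders, not real endpoints.
const GATEWAY_URL = "http://gateway.internal";

function gatewayEnv(agent: "claude-code" | "codex"): Record<string, string> {
  switch (agent) {
    case "claude-code":
      // Claude Code reads ANTHROPIC_BASE_URL; the key is empty because
      // the gateway authenticates with its own token.
      return { ANTHROPIC_BASE_URL: `${GATEWAY_URL}/anthropic`, ANTHROPIC_API_KEY: "" };
    case "codex":
      // Codex reads OPENAI_BASE_URL.
      return { OPENAI_BASE_URL: `${GATEWAY_URL}/openai`, OPENAI_API_KEY: "" };
  }
}
```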

Normalising Heterogeneous Transcripts

When an agent finishes, we have a raw transcript—a record of everything it did. Unfortunately each agent emits this data in a different format:

| Agent | Output location | Format |
| --- | --- | --- |
| Claude Code | JSONL file on disk | *.jsonl |
| Codex | Streamed to stdout | JSON lines |
| OpenCode | Streamed to stdout | Different JSON schema |

Four‑Stage Normalisation Pipeline

  1. Capture – while the sandbox is still running (step 5 of the lifecycle).

    • Claude Code writes a JSONL file → we read it after the run.
    • Codex & OpenCode stream JSON lines to stdout → we capture and filter those lines.
  2. Raw JSONL Consolidation – all agents now produce a single string of raw JSONL lines.

  3. Agent‑Specific Parsing – each parser does two things:

    • Tool‑name normalisation – map agent‑specific names to a set of ~10 canonical names.
    • Message‑shape flattening – collapse agent‑specific nesting into a unified TranscriptEvent type.
    // Example lookup table (partial)
    const TOOL_MAP = {
      search:   "web_fetch",
      http_get: "web_fetch",
      fs_write: "file_write",
      // …
    };
  4. Post‑Processing – enrich the TranscriptEvent[] with structured metadata (e.g., extract file paths from args.path vs. args.file).

The resulting array (TranscriptEvent[]) is fed into the same brand‑extraction pipeline used for standard model responses, making the downstream system agnostic to the source (model API vs. coding agent).
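A minimal sketch of stages 3 and 4, assuming a simplified raw-event shape. The TranscriptEvent fields shown here are illustrative, not the production type; the lookup table repeats the partial example earlier.

```typescript
// Illustrative unified event type; field names are assumptions.
interface TranscriptEvent {
  tool: string;                      // canonical tool name
  args: Record<string, unknown>;
  filePath?: string;                 // enriched during post-processing
}

// Partial lookup table, as in the example earlier
const TOOL_MAP: Record<string, string> = {
  search: "web_fetch",
  http_get: "web_fetch",
  fs_write: "file_write",
};

function normalize(raw: { tool: string; args: Record<string, unknown> }): TranscriptEvent {
  const event: TranscriptEvent = {
    tool: TOOL_MAP[raw.tool] ?? raw.tool, // unknown tools pass through unchanged
    args: raw.args,
  };
  // Post-processing: agents disagree on which argument key holds a file path
  const path = raw.args.path ?? raw.args.file;
  if (typeof path === "string") event.filePath = path;
  return event;
}
```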

End‑to‑End Workflow

When a prompt is tagged as type: "agents", the Vercel workflow fans out across all configured agents in parallel, each running in its own sandbox.

flowchart TD
    Prompt["prompt"] --> Vercel["Vercel Workflow"]
    Vercel --> A["Agent A (sandbox)"]
    Vercel --> B["Agent B (sandbox)"]
    Vercel --> C["…"]
    
    A --> A1["transcript"]
    A1 --> A2["normalised events"]
    
    B --> B1["transcript"]
    B1 --> B2["normalised events"]
    
    C --> C1["transcript"]
    C1 --> C2["normalised events"]
    
    Vercel --> Stats["Aggregate stats (tool calls, web fetches, errors)"]
    Stats --> Brand["Brand extraction pipeline"]
  • The prompt enters the Vercel Workflow.
  • Each Agent runs in its own sandbox, producing a transcript that is turned into normalized events.
  • The workflow then aggregates statistics (tool calls, web fetches, errors) and passes the results to the brand extraction pipeline.
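The fan-out step can be sketched with Promise.all over the configured agents; runInSandbox here is a hypothetical stand-in for the full per-agent lifecycle (create sandbox, run, capture, normalize).

```typescript
// Sketch: fan one prompt out across all configured agents in parallel.
// runInSandbox is a hypothetical helper that runs one agent lifecycle
// and resolves to that agent's normalized events.
async function runAgents(
  prompt: string,
  agents: string[],
  runInSandbox: (agent: string, prompt: string) => Promise<string[]>,
): Promise<Map<string, string[]>> {
  const results = await Promise.all(
    agents.map(async (a) => [a, await runInSandbox(a, prompt)] as const),
  );
  // Map of agent name -> normalized events for downstream aggregation
  return new Map(results);
}
```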

Further Reading

  • Execution Isolation – How to safely run an autonomous agent that can execute arbitrary code.
  • Capturing Agent Activity – Techniques for reliably recording what the agent did when each step completes.



Agent‑as‑Code (AEO) Lifecycle

How we run, observe, and extract insights from coding agents.


1. Agent Run Lifecycle

Create the sandbox

  • Spin up a fresh Linux MicroVM with the right runtime (Node 24, Python 3.13, etc.).
  • Set a hard timeout – the sandbox will kill the agent if it hangs or loops.

Install the agent CLI

  • Each agent ships as an npm package (e.g., @anthropic-ai/claude-code, @openai/codex, @vercel/open-code).
  • The sandbox installs the package globally so the CLI is available as a shell command.

npm install -g @anthropic-ai/claude-code   # Claude Code
npm install -g @openai/codex               # Codex
npm install -g @vercel/open-code           # OpenCode

Inject credentials

  • Instead of giving each agent a direct provider API key, set environment variables that route all LLM calls through Vercel AI Gateway.
  • Benefits: unified logging, rate‑limiting, and cost tracking across every agent (even when they use different underlying providers).
  • Direct provider keys are still supported if needed.

Run the agent

The only step that differs per agent is the CLI invocation pattern, flags, and config format. From the sandbox’s perspective it is just a shell command, e.g.:

# Claude Code (non-interactive "print" mode)
claude -p "Write a function that parses CSV"

# Codex (non-interactive exec mode)
codex exec "Generate a React component"

Capture the transcript

After the agent finishes, extract a record of what it did:

  • Which tools it called.
  • Whether it performed a web search.
  • What it recommended in the response.

Note: This step is agent‑specific (see “Transcript capture” below).

Tear down

  • The sandbox is always stopped (even on error) so resources are never leaked.
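The "always stopped, even on error" guarantee maps naturally onto try/finally. SandboxHandle and withSandbox below are hypothetical names sketching that pattern, not the actual SDK surface.

```typescript
// Sketch: guarantee sandbox teardown with try/finally.
// SandboxHandle and withSandbox are hypothetical, not the real SDK.
interface SandboxHandle {
  run(cmd: string): Promise<string>;
  stop(): Promise<void>;
}

async function withSandbox<T>(
  sandbox: SandboxHandle,
  fn: (s: SandboxHandle) => Promise<T>,
): Promise<T> {
  try {
    return await fn(sandbox);
  } finally {
    // Always runs, even if fn throws, so MicroVMs are never leaked.
    await sandbox.stop();
  }
}
```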

2. Transcript Capture

Each agent stores its transcript differently, so we provide a per‑agent parser that normalises:

  1. Tool names – map the agent‑specific identifiers to a unified set.
  2. Message shapes – flatten agent‑specific structures into a single unified event type.

Parsing

  • Shared post‑processing extracts structured metadata (e.g., URLs, commands) from tool arguments and normalises naming differences.

Enrichment

  • Aggregate the unified events into statistics.
  • Feed the enriched data into the same brand‑extraction pipeline used for standard model responses.
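The aggregation step can be sketched as a fold over the unified events; the event shape and the two counters shown are illustrative simplifications of the real stats.

```typescript
// Sketch: aggregate normalized events into run-level statistics.
// The event shape and counters are illustrative.
interface AgentEvent { tool: string }

function aggregate(events: AgentEvent[]): { toolCalls: number; webFetches: number } {
  return {
    toolCalls: events.length,
    // Count events whose canonical tool name marks a web fetch
    webFetches: events.filter((e) => e.tool === "web_fetch").length,
  };
}
```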

Summary & Brand Extraction

  • The final stage produces a concise summary and extracts brand mentions, enabling direct comparison between agents and vanilla LLMs.

3. Lifecycle Stages (Unified View)

| Stage | Description |
| --- | --- |
| Stage 1 – Transcript capture | Pull the raw agent output (tool calls, web searches, recommendations). |
| Stage 2 – Parsing tool names & message shapes | Normalise tool identifiers and flatten message structures. |
| Stage 3 – Enrichment | Add structured metadata (URLs, commands) and compute stats. |
| Stage 4 – Summary & brand extraction | Produce a human‑readable summary and run brand extraction. |

4. Tool Mapping (Agent‑specific → Unified)

| Unified action | Claude Code | Codex | OpenCode |
| --- | --- | --- | --- |
| Read a file | Read | read_file | read |
| Write a file | Write | write_file | write |
| Edit a file | StrReplace | patch_file | patch |
| Run a command | Bash | shell | bash |
| Search the web | WebFetch | (varies) | (varies) |
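The mapping above can be encoded directly as per-agent lookup tables. The canonical names on the right-hand side are assumptions in the file_write / web_fetch style used earlier, and the varying web-search tool names are omitted.

```typescript
// Per-agent tool-name tables. Canonical names (right-hand side) follow the
// file_write / web_fetch style used earlier; they are illustrative.
const AGENT_TOOL_MAPS: Record<string, Record<string, string>> = {
  "claude-code": {
    Read: "file_read",
    Write: "file_write",
    StrReplace: "file_edit",
    Bash: "shell",
    WebFetch: "web_fetch",
  },
  codex: {
    read_file: "file_read",
    write_file: "file_write",
    patch_file: "file_edit",
    shell: "shell",
  },
  opencode: {
    read: "file_read",
    write: "file_write",
    patch: "file_edit",
    bash: "shell",
  },
};

// Unknown tools pass through unchanged, so a new CLI release that adds a
// tool does not crash parsing.
function canonicalTool(agent: string, tool: string): string {
  return AGENT_TOOL_MAPS[agent]?.[tool] ?? tool;
}
```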

5. Agent‑Specific Transcript Details

| Agent | Transcript quirks |
| --- | --- |
| Claude Code | Nests messages inside a content property and mixes tool_use blocks into content arrays. |
| Codex | Emits Responses API lifecycle events (thread.started, turn.completed, output_text.delta) alongside tool events. |
| OpenCode | Bundles tool call and result in the same event via part.tool and part.state. |
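As one example of shape flattening, a Claude Code-style message with tool_use blocks mixed into a content array can be unwound like this. The block shapes are simplified illustrations, not the full schema.

```typescript
// Simplified Claude Code-style message blocks; not the full schema.
interface ToolUseBlock { type: "tool_use"; name: string; input: Record<string, unknown> }
interface TextBlock { type: "text"; text: string }
type Block = ToolUseBlock | TextBlock;

// Pull the tool_use blocks out of the mixed content array.
function extractToolUses(message: { content: Block[] }): ToolUseBlock[] {
  return message.content.filter((b): b is ToolUseBlock => b.type === "tool_use");
}
```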

6. Observations & Findings

  • Search frequency – Early tests on a random sample of prompts showed coding agents execute a web search ≈ 20 % of the time. As we collect more data we’ll build a comprehensive view of agent search behaviour.
  • Tool recommendations – When an agent suggests a tool, it usually emits working code (e.g., an import statement, a config file, or a deployment script). The recommendation is embedded in the output, not merely mentioned in prose.
  • Normalization importance – Agent CLI tools ship rapid updates, causing transcript formats to diverge quickly. Building a normalization layer early saved us from constant breakage.

“Transcript formats are a mess. The hard part is everything upstream: getting the agent to run, capturing what it did, and normalising it into a structure you can grade.” – Team note

7. Future Work

  1. Open‑source the system – Release an OSS version so other teams can run their own AEO evaluations for both standard models and coding agents.
  2. Full AEO eval methodology – Publish a follow‑up post covering:
    • Prompt design.
    • Dual‑mode testing (web search vs. training‑data recall).
    • Query‑as‑first‑class‑entity architecture.
    • Share‑of‑Voice metrics.
  3. Scale agent coverage – Add more agents as the ecosystem grows and expand prompt types (e.g., full project scaffolding, debugging, performance tuning).

8. Reference List

  • Claude Code – @anthropic-ai/claude-code
  • Codex – @openai/codex
  • OpenCode – @vercel/open-code

Prepared for internal documentation and future open‑source release.
