How we built AEO tracking for coding agents
Source: Vercel Blog
AI‑Driven Search & Summarization – AEO Overview
AI has changed the way people find information. For businesses, this means it’s critical to understand how large language models (LLMs) search for and summarize their web content.
We’re building an AI Engine Optimization (AEO) system to track how models discover, interpret, and reference Vercel and our sites.
From Prototype to Full‑Stack Visibility
- Initial focus: Standard chat models (e.g., GPT, Gemini, Claude).
- Realisation: To obtain a complete picture of visibility, we also need to track coding agents that developers use directly from their terminals or IDEs.
Tracking Standard Models
For standard models, tracking is relatively straightforward. We use AI Gateway to send prompts to dozens of popular models and analyze:
- Responses
- Search behaviour
- Cited sources
```typescript
// Example (pseudo-code): send one prompt to one model via AI Gateway
gateway.sendPrompt({ model: "gpt-4", prompt: "..." });
```
Challenges with Coding Agents
Coding agents behave very differently:
- Invocation method – they are invoked via CLIs rather than pure API calls.
- Environment requirements – they need a full development environment (filesystem, shell, package managers).
- Prompt characteristics – in early sampling, ~20 % of prompts triggered a web search, which mirrors real development workflows and makes source‑accuracy evaluation essential.
New Requirements
- Ephemeral execution environments – each run must be isolated.
- Uniform lifecycle – the process should be consistent regardless of the CLI used.
Solution: Vercel Sandbox provides Linux MicroVMs that spin up in seconds. Each agent run gets its own sandbox and follows a six‑step lifecycle.
Agent Lifecycle (Code View)
```typescript
// Pseudo-type definition for an agent config
interface AgentConfig {
  name: string;                              // Human-readable name
  baseImage: string;                         // Docker image / runtime (Node, Python, …)
  setupCommands?: string[];                  // Extra install steps (e.g., TOML config)
  buildCommand: (prompt: string) => string;  // Returns the CLI command to run
}
```
- `baseImage` – Determines the MicroVM image. Most agents run on Node, but Python runtimes are also supported.
- `setupCommands` – An array because some agents need more than a global install (e.g., Codex also needs a TOML file written to `~/.codex/config.toml`).
- `buildCommand` – A function that takes the user prompt and returns the exact shell command to execute. Each agent’s CLI has its own flags and invocation style.
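To make the shape concrete, here is a hypothetical entry for Claude Code; the `baseImage` value and the CLI flag are illustrative, not the tool's actual interface:

```typescript
// Pseudo-type from above, repeated so this sketch is self-contained
interface AgentConfig {
  name: string;
  baseImage: string;
  setupCommands?: string[];
  buildCommand: (prompt: string) => string;
}

// Hypothetical Claude Code entry -- flag names are illustrative
const claudeCode: AgentConfig = {
  name: "Claude Code",
  baseImage: "node24",
  setupCommands: ["npm install -g @anthropic-ai/claude-code"],
  // JSON.stringify gives naive quoting, good enough for a sketch;
  // real shell escaping needs more care
  buildCommand: (prompt) => `claude-code --prompt ${JSON.stringify(prompt)}`,
};
```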
Centralising Cost & Logging with AI Gateway
We override each provider’s base URL via environment variables inside the sandbox. This makes the agents think they are talking directly to their native endpoints, while all traffic is actually proxied through AI Gateway.
Example: Claude Code
| Variable | Value (inside sandbox) | Purpose |
|---|---|---|
| `ANTHROPIC_BASE_URL` | `http://gateway.internal/anthropic` | Points to AI Gateway instead of `api.anthropic.com`. |
| `ANTHROPIC_API_KEY` | `""` (empty) | The gateway authenticates with its own token; the agent needs no direct provider key. |
The same pattern works for other agents (e.g., override OPENAI_BASE_URL for Codex). Any provider that respects a base‑URL environment variable can be routed in this way.
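The override can be sketched as a provider-to-env lookup; `http://gateway.internal` is the placeholder hostname from the table above, not a real endpoint:

```typescript
// Placeholder gateway hostname for illustration
const GATEWAY_URL = "http://gateway.internal";

// Env vars injected into the sandbox to reroute a provider's SDK/CLI
// through the gateway. Keys are left empty: the gateway holds the
// real credential and authenticates with its own token.
function gatewayEnv(provider: "anthropic" | "openai"): Record<string, string> {
  switch (provider) {
    case "anthropic":
      return {
        ANTHROPIC_BASE_URL: `${GATEWAY_URL}/anthropic`,
        ANTHROPIC_API_KEY: "",
      };
    case "openai":
      return {
        OPENAI_BASE_URL: `${GATEWAY_URL}/openai`,
        OPENAI_API_KEY: "",
      };
    default:
      throw new Error("unknown provider");
  }
}
```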
Normalising Heterogeneous Transcripts
When an agent finishes, we have a raw transcript—a record of everything it did. Unfortunately each agent emits this data in a different format:
| Agent | Output location | Format |
|---|---|---|
| Claude Code | JSONL file on disk | *.jsonl |
| Codex | Streamed to stdout | JSON lines |
| OpenCode | Streamed to stdout | Different JSON schema |
Four‑Stage Normalisation Pipeline
1. Capture – while the sandbox is still running (step 5 of the lifecycle).
   - Claude Code writes a JSONL file → we read it after the run.
   - Codex & OpenCode stream JSON lines to `stdout` → we capture and filter those lines.
2. Raw JSONL consolidation – all agents now produce a single string of raw JSONL lines.
3. Agent-specific parsing – each parser does two things:
   - Tool-name normalisation – map agent-specific names to a set of ~10 canonical names.
   - Message-shape flattening – collapse agent-specific nesting into a unified `TranscriptEvent` type.

   ```typescript
   // Example lookup table (partial)
   const TOOL_MAP = {
     search: "web_fetch",
     http_get: "web_fetch",
     fs_write: "file_write",
     // …
   };
   ```

4. Post-processing – enrich the `TranscriptEvent[]` with structured metadata (e.g., extract file paths from `args.path` vs. `args.file`).
The resulting array (TranscriptEvent[]) is fed into the same brand‑extraction pipeline used for standard model responses, making the downstream system agnostic to the source (model API vs. coding agent).
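Stages 3 and 4 of the pipeline can be sketched against a simplified raw-event shape; the real agent schemas are richer, and the field names here are illustrative:

```typescript
// Unified event shape (simplified)
interface TranscriptEvent {
  tool: string;                   // canonical tool name
  args: Record<string, unknown>;  // raw tool arguments
  path?: string;                  // enriched in post-processing
}

// Canonical-name lookup, as in the TOOL_MAP example above
const TOOL_MAP: Record<string, string> = {
  search: "web_fetch",
  http_get: "web_fetch",
  fs_write: "file_write",
};

// Parse one raw JSONL line into a unified event
function parseLine(line: string): TranscriptEvent {
  const raw = JSON.parse(line) as { tool: string; args: Record<string, unknown> };
  const event: TranscriptEvent = {
    tool: TOOL_MAP[raw.tool] ?? raw.tool, // unknown tools pass through
    args: raw.args,
  };
  // Post-processing: agents disagree on the argument name for file paths
  const path = raw.args.path ?? raw.args.file;
  if (typeof path === "string") event.path = path;
  return event;
}
```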
End‑to‑End Workflow
When a prompt is tagged as type: "agents", the Vercel workflow fans out across all configured agents in parallel, each running in its own sandbox.
```mermaid
flowchart TD
    Prompt["prompt"] --> Vercel["Vercel Workflow"]
    Vercel --> A["Agent A (sandbox)"]
    Vercel --> B["Agent B (sandbox)"]
    Vercel --> C["…"]
    A --> A1["transcript"]
    A1 --> A2["normalised events"]
    B --> B1["transcript"]
    B1 --> B2["normalised events"]
    C --> C1["transcript"]
    C1 --> C2["normalised events"]
    Vercel --> Stats["Aggregate stats (tool calls, web fetches, errors)"]
    Stats --> Brand["Brand extraction pipeline"]
```
- Prompt → Vercel Workflow
- Each Agent runs in its own sandbox, producing a transcript that is turned into normalized events.
- The workflow then aggregates statistics (tool calls, web fetches, errors) and passes the results to the Brand extraction pipeline.
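The fan-out can be sketched with `Promise.allSettled`, so one agent crashing doesn't sink the whole batch; `RunAgent` here is a stand-in for the full sandbox lifecycle, not a real API:

```typescript
interface AgentResult {
  agent: string;
  events: unknown[]; // normalised TranscriptEvent[] in the real pipeline
}

// Stand-in for the lifecycle: create sandbox, install CLI, run, capture, tear down
type RunAgent = (agent: string, prompt: string) => Promise<AgentResult>;

// Fan out across all configured agents in parallel; failed runs are
// dropped rather than failing the batch
async function fanOut(
  agents: string[],
  prompt: string,
  run: RunAgent
): Promise<AgentResult[]> {
  const settled = await Promise.allSettled(agents.map((a) => run(a, prompt)));
  return settled
    .filter((s): s is PromiseFulfilledResult<AgentResult> => s.status === "fulfilled")
    .map((s) => s.value);
}
```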
Further Reading
- Execution Isolation – How to safely run an autonomous agent that can execute arbitrary code.
- Capturing Agent Activity – Techniques for reliably recording what the agent did when each step completes.
Agent‑as‑Code (AEO) Lifecycle
How we run, observe, and extract insights from coding agents.
1. Observability
Create the sandbox
- Spin up a fresh MicroVM.
- Choose the right runtime (Node 24, Python 3.13, etc.).
- Set a hard timeout – the sandbox will kill the agent if it hangs or loops.
Install the agent CLI
- Each agent ships as an npm package (e.g., `@anthropic-ai/claude-code`, `@openai/codex`, `@vercel/open-code`).
- The sandbox installs the package globally so the CLI is available as a shell command:
```shell
npm install -g @anthropic-ai/claude-code   # Claude Code
npm install -g @openai/codex               # Codex
npm install -g @vercel/open-code           # OpenCode
```
Inject credentials
- Instead of giving each agent a direct provider API key, set environment variables that route all LLM calls through Vercel AI Gateway.
- Benefits: unified logging, rate‑limiting, and cost tracking across every agent (even when they use different underlying providers).
- Direct provider keys are still supported if needed.
Run the agent
The only step that differs per agent is the CLI invocation pattern, flags, and config format. From the sandbox’s perspective it is just a shell command, e.g.:
```shell
# Claude Code
claude-code --prompt "Write a function that parses CSV"

# Codex
codex run --task "Generate a React component"
```
Capture the transcript
After the agent finishes, extract a record of what it did:
- Which tools it called.
- Whether it performed a web search.
- What it recommended in the response.
Note: This step is agent‑specific (see “Transcript capture” below).
Tear down
- The sandbox is always stopped (even on error) so resources are never leaked.
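A minimal sketch of that guarantee, assuming a hypothetical sandbox surface (not the actual Vercel Sandbox API):

```typescript
// Minimal hypothetical sandbox surface for illustration
interface Sandbox {
  run(command: string): Promise<string>; // resolves with the raw transcript
  stop(): Promise<void>;
}

// Stop the sandbox no matter how the run ends: success, error, or hang
// killed by the timeout -- the finally block always executes
async function runIsolated(sandbox: Sandbox, command: string): Promise<string> {
  try {
    return await sandbox.run(command);
  } finally {
    await sandbox.stop();
  }
}
```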
2. Transcript Capture
Each agent stores its transcript differently, so we provide a per‑agent parser that normalises:
- Tool names – map the agent‑specific identifiers to a unified set.
- Message shapes – flatten agent‑specific structures into a single unified event type.
Parsing
- Shared post‑processing extracts structured metadata (e.g., URLs, commands) from tool arguments and normalises naming differences.
Enrichment
- Aggregate the unified events into statistics.
- Feed the enriched data into the same brand‑extraction pipeline used for standard model responses.
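The aggregation itself is a small fold over the unified events; the event shape below is a simplified stand-in for the real `TranscriptEvent`:

```typescript
interface TranscriptEvent {
  tool: string; // canonical name, e.g. "web_fetch", "file_write"
}

interface AgentStats {
  toolCalls: number;                // total tool invocations
  webFetches: number;               // how often the agent hit the web
  byTool: Record<string, number>;   // per-tool breakdown
}

function aggregate(events: TranscriptEvent[]): AgentStats {
  const byTool: Record<string, number> = {};
  for (const e of events) byTool[e.tool] = (byTool[e.tool] ?? 0) + 1;
  return {
    toolCalls: events.length,
    webFetches: byTool["web_fetch"] ?? 0,
    byTool,
  };
}
```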
Summary & Brand Extraction
- The final stage produces a concise summary and extracts brand mentions, enabling direct comparison between agents and vanilla LLMs.
3. Lifecycle Stages (Unified View)
| Stage | Description |
|---|---|
| Stage 1 – Transcript capture | Pull the raw agent output (tool calls, web searches, recommendations). |
| Stage 2 – Parsing tool names & message shapes | Normalise tool identifiers and flatten message structures. |
| Stage 3 – Enrichment | Add structured metadata (URLs, commands) and compute stats. |
| Stage 4 – Summary & brand extraction | Produce a human‑readable summary and run brand extraction. |
4. Tool Mapping (Agent‑specific → Unified)
| Unified Action | Claude Code | Codex | OpenCode |
|---|---|---|---|
| Read a file | Read | read_file | read |
| Write a file | Write | write_file | write |
| Edit a file | StrReplace | patch_file | patch |
| Run a command | Bash | shell | bash |
| Search the web | WebFetch | (varies) | (varies) |
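The table translates directly into per-agent lookup maps; the unified identifiers below are illustrative, not the pipeline's actual canonical names:

```typescript
type UnifiedTool =
  | "read_file" | "write_file" | "edit_file" | "run_command" | "web_search";

// Per-agent lookups built from the mapping table above (rows shown only)
const AGENT_TOOL_MAPS: Record<string, Record<string, UnifiedTool>> = {
  "claude-code": {
    Read: "read_file", Write: "write_file", StrReplace: "edit_file",
    Bash: "run_command", WebFetch: "web_search",
  },
  codex: {
    read_file: "read_file", write_file: "write_file",
    patch_file: "edit_file", shell: "run_command",
  },
  opencode: {
    read: "read_file", write: "write_file",
    patch: "edit_file", bash: "run_command",
  },
};

// Fall back to the raw name so unknown tools still surface in stats
function unifyTool(agent: string, tool: string): string {
  return AGENT_TOOL_MAPS[agent]?.[tool] ?? tool;
}
```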
5. Agent‑Specific Transcript Details
| Agent | Transcript quirks |
|---|---|
| Claude Code | Nests messages inside a content property and mixes tool_use blocks into content arrays. |
| Codex | Emits Responses API lifecycle events (thread.started, turn.completed, output_text.delta) alongside tool events. |
| OpenCode | Bundles tool call and result in the same event via part.tool and part.state. |
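As an example of what the parsers absorb, Claude Code's nesting (`tool_use` blocks mixed into `content` arrays) can be flattened like this; the field names are simplified, not the full transcript schema:

```typescript
// Simplified Claude Code message shape (illustrative)
interface ClaudeMessage {
  content: Array<
    | { type: "text"; text: string }
    | { type: "tool_use"; name: string; input: Record<string, unknown> }
  >;
}

interface FlatEvent {
  kind: "text" | "tool";
  tool?: string;
  args?: Record<string, unknown>;
  text?: string;
}

// Walk the content array and emit one flat event per block,
// so downstream code never sees the agent-specific nesting
function flatten(msg: ClaudeMessage): FlatEvent[] {
  return msg.content.map((block) =>
    block.type === "tool_use"
      ? { kind: "tool" as const, tool: block.name, args: block.input }
      : { kind: "text" as const, text: block.text }
  );
}
```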
6. Observations & Findings
- Search frequency – Early tests on a random sample of prompts showed coding agents execute a web search ≈ 20 % of the time. As we collect more data we’ll build a comprehensive view of agent search behaviour.
- Tool recommendations – When an agent suggests a tool, it usually emits working code (e.g., an `import` statement, a config file, or a deployment script). The recommendation is embedded in the output, not merely mentioned in prose.
- Normalization importance – Agent CLI tools ship rapid updates, causing transcript formats to diverge quickly. Building a normalization layer early saved us from constant breakage.
“Transcript formats are a mess. The hard part is everything upstream: getting the agent to run, capturing what it did, and normalising it into a structure you can grade.” – Team note
7. Future Work
- Open‑source the system – Release an OSS version so other teams can run their own AEO evaluations for both standard models and coding agents.
- Full AEO eval methodology – Publish a follow‑up post covering:
- Prompt design.
- Dual‑mode testing (web search vs. training‑data recall).
- Query‑as‑first‑class‑entity architecture.
- Share‑of‑Voice metrics.
- Scale agent coverage – Add more agents as the ecosystem grows and expand prompt types (e.g., full project scaffolding, debugging, performance tuning).
8. Reference List
- Claude Code – `@anthropic-ai/claude-code`
- Codex – `@openai/codex`
- OpenCode – `@vercel/open-code`
Prepared for internal documentation and future open‑source release.