How we built AEO tracking for coding agents

Published: February 9, 2026, 8:00 AM EST
9 min read

Source: Vercel Blog

AI‑Driven Search & Summarization – AEO Overview

AI has changed the way people find information. For businesses, this makes it critical to understand how large language models (LLMs) search for and summarize their web content.

We’re building an AI Engine Optimization (AEO) system to track how models discover, interpret, and reference Vercel and our sites.


From Prototype to Full‑Stack Visibility

  • Initial focus: Standard chat models (e.g., GPT, Gemini, Claude).
  • Realisation: To obtain a complete picture of visibility, we also need to track coding agents that developers use directly from their terminals or IDEs.

Tracking Standard Models

For standard models, tracking is relatively straightforward. We use AI Gateway to send prompts to dozens of popular models and analyze:

  • Responses
  • Search behaviour
  • Cited sources

# Example (pseudo-code)
gateway.sendPrompt(model="gpt-4", prompt="...")

Challenges with Coding Agents

Coding agents behave very differently:

  • Invocation method – they are invoked via CLIs rather than pure API calls.
  • Environment requirements – they need a full development environment (filesystem, shell, package managers).
  • Prompt characteristics – in early sampling, ~20% of prompts triggered a web search, which mirrors real development workflows and makes source‑accuracy evaluation essential.

New Requirements

  1. Ephemeral execution environments – each run must be isolated.
  2. Uniform lifecycle – the process should be consistent regardless of the CLI used.

Solution: Vercel Sandbox provides Linux MicroVMs that spin up in seconds. Each agent run gets its own sandbox and follows a six‑step lifecycle.

Agent Lifecycle (Code View)

// Pseudo‑type definition for an agent config
interface AgentConfig {
  name: string;                     // Human‑readable name
  baseImage: string;                // Docker image / runtime (Node, Python, …)
  setupCommands?: string[];        // Extra install steps (e.g., TOML config)
  buildCommand: (prompt: string) => string; // Returns the CLI command to run
}
  • baseImage – Determines the MicroVM image. Most agents run on Node, but Python runtimes are also supported.
  • setupCommands – An array because some agents need more than a global install (e.g., Codex also needs a TOML file written to ~/.codex/config.toml).
  • buildCommand – A function that takes the user prompt and returns the exact shell command to execute. Each agent’s CLI has its own flags and invocation style.
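For concreteness, here is a hypothetical AgentConfig instance for Claude Code. The npm package name is real; the baseImage label and the CLI flag are illustrative assumptions, not the production values.

```typescript
// Hypothetical AgentConfig instance for Claude Code.
// The npm package name is real; baseImage and the CLI flag are illustrative.
interface AgentConfig {
  name: string;
  baseImage: string;
  setupCommands?: string[];
  buildCommand: (prompt: string) => string;
}

const claudeCode: AgentConfig = {
  name: "Claude Code",
  baseImage: "node24",
  setupCommands: ["npm install -g @anthropic-ai/claude-code"],
  // JSON.stringify quotes and escapes the prompt before shell interpolation
  buildCommand: (prompt) => `claude -p ${JSON.stringify(prompt)}`,
};
```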

Centralising Cost & Logging with AI Gateway

We override each provider’s base URL via environment variables inside the sandbox. This makes the agents think they are talking directly to their native endpoints, while all traffic is actually proxied through AI Gateway.

Example: Claude Code

| Variable | Value (inside sandbox) | Purpose |
| --- | --- | --- |
| ANTHROPIC_BASE_URL | http://gateway.internal/anthropic | Points to AI Gateway instead of api.anthropic.com. |
| ANTHROPIC_API_KEY | "" (empty) | The gateway authenticates with its own token; the agent needs no direct provider key. |

The same pattern works for other agents (e.g., override OPENAI_BASE_URL for Codex). Any provider that respects a base‑URL environment variable can be routed in this way.
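As a sketch, the per-agent overrides can live in a single lookup. The gateway URL and the /openai path are placeholders (not real endpoints), and emptying OPENAI_API_KEY for Codex is an assumption mirroring the Claude Code pattern above.

```typescript
// Sketch: environment overrides injected into each sandbox so that
// provider traffic is proxied through AI Gateway.
// The URL and paths are placeholders, not real endpoints.
const GATEWAY_URL = "http://gateway.internal";

function gatewayEnv(agent: "claude-code" | "codex"): Record<string, string> {
  switch (agent) {
    case "claude-code":
      // Claude Code reads ANTHROPIC_BASE_URL; the key is empty because
      // the gateway authenticates with its own token.
      return { ANTHROPIC_BASE_URL: `${GATEWAY_URL}/anthropic`, ANTHROPIC_API_KEY: "" };
    case "codex":
      // Codex reads OPENAI_BASE_URL.
      return { OPENAI_BASE_URL: `${GATEWAY_URL}/openai`, OPENAI_API_KEY: "" };
  }
}
```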

Normalising Heterogeneous Transcripts

When an agent finishes, we have a raw transcript—a record of everything it did. Unfortunately each agent emits this data in a different format:

| Agent | Output location | Format |
| --- | --- | --- |
| Claude Code | JSONL file on disk | *.jsonl |
| Codex | Streamed to stdout | JSON lines |
| OpenCode | Streamed to stdout | Different JSON schema |

Four‑Stage Normalisation Pipeline

  1. Capture – while the sandbox is still running (step 5 of the lifecycle).

    • Claude Code writes a JSONL file → we read it after the run.
    • Codex & OpenCode stream JSON lines to stdout → we capture and filter those lines.
  2. Raw JSONL Consolidation – all agents now produce a single string of raw JSONL lines.

  3. Agent‑Specific Parsing – each parser does two things:

    • Tool‑name normalisation – map agent‑specific names to a set of ~10 canonical names.
    • Message‑shape flattening – collapse agent‑specific nesting into a unified TranscriptEvent type.
    // Example lookup table (partial)
    const TOOL_MAP = {
      search:   "web_fetch",
      http_get: "web_fetch",
      fs_write: "file_write",
      // …
    };
  4. Post‑Processing – enrich the TranscriptEvent[] with structured metadata (e.g., extract file paths from args.path vs. args.file).

The resulting array (TranscriptEvent[]) is fed into the same brand‑extraction pipeline used for standard model responses, making the downstream system agnostic to the source (model API vs. coding agent).
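A minimal sketch of stages 3 and 4, assuming a simplified raw-event shape. The TranscriptEvent fields shown here are illustrative, not the production type; the lookup table repeats the partial example earlier.

```typescript
// Illustrative unified event type; field names are assumptions.
interface TranscriptEvent {
  tool: string;                      // canonical tool name
  args: Record<string, unknown>;
  filePath?: string;                 // enriched during post-processing
}

// Partial lookup table, as in the example earlier
const TOOL_MAP: Record<string, string> = {
  search: "web_fetch",
  http_get: "web_fetch",
  fs_write: "file_write",
};

function normalize(raw: { tool: string; args: Record<string, unknown> }): TranscriptEvent {
  const event: TranscriptEvent = {
    tool: TOOL_MAP[raw.tool] ?? raw.tool, // unknown tools pass through unchanged
    args: raw.args,
  };
  // Post-processing: agents disagree on which argument key holds a file path
  const path = raw.args.path ?? raw.args.file;
  if (typeof path === "string") event.filePath = path;
  return event;
}
```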

End‑to‑End Workflow

When a prompt is tagged as type: "agents", the Vercel workflow fans out across all configured agents in parallel, each running in its own sandbox.

flowchart TD
    Prompt["prompt"] --> Vercel["Vercel Workflow"]
    Vercel --> A["Agent A (sandbox)"]
    Vercel --> B["Agent B (sandbox)"]
    Vercel --> C["…"]
    
    A --> A1["transcript"]
    A1 --> A2["normalised events"]
    
    B --> B1["transcript"]
    B1 --> B2["normalised events"]
    
    C --> C1["transcript"]
    C1 --> C2["normalised events"]
    
    Vercel --> Stats["Aggregate stats (tool calls, web fetches, errors)"]
    Stats --> Brand["Brand extraction pipeline"]
  • The prompt enters the Vercel Workflow.
  • Each Agent runs in its own sandbox, producing a transcript that is turned into normalized events.
  • The workflow then aggregates statistics (tool calls, web fetches, errors) and passes the results to the brand extraction pipeline.
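The fan-out step can be sketched with Promise.all over the configured agents; runInSandbox here is a hypothetical stand-in for the full per-agent lifecycle (create sandbox, run, capture, normalize).

```typescript
// Sketch: fan one prompt out across all configured agents in parallel.
// runInSandbox is a hypothetical helper that runs one agent lifecycle
// and resolves to that agent's normalized events.
async function runAgents(
  prompt: string,
  agents: string[],
  runInSandbox: (agent: string, prompt: string) => Promise<string[]>,
): Promise<Map<string, string[]>> {
  const results = await Promise.all(
    agents.map(async (a) => [a, await runInSandbox(a, prompt)] as const),
  );
  // Map of agent name -> normalized events for downstream aggregation
  return new Map(results);
}
```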

Further Reading

  • Execution Isolation – How to safely run an autonomous agent that can execute arbitrary code.
  • Capturing Agent Activity – Techniques for reliably recording what the agent did when each step completes.



Agent‑as‑Code (AEO) Lifecycle

How we run, observe, and extract insights from coding agents.


1. Agent Run Lifecycle

Create the sandbox

  • Spin up a fresh Linux MicroVM with the right runtime (Node 24, Python 3.13, etc.).
  • Set a hard timeout – the sandbox will kill the agent if it hangs or loops.

Install the agent CLI

  • Each agent ships as an npm package (e.g., @anthropic-ai/claude-code, @openai/codex, @vercel/open-code).
  • The sandbox installs the package globally so the CLI is available as a shell command.

npm install -g @anthropic-ai/claude-code   # Claude Code
npm install -g @openai/codex               # Codex
npm install -g @vercel/open-code           # OpenCode

Inject credentials

  • Instead of giving each agent a direct provider API key, set environment variables that route all LLM calls through Vercel AI Gateway.
  • Benefits: unified logging, rate‑limiting, and cost tracking across every agent (even when they use different underlying providers).
  • Direct provider keys are still supported if needed.

Run the agent

The only step that differs per agent is the CLI invocation pattern, flags, and config format. From the sandbox’s perspective it is just a shell command, e.g.:

# Claude Code (non-interactive "print" mode)
claude -p "Write a function that parses CSV"

# Codex (non-interactive exec mode)
codex exec "Generate a React component"

Capture the transcript

After the agent finishes, extract a record of what it did:

  • Which tools it called.
  • Whether it performed a web search.
  • What it recommended in the response.

Note: This step is agent‑specific (see “Transcript capture” below).

Tear down

  • The sandbox is always stopped (even on error) so resources are never leaked.
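The "always stopped, even on error" guarantee maps naturally onto try/finally. SandboxHandle and withSandbox below are hypothetical names sketching that pattern, not the actual SDK surface.

```typescript
// Sketch: guarantee sandbox teardown with try/finally.
// SandboxHandle and withSandbox are hypothetical, not the real SDK.
interface SandboxHandle {
  run(cmd: string): Promise<string>;
  stop(): Promise<void>;
}

async function withSandbox<T>(
  sandbox: SandboxHandle,
  fn: (s: SandboxHandle) => Promise<T>,
): Promise<T> {
  try {
    return await fn(sandbox);
  } finally {
    // Always runs, even if fn throws, so MicroVMs are never leaked.
    await sandbox.stop();
  }
}
```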

2. Transcript Capture

Each agent stores its transcript differently, so we provide a per‑agent parser that normalises:

  1. Tool names – map the agent‑specific identifiers to a unified set.
  2. Message shapes – flatten agent‑specific structures into a single unified event type.

Parsing

  • Shared post‑processing extracts structured metadata (e.g., URLs, commands) from tool arguments and normalises naming differences.

Enrichment

  • Aggregate the unified events into statistics.
  • Feed the enriched data into the same brand‑extraction pipeline used for standard model responses.
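The aggregation step can be sketched as a fold over the unified events; the event shape and the two counters shown are illustrative simplifications of the real stats.

```typescript
// Sketch: aggregate normalized events into run-level statistics.
// The event shape and counters are illustrative.
interface AgentEvent { tool: string }

function aggregate(events: AgentEvent[]): { toolCalls: number; webFetches: number } {
  return {
    toolCalls: events.length,
    // Count events whose canonical tool name marks a web fetch
    webFetches: events.filter((e) => e.tool === "web_fetch").length,
  };
}
```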

Summary & Brand Extraction

  • The final stage produces a concise summary and extracts brand mentions, enabling direct comparison between agents and vanilla LLMs.

3. Lifecycle Stages (Unified View)

| Stage | Description |
| --- | --- |
| Stage 1 – Transcript capture | Pull the raw agent output (tool calls, web searches, recommendations). |
| Stage 2 – Parsing tool names & message shapes | Normalise tool identifiers and flatten message structures. |
| Stage 3 – Enrichment | Add structured metadata (URLs, commands) and compute stats. |
| Stage 4 – Summary & brand extraction | Produce a human‑readable summary and run brand extraction. |

4. Tool Mapping (Agent‑specific → Unified)

| Unified action | Claude Code | Codex | OpenCode |
| --- | --- | --- | --- |
| Read a file | Read | read_file | read |
| Write a file | Write | write_file | write |
| Edit a file | StrReplace | patch_file | patch |
| Run a command | Bash | shell | bash |
| Search the web | WebFetch | (varies) | (varies) |
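The mapping above can be encoded directly as per-agent lookup tables. The canonical names on the right-hand side are assumptions in the file_write / web_fetch style used earlier, and the varying web-search tool names are omitted.

```typescript
// Per-agent tool-name tables. Canonical names (right-hand side) follow the
// file_write / web_fetch style used earlier; they are illustrative.
const AGENT_TOOL_MAPS: Record<string, Record<string, string>> = {
  "claude-code": {
    Read: "file_read",
    Write: "file_write",
    StrReplace: "file_edit",
    Bash: "shell",
    WebFetch: "web_fetch",
  },
  codex: {
    read_file: "file_read",
    write_file: "file_write",
    patch_file: "file_edit",
    shell: "shell",
  },
  opencode: {
    read: "file_read",
    write: "file_write",
    patch: "file_edit",
    bash: "shell",
  },
};

// Unknown tools pass through unchanged, so a new CLI release that adds a
// tool does not crash parsing.
function canonicalTool(agent: string, tool: string): string {
  return AGENT_TOOL_MAPS[agent]?.[tool] ?? tool;
}
```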

5. Agent‑Specific Transcript Details

| Agent | Transcript quirks |
| --- | --- |
| Claude Code | Nests messages inside a content property and mixes tool_use blocks into content arrays. |
| Codex | Emits Responses API lifecycle events (thread.started, turn.completed, output_text.delta) alongside tool events. |
| OpenCode | Bundles tool call and result in the same event via part.tool and part.state. |
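As one example of shape flattening, a Claude Code-style message with tool_use blocks mixed into a content array can be unwound like this. The block shapes are simplified illustrations, not the full schema.

```typescript
// Simplified Claude Code-style message blocks; not the full schema.
interface ToolUseBlock { type: "tool_use"; name: string; input: Record<string, unknown> }
interface TextBlock { type: "text"; text: string }
type Block = ToolUseBlock | TextBlock;

// Pull the tool_use blocks out of the mixed content array.
function extractToolUses(message: { content: Block[] }): ToolUseBlock[] {
  return message.content.filter((b): b is ToolUseBlock => b.type === "tool_use");
}
```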

6. Observations & Findings

  • Search frequency – Early tests on a random sample of prompts showed coding agents execute a web search ≈ 20 % of the time. As we collect more data we’ll build a comprehensive view of agent search behaviour.
  • Tool recommendations – When an agent suggests a tool, it usually emits working code (e.g., an import statement, a config file, or a deployment script). The recommendation is embedded in the output, not merely mentioned in prose.
  • Normalization importance – Agent CLI tools ship rapid updates, causing transcript formats to diverge quickly. Building a normalization layer early saved us from constant breakage.

“Transcript formats are a mess. The hard part is everything upstream: getting the agent to run, capturing what it did, and normalising it into a structure you can grade.” – Team note

7. Future Work

  1. Open‑source the system – Release an OSS version so other teams can run their own AEO evaluations for both standard models and coding agents.
  2. Full AEO eval methodology – Publish a follow‑up post covering:
    • Prompt design.
    • Dual‑mode testing (web search vs. training‑data recall).
    • Query‑as‑first‑class‑entity architecture.
    • Share‑of‑Voice metrics.
  3. Scale agent coverage – Add more agents as the ecosystem grows and expand prompt types (e.g., full project scaffolding, debugging, performance tuning).

8. Reference List

  • Claude Code – @anthropic-ai/claude-code
  • Codex – @openai/codex
  • OpenCode – @vercel/open-code

Prepared for internal documentation and future open‑source release.
