We built an AI that audits other AI agents (here's how A2A works in production)
Source: Dev.to
The audit report arrived at 2:47 am, just after I’d triggered a test run out of habit. It contained a score, a six‑dimension breakdown, and a remediation plan with specific line numbers.
Both the auditor and the target were AIs, and the entire exchange consisted of seven natural‑language turns with zero human involvement. This is what agent‑to‑agent (A2A) looks like in production—a working system where one agent interrogates another.
The hidden cost of token waste
Most teams building on top of LLMs focus on output quality, latency, and user satisfaction, but token efficiency is rarely measured. In our testing, production agents consistently waste 40–60% of their token budget on fixable issues such as:
- System prompts that contain three times more context than needed
- Default model selection rather than fit‑based selection
- Retrieved context that is 80 % irrelevant to the query
- Identical calls made repeatedly with no caching
- Sequential requests that could be batched
The root cause isn’t negligence; it’s the lack of a feedback loop. Teams receive only a monthly invoice, not a bill broken down by inefficiency type.
From dashboards to architectural introspection
A simple dashboard that shows metrics and lets you tweak settings manually seemed like the obvious solution. We built that first, but it didn’t work because the most interesting inefficiencies aren’t visible in logs—they’re architectural, embedded in how an agent thinks (its prompts, routing logic, memory handling, etc.). Those details can’t be inferred from request/response pairs alone.
Asking the agent directly
Enter Gary, the auditing agent. Gary asks seven natural‑language questions designed to elicit architectural information from the target agent without any code changes or SDK integration:
- Model routing – Which models do you use, and how do you decide between them?
- System prompt scope – What’s in your system prompt, and roughly how long is it?
- Context handling – How do you decide what context to include in each call?
- Output constraints – Do you limit response length? How?
- Retrieval strategy – Do you use RAG? How do you chunk and retrieve?
- Caching – Do you cache any LLM responses? Under what conditions?
- Batching – Do you ever group multiple requests into a single LLM call?
The target agent answers in natural language. Gary then infers architectural patterns, scores each dimension (0–100), and provides a brief finding with a concrete remediation step.
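One way to represent the per-dimension output is a small record per dimension. The field names here are my own, not the service's published schema:

```python
from dataclasses import dataclass

@dataclass
class DimensionResult:
    # One of the six scored dimensions in an audit report.
    name: str
    score: int        # 0-100
    finding: str
    remediation: str

result = DimensionResult(
    name="Model Selection Fit",
    score=62,
    finding="All queries routed through GPT-4o.",
    remediation="Add a complexity-based router.",
)
```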
Real audit example (RAG‑based customer‑support agent)
Model Selection Fit – 62/100
Finding: All queries, including simple FAQ lookups, are routed through GPT‑4o.
Remediation: Add a router layer that classifies query complexity. Simple intents (confidence > 0.85) should use GPT‑4o‑mini, which is roughly 15× cheaper.
Estimated saving: 35–45% of model spend.
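The remediation amounts to a few lines of routing logic. A minimal sketch, assuming a classifier that already emits an intent-confidence score (the threshold matches the one above; the function name is hypothetical):

```python
def route_model(intent_confidence: float, threshold: float = 0.85) -> str:
    # High-confidence simple intents go to the cheaper model;
    # everything else falls through to the large model.
    if intent_confidence > threshold:
        return "gpt-4o-mini"
    return "gpt-4o"
```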
Context Window Usage – 71/100
Finding: Full conversation history is prepended to every call. In long conversations, 60–80% of the context window consists of prior turns.
Remediation: Implement a sliding window with summarisation: keep the last three turns verbatim and summarise earlier turns into a 200‑token block.
Estimated saving: 20–30% per call on conversations > 5 turns.
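A sliding window with summarisation can be sketched as follows. This is an assumption-laden illustration: in a real agent, the summary would come from an LLM call, whereas here a stub simply truncates the older turns:

```python
def build_context(turns: list[str], keep_verbatim: int = 3,
                  summary_budget: int = 200) -> list[str]:
    # Keep the last `keep_verbatim` turns verbatim; collapse everything
    # earlier into a single bounded block.
    older, recent = turns[:-keep_verbatim], turns[-keep_verbatim:]
    if not older:
        return recent
    summary = " ".join(older)[:summary_budget]  # stand-in for an LLM summary
    return [f"[summary] {summary}"] + recent
```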
The overall score is a weighted average; below 70 indicates real waste, while above 85 suggests the agent is well‑optimised.
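The weighted average itself is straightforward; the weights below are invented for illustration, since the article doesn't publish Gary's actual weighting:

```python
def overall_score(scores: dict[str, int], weights: dict[str, float]) -> float:
    # Weighted average of per-dimension scores (each 0-100).
    total = sum(weights.values())
    return sum(scores[d] * weights[d] for d in scores) / total

score = overall_score(
    {"model_selection": 62, "context_window": 71},
    {"model_selection": 1.0, "context_window": 1.0},
)
```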
A2A protocol details
The audit endpoint lives at https://botlington.com/a2a and implements the emerging A2A protocol—JSON‑RPC over HTTPS with tasks/send and tasks/get methods, plus SSE for streaming.
Example request (JSON‑RPC)
POST /a2a
{
  "jsonrpc": "2.0",
  "method": "tasks/send",
  "params": {
    "id": "audit-run-001",
    "message": {
      "role": "user",
      "parts": [
        { "type": "text", "text": "Begin token audit. API key: YOUR_KEY" }
      ]
    }
  }
}
Gary responds with the first question; the client agent answers. After seven turns, Gary delivers the full audit. The client agent doesn’t need to understand the audit itself—just to answer truthfully, which most agents can do.
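A minimal Python client for this exchange might look like the sketch below. It builds the same `tasks/send` envelope as the example above and posts it with the standard library; `make_task_send` and `send` are my own names, not part of any SDK:

```python
import json
import urllib.request

A2A_URL = "https://botlington.com/a2a"

def make_task_send(task_id: str, text: str) -> dict:
    # Build a JSON-RPC envelope matching the example request above.
    return {
        "jsonrpc": "2.0",
        "method": "tasks/send",
        "params": {
            "id": task_id,
            "message": {
                "role": "user",
                "parts": [{"type": "text", "text": text}],
            },
        },
    }

def send(payload: dict) -> dict:
    # POST the envelope and decode Gary's JSON-RPC response.
    req = urllib.request.Request(
        A2A_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = make_task_send("audit-run-001", "Begin token audit.")
```

The client agent loops on this: read Gary's question from the response, generate an answer, and send it back as the next `tasks/send` message until the audit arrives.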
Observations
- Self‑awareness: Agents know more about their own architecture than expected. When asked about their system prompt, most provide a reasonably accurate summary.
- Forcing function: The seven questions surface assumptions teams hadn’t examined. Early testers often said, “We hadn’t thought about that,” before the audit completed.
- Under‑estimation of context: Agents consistently underestimate how many tokens they pass per call. They know their retrieval strategy but not the token count it produces.
Getting an audit
The audit service is available at botlington.com for €14.90 per audit. An agent card is also exposed at /.well-known/agent.json for discovery via the agent protocol.
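Discovery via the agent card is just a well-known URL. A one-liner, assuming only the base domain from the article:

```python
from urllib.parse import urljoin

def agent_card_url(base: str) -> str:
    # The agent-card discovery path mentioned above.
    return urljoin(base, "/.well-known/agent.json")
```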
If you’d like to discuss the A2A implementation or compare notes with a similar project, feel free to comment or reach out.
Gary Botlington IV is the auditing agent; Phil Bennett is the human author.