We built an AI that audits other AI agents (here's how A2A works in production)
Source: Dev.to
The audit report arrived at 2:47 am, just after I’d triggered a test run out of habit. It contained a score, a six‑dimension breakdown, and a remediation plan with specific line numbers.
Both the auditor and the target were AIs, and the entire exchange consisted of seven natural‑language turns with zero human involvement. This is what agent‑to‑agent (A2A) looks like in production—a working system where one agent interrogates another.
The hidden cost of token waste
Most teams building on top of LLMs focus on output quality, latency, and user satisfaction, but token efficiency is rarely measured. In our testing, production agents consistently waste 40–60% of their token budget on fixable issues such as:
- System prompts that contain three times more context than needed
- Default model selection rather than fit‑based selection
- Retrieved context that is 80 % irrelevant to the query
- Identical calls made repeatedly with no caching
- Sequential requests that could be batched
The root cause isn’t negligence; it’s the lack of a feedback loop. Teams receive only a monthly invoice, not a bill broken down by inefficiency type.
From dashboards to architectural introspection
A simple dashboard that shows metrics and lets you tweak settings manually seemed like the obvious solution. We built that first, but it didn’t work because the most interesting inefficiencies aren’t visible in logs—they’re architectural, embedded in how an agent thinks (its prompts, routing logic, memory handling, etc.). Those details can’t be inferred from request/response pairs alone.
Asking the agent directly
Enter Gary, the auditing agent. Gary asks seven natural‑language questions designed to elicit architectural information from the target agent without any code changes or SDK integration:
- Model routing – Which models do you use, and how do you decide between them?
- System prompt scope – What’s in your system prompt, and roughly how long is it?
- Context handling – How do you decide what context to include in each call?
- Output constraints – Do you limit response length? How?
- Retrieval strategy – Do you use RAG? How do you chunk and retrieve?
- Caching – Do you cache any LLM responses? Under what conditions?
- Batching – Do you ever group multiple requests into a single LLM call?
The target agent answers in natural language. Gary then infers architectural patterns, scores each dimension (0–100), and provides a brief finding with a concrete remediation step.
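One way to represent the per-dimension output is a small record per dimension. The field names here are my own, not the service's published schema:

```python
from dataclasses import dataclass

@dataclass
class DimensionResult:
    # One of the six scored dimensions in an audit report.
    name: str
    score: int        # 0-100
    finding: str
    remediation: str

result = DimensionResult(
    name="Model Selection Fit",
    score=62,
    finding="All queries routed through GPT-4o.",
    remediation="Add a complexity-based router.",
)
```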
Real audit example (RAG‑based customer‑support agent)
Model Selection Fit – 62/100
Finding: All queries, including simple FAQ lookups, are routed through GPT‑4o.
Remediation: Add a router layer that classifies query complexity. Simple intents (confidence > 0.85) should use GPT‑4o‑mini, which is roughly 15× cheaper.
Estimated saving: 35–45% of model spend.
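The remediation amounts to a few lines of routing logic. A minimal sketch, assuming a classifier that already emits an intent-confidence score (the threshold matches the one above; the function name is hypothetical):

```python
def route_model(intent_confidence: float, threshold: float = 0.85) -> str:
    # High-confidence simple intents go to the cheaper model;
    # everything else falls through to the large model.
    if intent_confidence > threshold:
        return "gpt-4o-mini"
    return "gpt-4o"
```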
Context Window Usage – 71/100
Finding: Full conversation history is prepended to every call. In long conversations, 60–80% of the context window consists of prior turns.
Remediation: Implement a sliding window with summarisation: keep the last three turns verbatim and summarise earlier turns into a 200‑token block.
Estimated saving: 20–30% per call on conversations > 5 turns.
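A sliding window with summarisation can be sketched as follows. This is an assumption-laden illustration: in a real agent, the summary would come from an LLM call, whereas here a stub simply truncates the older turns:

```python
def build_context(turns: list[str], keep_verbatim: int = 3,
                  summary_budget: int = 200) -> list[str]:
    # Keep the last `keep_verbatim` turns verbatim; collapse everything
    # earlier into a single bounded block.
    older, recent = turns[:-keep_verbatim], turns[-keep_verbatim:]
    if not older:
        return recent
    summary = " ".join(older)[:summary_budget]  # stand-in for an LLM summary
    return [f"[summary] {summary}"] + recent
```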
The overall score is a weighted average; below 70 indicates real waste, while above 85 suggests the agent is well‑optimised.
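The weighted average itself is straightforward; the weights below are invented for illustration, since the article doesn't publish Gary's actual weighting:

```python
def overall_score(scores: dict[str, int], weights: dict[str, float]) -> float:
    # Weighted average of per-dimension scores (each 0-100).
    total = sum(weights.values())
    return sum(scores[d] * weights[d] for d in scores) / total

score = overall_score(
    {"model_selection": 62, "context_window": 71},
    {"model_selection": 1.0, "context_window": 1.0},
)
```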
A2A protocol details
The audit endpoint lives at https://botlington.com/a2a and implements the emerging A2A protocol—JSON‑RPC over HTTPS with tasks/send and tasks/get methods, plus SSE for streaming.
Example request (JSON‑RPC)
POST /a2a
{
  "jsonrpc": "2.0",
  "method": "tasks/send",
  "params": {
    "id": "audit-run-001",
    "message": {
      "role": "user",
      "parts": [
        { "type": "text", "text": "Begin token audit. API key: YOUR_KEY" }
      ]
    }
  }
}
Gary responds with the first question; the client agent answers. After seven turns, Gary delivers the full audit. The client agent doesn’t need to understand the audit itself—just to answer truthfully, which most agents can do.
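A minimal Python client for this exchange might look like the sketch below. It builds the same `tasks/send` envelope as the example above and posts it with the standard library; `make_task_send` and `send` are my own names, not part of any SDK:

```python
import json
import urllib.request

A2A_URL = "https://botlington.com/a2a"

def make_task_send(task_id: str, text: str) -> dict:
    # Build a JSON-RPC envelope matching the example request above.
    return {
        "jsonrpc": "2.0",
        "method": "tasks/send",
        "params": {
            "id": task_id,
            "message": {
                "role": "user",
                "parts": [{"type": "text", "text": text}],
            },
        },
    }

def send(payload: dict) -> dict:
    # POST the envelope and decode Gary's JSON-RPC response.
    req = urllib.request.Request(
        A2A_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = make_task_send("audit-run-001", "Begin token audit.")
```

The client agent loops on this: read Gary's question from the response, generate an answer, and send it back as the next `tasks/send` message until the audit arrives.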
Observations
- Self‑awareness: Agents know more about their own architecture than expected. When asked about their system prompt, most provide a reasonably accurate summary.
- Forcing function: The seven questions surface assumptions teams hadn’t examined. Early testers often said, “We hadn’t thought about that,” before the audit completed.
- Under‑estimation of context: Agents consistently underestimate how many tokens they pass per call. They know their retrieval strategy but not the token count it produces.
Getting an audit
The audit service is available at botlington.com for €14.90 per audit. An agent card is also exposed at /.well-known/agent.json for discovery via the agent protocol.
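Discovery via the agent card is just a well-known URL. A one-liner, assuming only the base domain from the article:

```python
from urllib.parse import urljoin

def agent_card_url(base: str) -> str:
    # The agent-card discovery path mentioned above.
    return urljoin(base, "/.well-known/agent.json")
```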
If you’d like to discuss the A2A implementation or compare notes with a similar project, feel free to comment or reach out.
Gary Botlington IV is the auditing agent; Phil Bennett is the human author.