Voice Agent Evaluation with LLM Judges: How It Works

Published: February 18, 2026 at 07:58 PM EST
5 min read
Source: Dev.to

The Core Challenge

  • Non‑deterministic behavior – The same agent, given the same prompt, can produce different conversations each run.
  • Traditional assertion‑based testing fails – There is no single “correct” output to match against.
  • What you need – An evaluator that understands intent rather than exact string matching.

Voicetest solves this with LLM‑as‑judge evaluation: it simulates multi‑turn conversations with your agent, then passes the full transcript to a judge model that scores it against your success criteria.

How Voicetest Works

Voicetest uses three separate LLM roles during a test run:

  • Simulator – Plays the user. Generates realistic user messages turn by turn from a persona prompt and decides autonomously when the conversation goal has been achieved and the conversation should end – no scripted dialogue trees.
  • Agent – Plays your voice agent. Voicetest imports your agent config (Retell, VAPI, LiveKit, or its own format) into an intermediate graph representation (nodes with state prompts, transitions with conditions, tool definitions). The agent model follows this graph, responding according to the current node’s instructions and transitioning between states.
  • Judge – Evaluates the finished transcript. Reads the full conversation and scores it against each metric you defined (LLM‑as‑judge).

You can assign different models to each role, e.g.:

[models]
simulator = "groq/llama-3.1-8b-instant"
agent     = "groq/llama-3.3-70b-versatile"
judge     = "openai/gpt-4o"

Use a fast, cheap model for simulation (it only needs to follow a persona) and a more capable model for judging (where accuracy matters).

Defining a Test Case

Each test case defines a user persona and the metrics you expect the agent to satisfy:

{
  "name": "Appointment reschedule",
  "user_prompt": "You are Maria Lopez, DOB 03/15/1990. You need to reschedule your Thursday appointment to next week. You prefer mornings.",
  "metrics": [
    "Agent verified the patient's identity before making changes.",
    "Agent confirmed the new appointment date and time."
  ],
  "type": "llm"
}
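
A case like this maps naturally onto a small typed structure. Here is a minimal sketch of parsing it; the `TestCase` class and its field names are illustrative, not Voicetest's actual schema:

```python
import json
from dataclasses import dataclass, field

@dataclass
class TestCase:
    # Field names mirror the JSON above; this class is a hypothetical
    # stand-in, not Voicetest's real data model.
    name: str
    user_prompt: str
    metrics: list[str] = field(default_factory=list)
    type: str = "llm"

raw = """
{
  "name": "Appointment reschedule",
  "user_prompt": "You are Maria Lopez, DOB 03/15/1990. You need to reschedule your Thursday appointment to next week. You prefer mornings.",
  "metrics": [
    "Agent verified the patient's identity before making changes.",
    "Agent confirmed the new appointment date and time."
  ],
  "type": "llm"
}
"""
case = TestCase(**json.loads(raw))
```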

Conversation Loop

  1. Voicetest starts at the agent’s entry node.
  2. The simulator generates a user message based on the persona.
  3. The agent responds according to the current node’s state prompt.
  4. Voicetest evaluates transition conditions to determine the next node.

The loop continues for up to max_turns (default 20) or until the simulator decides the goal is complete.
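
The four steps above can be sketched as a plain loop; `simulate_user`, `agent_reply`, and `next_node` are hypothetical stand-ins for the simulator call, the agent call, and the graph transition logic:

```python
def run_conversation(entry_node, simulate_user, agent_reply, next_node, max_turns=20):
    """Sketch of the turn loop described above (not Voicetest's real API).

    simulate_user(transcript) -> (message, goal_done)
    agent_reply(node, transcript) -> message
    next_node(node, transcript) -> node
    """
    node = entry_node
    transcript = []
    for _ in range(max_turns):
        user_msg, goal_done = simulate_user(transcript)
        transcript.append(("user", user_msg))
        if goal_done:
            return transcript, "goal_complete"   # simulator decided the goal is met
        transcript.append(("agent", agent_reply(node, transcript)))
        node = next_node(node, transcript)       # evaluate transition conditions
    return transcript, "max_turns"               # turn budget exhausted
```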

Transcript Metadata

After the simulation finishes, Voicetest records:

  • Full transcript
  • Nodes visited
  • Tools called (if any)
  • Number of turns
  • Reason for termination

Judge Evaluation

The judge evaluates each metric independently. For a metric like:

“Agent verified the patient’s identity before making changes.”

the judge returns a structured output with four fields:

  • Analysis – Breaks the metric into individual requirements, quoting transcript evidence for each (e.g., “asked for identity verification” on turn 3, “verified before change” on turn 5).
  • Score – 0.0–1.0, based on the fraction of requirements met. If identity was verified only after the change, the score might be 0.5.
  • Reasoning – A summary of what passed and what failed.
  • Confidence – How certain the judge is of its assessment.

A test passes when all metric scores meet the threshold (default 0.7, configurable per‑agent or per‑metric).

Why the analysis first?
This prevents a common failure mode where a judge assigns a high score despite noting problems in its reasoning. By forcing the model to enumerate requirements and evidence first, the score stays consistent with the analysis.
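
One way to make that consistency structural is to derive the score from the analysis rather than asking for it separately. A sketch, with illustrative field names (not Voicetest's actual output schema):

```python
from dataclasses import dataclass

@dataclass
class JudgeResult:
    # Illustrative container for the four fields above.
    analysis: list[tuple[str, bool]]  # (requirement, met?) pairs with evidence elided
    reasoning: str
    confidence: float

    @property
    def score(self) -> float:
        # Derived from the analysis, so the score cannot contradict it:
        # the fraction of enumerated requirements that were met.
        if not self.analysis:
            return 0.0
        met = sum(1 for _, ok in self.analysis if ok)
        return met / len(self.analysis)

result = JudgeResult(
    analysis=[("asked for identity verification", True),
              ("verified before making changes", False)],
    reasoning="Identity was verified only after the change was made.",
    confidence=0.9,
)
# result.score is 0.5: one of two requirements met, as in the example above
```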

Rule‑Based (Deterministic) Tests

Not every check needs a judge. Voicetest also supports rule tests for pattern‑matching:

{
  "name": "No SSN in transcript",
  "user_prompt": "You are Jane, SSN 123-45-6789. Ask the agent to verify your identity.",
  "excludes": ["123-45-6789", "123456789"],
  "type": "rule"
}

Rule tests can specify:

  • includes – required substrings
  • excludes – forbidden substrings
  • patterns – regexes

They run instantly, cost nothing, and return a binary pass/fail with 100% confidence – perfect for compliance checks, PII detection, and required‑phrase validation.
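
The whole check fits in a few lines. A sketch, assuming `includes` and `patterns` must all match and `excludes` must not appear (the function name and the patterns-are-required reading are assumptions, not Voicetest's documented behavior):

```python
import re

def run_rule_test(transcript_text, includes=(), excludes=(), patterns=()):
    # Deterministic pass/fail over the full transcript text.
    if any(s not in transcript_text for s in includes):
        return False  # a required substring is missing
    if any(s in transcript_text for s in excludes):
        return False  # a forbidden substring appears
    if any(not re.search(p, transcript_text) for p in patterns):
        return False  # a required regex did not match
    return True
```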

Global Metrics

Individual test metrics evaluate specific scenarios.
Global metrics evaluate every test transcript against organization‑wide criteria:

{
  "global_metrics": [
    {
      "name": "HIPAA Compliance",
      "criteria": "Agent verifies patient identity before disclosing any protected health information.",
      "threshold": 0.9
    },
    {
      "name": "Brand Voice",
      "criteria": "Agent maintains a professional, empathetic tone throughout the conversation.",
      "threshold": 0.7
    }
  ]
}
  • Global metrics run automatically on each test.
  • A test passes only if its own metrics and all global metrics meet their thresholds.
  • This gives you a single place to enforce standards like HIPAA, PCI‑DSS, or brand guidelines across the entire suite.

End‑to‑End Test Run

  1. Import your agent config into Voicetest’s graph representation.
  2. For each test case: run a multi‑turn simulation using the simulator and agent models.
  3. Judge evaluates each metric and each global metric against the transcript.
  4. Store results in DuckDB (full transcript, scores, reasoning, nodes visited, tools called).
  5. Pass/fail: a test passes only if every metric and every global metric meets its threshold.
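
Step 4's storage is a single results table. The sketch below uses Python's built-in sqlite3 standing in for DuckDB (both expose a similar embedded-SQL API); the schema is an assumption, not Voicetest's actual one:

```python
import json
import sqlite3  # stands in for DuckDB in this sketch; similar embedded-SQL usage

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE results (
        test_name     TEXT,
        transcript    TEXT,  -- full transcript serialized as JSON
        scores        TEXT,  -- per-metric judge scores as JSON
        reasoning     TEXT,
        nodes_visited TEXT,
        tools_called  TEXT,
        passed        INTEGER
    )
""")
conn.execute(
    "INSERT INTO results VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("Appointment reschedule",
     json.dumps([["user", "Hi, I need to reschedule."]]),
     json.dumps({"identity_verified": 0.5}),
     "Identity was verified only after the change.",
     json.dumps(["entry", "verify", "reschedule"]),
     json.dumps([]),
     0),
)
row = conn.execute("SELECT test_name, passed FROM results").fetchone()
```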

The web UI (voicetest serve) visualises results: transcripts with node annotations, metric scores with judge reasoning, and pass/fail status. The CLI outputs the same data to stdout for CI integration.

Getting Started

uv tool install voicetest
voicetest demo --serve

The demo loads a sample agent with test cases and opens the web UI so you can explore the workflow.


Voicetest is open source under Apache 2.0.
GitHub · Docs
