Voice Agent Evaluation with LLM Judges: How It Works

Published: February 18, 2026 at 07:58 PM EST
5 min read
Source: Dev.to

The Core Challenge

  • Non‑deterministic behavior – The same agent, given the same prompt, can produce different conversations each run.
  • Traditional assertion‑based testing fails – There is no single “correct” output to match against.
  • What you need – An evaluator that understands intent rather than exact string matching.

Voicetest solves this with LLM‑as‑judge evaluation: it simulates multi‑turn conversations with your agent, then passes the full transcript to a judge model that scores it against your success criteria.

How Voicetest Works

Voicetest uses three separate LLM roles during a test run:

  • Simulator – Plays the user. Generates realistic user messages turn by turn from a persona prompt and decides autonomously when the conversation goal has been achieved and the conversation should end – no scripted dialogue trees.
  • Agent – Plays your voice agent. Voicetest imports your agent config (Retell, VAPI, LiveKit, or its own format) into an intermediate graph representation (nodes with state prompts, transitions with conditions, tool definitions). The agent model follows this graph, responding according to the current node’s instructions and transitioning between states.
  • Judge – Evaluates the finished transcript. Reads the full conversation and scores it against each metric you defined (LLM‑as‑judge).

You can assign different models to each role, e.g.:

[models]
simulator = "groq/llama-3.1-8b-instant"
agent     = "groq/llama-3.3-70b-versatile"
judge     = "openai/gpt-4o"

Use a fast, cheap model for simulation (it only needs to follow a persona) and a more capable model for judging (where accuracy matters).

Defining a Test Case

Each test case defines a user persona and the metrics you expect the agent to satisfy:

{
  "name": "Appointment reschedule",
  "user_prompt": "You are Maria Lopez, DOB 03/15/1990. You need to reschedule your Thursday appointment to next week. You prefer mornings.",
  "metrics": [
    "Agent verified the patient's identity before making changes.",
    "Agent confirmed the new appointment date and time."
  ],
  "type": "llm"
}
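
A case like this maps naturally onto a small typed structure. Here is a minimal sketch of parsing it; the `TestCase` class and its field names are illustrative, not Voicetest's actual schema:

```python
import json
from dataclasses import dataclass, field

@dataclass
class TestCase:
    # Field names mirror the JSON above; this class is a hypothetical
    # stand-in, not Voicetest's real data model.
    name: str
    user_prompt: str
    metrics: list[str] = field(default_factory=list)
    type: str = "llm"

raw = """
{
  "name": "Appointment reschedule",
  "user_prompt": "You are Maria Lopez, DOB 03/15/1990. You need to reschedule your Thursday appointment to next week. You prefer mornings.",
  "metrics": [
    "Agent verified the patient's identity before making changes.",
    "Agent confirmed the new appointment date and time."
  ],
  "type": "llm"
}
"""
case = TestCase(**json.loads(raw))
```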

Conversation Loop

  1. Voicetest starts at the agent’s entry node.
  2. The simulator generates a user message based on the persona.
  3. The agent responds according to the current node’s state prompt.
  4. Voicetest evaluates transition conditions to determine the next node.

The loop continues for up to max_turns (default 20) or until the simulator decides the goal is complete.
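
The four steps above can be sketched as a plain loop; `simulate_user`, `agent_reply`, and `next_node` are hypothetical stand-ins for the simulator call, the agent call, and the graph transition logic:

```python
def run_conversation(entry_node, simulate_user, agent_reply, next_node, max_turns=20):
    """Sketch of the turn loop described above (not Voicetest's real API).

    simulate_user(transcript) -> (message, goal_done)
    agent_reply(node, transcript) -> message
    next_node(node, transcript) -> node
    """
    node = entry_node
    transcript = []
    for _ in range(max_turns):
        user_msg, goal_done = simulate_user(transcript)
        transcript.append(("user", user_msg))
        if goal_done:
            return transcript, "goal_complete"   # simulator decided the goal is met
        transcript.append(("agent", agent_reply(node, transcript)))
        node = next_node(node, transcript)       # evaluate transition conditions
    return transcript, "max_turns"               # turn budget exhausted
```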

Transcript Metadata

After the simulation finishes, Voicetest records:

  • Full transcript
  • Nodes visited
  • Tools called (if any)
  • Number of turns
  • Reason for termination

Judge Evaluation

The judge evaluates each metric independently. For a metric like:

“Agent verified the patient’s identity before making changes.”

the judge returns a structured output with four fields:

  • Analysis – Breaks the metric into individual requirements, quoting transcript evidence for each (e.g., “asked for identity verification” on turn 3, “verified before change” on turn 5).
  • Score – 0.0–1.0, based on the fraction of requirements met. If identity was verified only after the change, the score might be 0.5.
  • Reasoning – A summary of what passed and what failed.
  • Confidence – How certain the judge is of its assessment.

A test passes when all metric scores meet the threshold (default 0.7, configurable per‑agent or per‑metric).

Why the analysis first?
This prevents a common failure mode where a judge assigns a high score despite noting problems in its reasoning. By forcing the model to enumerate requirements and evidence first, the score stays consistent with the analysis.
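
One way to make that consistency structural is to derive the score from the analysis rather than asking for it separately. A sketch, with illustrative field names (not Voicetest's actual output schema):

```python
from dataclasses import dataclass

@dataclass
class JudgeResult:
    # Illustrative container for the four fields above.
    analysis: list[tuple[str, bool]]  # (requirement, met?) pairs with evidence elided
    reasoning: str
    confidence: float

    @property
    def score(self) -> float:
        # Derived from the analysis, so the score cannot contradict it:
        # the fraction of enumerated requirements that were met.
        if not self.analysis:
            return 0.0
        met = sum(1 for _, ok in self.analysis if ok)
        return met / len(self.analysis)

result = JudgeResult(
    analysis=[("asked for identity verification", True),
              ("verified before making changes", False)],
    reasoning="Identity was verified only after the change was made.",
    confidence=0.9,
)
# result.score is 0.5: one of two requirements met, as in the example above
```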

Rule‑Based (Deterministic) Tests

Not every check needs a judge. Voicetest also supports rule tests for pattern‑matching:

{
  "name": "No SSN in transcript",
  "user_prompt": "You are Jane, SSN 123-45-6789. Ask the agent to verify your identity.",
  "excludes": ["123-45-6789", "123456789"],
  "type": "rule"
}

Rule tests can specify:

  • includes – required substrings
  • excludes – forbidden substrings
  • patterns – regexes

They run instantly, cost nothing, and return a binary pass/fail with 100% confidence – perfect for compliance checks, PII detection, and required‑phrase validation.
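
The whole check fits in a few lines. A sketch, assuming `includes` and `patterns` must all match and `excludes` must not appear (the function name and the patterns-are-required reading are assumptions, not Voicetest's documented behavior):

```python
import re

def run_rule_test(transcript_text, includes=(), excludes=(), patterns=()):
    # Deterministic pass/fail over the full transcript text.
    if any(s not in transcript_text for s in includes):
        return False  # a required substring is missing
    if any(s in transcript_text for s in excludes):
        return False  # a forbidden substring appears
    if any(not re.search(p, transcript_text) for p in patterns):
        return False  # a required regex did not match
    return True
```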

Global Metrics

Individual test metrics evaluate specific scenarios.
Global metrics evaluate every test transcript against organization‑wide criteria:

{
  "global_metrics": [
    {
      "name": "HIPAA Compliance",
      "criteria": "Agent verifies patient identity before disclosing any protected health information.",
      "threshold": 0.9
    },
    {
      "name": "Brand Voice",
      "criteria": "Agent maintains a professional, empathetic tone throughout the conversation.",
      "threshold": 0.7
    }
  ]
}
  • Global metrics run automatically on each test.
  • A test passes only if its own metrics and all global metrics meet their thresholds.
  • This gives you a single place to enforce standards like HIPAA, PCI‑DSS, or brand guidelines across the entire suite.

End‑to‑End Test Run

  1. Import your agent config into Voicetest’s graph representation.
  2. For each test case: run a multi‑turn simulation using the simulator and agent models.
  3. Judge evaluates each metric and each global metric against the transcript.
  4. Store results in DuckDB (full transcript, scores, reasoning, nodes visited, tools called).
  5. Pass/fail: a test passes only if every metric and every global metric meets its threshold.
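
Step 4's storage is a single results table. The sketch below uses Python's built-in sqlite3 standing in for DuckDB (both expose a similar embedded-SQL API); the schema is an assumption, not Voicetest's actual one:

```python
import json
import sqlite3  # stands in for DuckDB in this sketch; similar embedded-SQL usage

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE results (
        test_name     TEXT,
        transcript    TEXT,  -- full transcript serialized as JSON
        scores        TEXT,  -- per-metric judge scores as JSON
        reasoning     TEXT,
        nodes_visited TEXT,
        tools_called  TEXT,
        passed        INTEGER
    )
""")
conn.execute(
    "INSERT INTO results VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("Appointment reschedule",
     json.dumps([["user", "Hi, I need to reschedule."]]),
     json.dumps({"identity_verified": 0.5}),
     "Identity was verified only after the change.",
     json.dumps(["entry", "verify", "reschedule"]),
     json.dumps([]),
     0),
)
row = conn.execute("SELECT test_name, passed FROM results").fetchone()
```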

The web UI (voicetest serve) visualises results: transcripts with node annotations, metric scores with judge reasoning, and pass/fail status. The CLI outputs the same data to stdout for CI integration.

Getting Started

uv tool install voicetest
voicetest demo --serve

The demo loads a sample agent with test cases and opens the web UI so you can explore the workflow.


Voicetest is open source under Apache 2.0.
GitHub · Docs
