Why Agent Testing Is Broken
Source: Dev.to
And what to do about it.
Software testing has been solved for decades. You write a function, you assert its output, your CI turns green, you ship. The contract is clear: same input, same output, always.
LLM agents broke this contract completely — and most teams haven’t noticed yet.
- Ask your agent “summarize this contract” today and get a good response.
- Ask it again tomorrow after a model update, a prompt tweak, or a context‑window change, and get something subtly different. Not wrong, exactly. Just… different. Different enough that the downstream system parsing it breaks silently at 2 am.
This is not a hypothetical. It’s happening in production right now at companies that thought they were shipping stable systems.
Why the failure mode is insidious
- No exceptions – the agent responds. It always responds. The response is even plausible. The failure is semantic, not syntactic.
- Not reproducible on demand – you can’t `git bisect` a drift in model behavior. The model didn’t change — your prompts did, or the model got a silent update from your API provider, or the context you’re injecting shifted.
- Existing tests don’t catch it – unit tests mock the LLM entirely. Integration tests check that the API call completes. Neither checks whether the content of the response still satisfies your downstream expectations.
- No regression suite for cognition – you’re flying blind.
Traditional software is deterministic. LLMs are stochastic systems operating on learned representations of language. When you update a model, you’re not patching a function — you’re shifting a distribution.
A 3 % shift in how Claude‑3.5 vs Claude‑4 responds to a legal‑summarization prompt might be invisible in manual review and catastrophic in a pipeline that expects the word “termination” to appear in every output.
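That kind of anchor can be checked mechanically. A minimal sketch, assuming outputs are plain strings — the term list, function name, and sample outputs below are illustrative, not from any real pipeline:

```python
# Containment check: assert that semantic anchor terms survive a model
# update. All names and sample outputs here are illustrative.

REQUIRED_TERMS = ["termination", "liability"]

def contains_anchors(output: str, required: list[str]) -> list[str]:
    """Return the anchor terms missing from an agent output."""
    lowered = output.lower()
    return [term for term in required if term not in lowered]

# Output from the known-good version: both anchors present.
old_output = "Key points: liability is capped; termination on 30 days notice."
assert contains_anchors(old_output, REQUIRED_TERMS) == []

# Output after a subtle drift: still plausible, but an anchor is gone.
new_output = "Key points: liability is capped; the deal may end after 30 days."
assert contains_anchors(new_output, REQUIRED_TERMS) == ["termination"]
```

The second output is perfectly readable to a human reviewer, which is exactly why only an automated check catches it.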
The industry’s response has been to add more evals — elaborate human‑preference datasets, MMLU benchmarks, red‑team suites. These are valuable for model builders but are nearly useless for application developers.
The question application developers need answered is not “is this model generally capable?” It’s “does this model, with my specific prompts, in my specific context, still produce outputs my system can rely on?”
That question has no good answer today.
A real pattern seen across teams shipping LLM applications
| Month | Event |
|---|---|
| 1 | Team writes prompts, ships agent, manually verifies outputs look good. |
| 2 | Someone tweaks a system prompt “slightly” to improve tone. Three downstream parsers start failing intermittently. |
| 3 | The model provider silently updates the model behind the same API endpoint. Response format drifts by 15 %. The agent still works in demos. |
| 4 | A customer reports that summarized contracts are missing liability clauses. Post‑mortem reveals the issue started in month 2. Nobody noticed because there were no behavioral tests. |
This is the norm, not the exception.
A New Way to Think About Agent Outputs
Stop treating agent outputs as function return values. Think of them as documents produced by a probabilistic process with a behavioral contract.
The contract: given this class of inputs, the output must satisfy these structural and semantic properties.
Testing that contract requires:
- Baseline capture – Run your scenarios against a known‑good version of the system and record the outputs. This is your behavioral fingerprint.
- Containment checks – Define what must appear in every output. Not the exact text (that would fail on every run) but the semantic anchors: key terms, required sections, structural elements.
- Drift detection – Compare new outputs against your baseline. When similarity drops below your tolerance threshold, fail the build. Let the engineer decide if the change is intentional.
- CI integration – Run this on every push, on every model‑version change, on every prompt edit. The same way you run unit tests.
This is not complicated. It’s just not being done.
The Tooling Gap
Existing evaluation frameworks (RAGAS, LangSmith, etc.) tend to be one or more of the following:
- Coupled to specific frameworks (LangChain, etc.)
- Focused on RAG quality metrics rather than behavioral regression
- Dependent on hosted infrastructure and accounts
- Too complex to add to a CI pipeline in an afternoon
What the market needs is a pytest for agents: lightweight, composable, runs locally, zero‑infrastructure, exits with code 1 when behavior breaks.
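To make that concrete, here is what such a test could look like as an ordinary pytest-style test. `call_agent` is a hypothetical stub standing in for your real model call; nothing else here is special — a plain failing assertion is what makes CI exit nonzero:

```python
# Hypothetical "pytest for agents" test. `call_agent` is a stub; in a
# real suite it would wrap your actual model API call.

def call_agent(prompt: str) -> str:
    # Stand-in for a real API call (Anthropic, OpenAI, local model, ...).
    return "- The Contractor carries liability.\n- Termination upon 30 days notice."

def test_summarize_contract_contains_anchors():
    output = call_agent("Summarize this contract clause in 5 bullet points: ...")
    lowered = output.lower()
    for anchor in ("liability", "termination"):
        assert anchor in lowered, f"missing semantic anchor: {anchor}"
```

Because it is just pytest, it composes with everything else in your CI: markers, parallelism, reporting.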
Minimum‑Viable Interface Example
```yaml
# scenarios/summarize_contract.yaml
name: summarize_contract
input: |
  Summarize this contract clause in 5 bullet points:
  "...The Contractor shall indemnify...termination upon 30 days notice..."
expected_contains:
  - liability
  - termination
max_tokens: 512
```
```shell
# Run against real model, compare to baseline
agentprobe run scenarios/ \
  --backend anthropic \
  --baseline baseline.json \
  --tolerance 0.8
```
```
✓ PASS summarize_contract
✗ FAIL extract_parties
  Drift detected: similarity 0.61
```
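Wiring such a tool into CI could be a single workflow step. The snippet below is purely illustrative: it assumes the hypothetical `agentprobe` CLI from the example above exists and is installable from PyPI under that name.

```yaml
# .github/workflows/agent-tests.yml -- illustrative sketch
name: agent-behavioral-tests
on: [push]
jobs:
  agent-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install agentprobe   # hypothetical package name
      - run: |
          agentprobe run scenarios/ \
            --backend anthropic \
            --baseline baseline.json \
            --tolerance 0.8
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

A nonzero exit code from the CLI fails the job, which is all the CI integration the contract requires.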
Agent testing is broken because nobody built the right tool yet. That’s a solvable problem.