I built a pytest-like tool for AI agents because 'it passed once' isn't good enough
Source: Dev.to
Introduction
You know that feeling when your AI agent works perfectly in development, then randomly breaks in production? Same prompt, same model, different results. I spent way too much time debugging agents that sometimes failed. The worst part wasn’t the failures—it was not knowing why. Was it the tool selection? The prompt? The model having a bad day?
Existing evaluation tools didn’t help much. They run your test once, check the output, and stop. But agents aren’t deterministic. Running a test once tells you almost nothing.
agentrial
I built agentrial, a pytest‑like framework for AI agents that runs each test multiple times and provides real statistics.
Installation
pip install agentrial
Configuration (agentrial.yml)
suite: my-agent
agent: my_module.agent
trials: 10
threshold: 0.85
cases:
  - name: basic-math
    input:
      query: "What is 15 * 37?"
    expected:
      output_contains: ["555"]
      tool_calls:
        - tool: calculate
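For context, the agent: my_module.agent line points at a Python object agentrial can load. The post doesn't spell out the adapter contract, so here is a hypothetical sketch of what a minimal LangGraph module behind that config might look like; the module layout, model choice, and tool are my assumptions, not agentrial's documented API:

# my_module.py: a hypothetical LangGraph agent exposing a `calculate` tool
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

@tool
def calculate(expression: str) -> str:
    """Evaluate a basic arithmetic expression such as '15 * 37'."""
    return str(eval(expression))  # demo only; never eval untrusted input

agent = create_react_agent(ChatOpenAI(model="gpt-4o-mini"), tools=[calculate])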
Running Tests
agentrial run
Output
┌──────────────────────┬────────┬──────────────┬──────────┐
│ Test Case            │ Pass   │ 95% CI       │ Avg Cost │
├──────────────────────┼────────┼──────────────┼──────────┤
│ easy-multiply        │ 100.0% │ 72.2%-100.0% │ $0.0005  │
│ medium-population    │ 90.0%  │ 59.6%-98.2%  │ $0.0006  │
│ hard-multi-step      │ 70.0%  │ 39.7%-89.2%  │ $0.0011  │
└──────────────────────┴────────┴──────────────┴──────────┘
Confidence intervals
The “95% CI” column shows Wilson score intervals. With only 10 trials, an observed 100% pass rate really means the true pass rate is somewhere between 72% and 100% with 95% confidence. Seeing “100% (72-100%)” instead of just “100%” completely changed how I think about agent reliability.
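The math behind that column is just the Wilson score interval. A minimal sketch (not agentrial's internal code, but it reproduces the numbers in the table above):

import math

def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    # 95% Wilson score confidence interval for an observed pass rate.
    if trials == 0:
        return (0.0, 1.0)
    p = passes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, centre - margin), min(1.0, centre + margin))

print(wilson_interval(10, 10))  # ~(0.722, 1.0), i.e. the 72.2%-100.0% row above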
Step‑level failure attribution
When a test fails, agentrial tells you which step diverged:
Failures: medium-population (90% pass rate)
  Step 0 (tool_selection): called 'calculate' instead of 'lookup_country_info'
In my case, the agent was occasionally picking the wrong tool on ambiguous queries—a bug that would have taken hours to discover manually.
Real cost tracking
agentrial pulls token usage from the API response metadata. Running 100 trials across 10 test cases cost me only 6 cents total, and you get precise per-test cost estimates before scaling up.
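If you want to sanity-check those numbers yourself, the accounting is simple once you have the usage block from each response. A sketch assuming an OpenAI-style usage object and placeholder prices (swap in your model's real rates):

# Hypothetical per-trial cost from an OpenAI-style `usage` block.
PRICE_PER_1M = {"input": 0.15, "output": 0.60}  # placeholder $/1M token rates

def trial_cost(prompt_tokens: int, completion_tokens: int) -> float:
    # Dollar cost of one trial, given the token counts reported by the API.
    return (prompt_tokens * PRICE_PER_1M["input"]
            + completion_tokens * PRICE_PER_1M["output"]) / 1_000_000

print(f"${trial_cost(1200, 300):.4f}")  # e.g. $0.0004 for a small trial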
CI Integration
A GitHub Action can enforce reliability thresholds on every PR:
- uses: alepot55/agentrial@v0.1.4
  with:
    trials: 10
    threshold: 0.80
If the pass rate drops below 80%, the PR is blocked. This caught two regressions last week that would have shipped otherwise.
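Deciding whether a drop is a real regression rather than noise is a statistics question too. Here is a rough sketch of the kind of check involved (not agentrial's exact implementation): compare pass counts on main against the PR branch with Fisher's exact test; across many test cases, the resulting p-values would then get a Benjamini-Hochberg correction.

from scipy.stats import fisher_exact

def looks_like_regression(main_pass: int, main_n: int,
                          pr_pass: int, pr_n: int, alpha: float = 0.05) -> bool:
    # 2x2 table of pass/fail counts on main vs. the PR branch.
    table = [[main_pass, main_n - main_pass],
             [pr_pass, pr_n - pr_pass]]
    _, p_value = fisher_exact(table, alternative="greater")  # is main doing better?
    return p_value < alpha

print(looks_like_regression(10, 10, 4, 10))  # True: 100% -> 40% is a significant drop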
Current Limitations
- Supports LangGraph only at the moment; adapters for CrewAI and AutoGen are planned.
- CLI‑only; no graphical UI yet.
- No LLM‑as‑judge for semantic evaluation (coming later).
Open Source
agentrial is MIT-licensed and available at https://github.com/alepot55/agentrial (the same repository the GitHub Action above points to).
Closing Thoughts
I built the whole thing in about a week using Claude Code. The statistical components (Wilson intervals, Fisher's exact test for regression detection, Benjamini-Hochberg correction) were the fun part. If you're building agents and tired of "it works on my machine", give agentrial a try. I'm still figuring out which metrics are most useful, so feel free to share your ideas!