I built a pytest-like tool for AI agents because 'it passed once' isn't good enough
Source: Dev.to
Introduction
You know that feeling when your AI agent works perfectly in development, then randomly breaks in production? Same prompt, same model, different results. I spent way too much time debugging agents that sometimes failed. The worst part wasn’t the failures—it was not knowing why. Was it the tool selection? The prompt? The model having a bad day?
Existing evaluation tools didn’t help much. They run your test once, check the output, and stop. But agents aren’t deterministic. Running a test once tells you almost nothing.
agentrial
I built agentrial, a pytest‑like framework for AI agents that runs each test multiple times and provides real statistics.
Installation
pip install agentrial
Configuration (agentrial.yml)
suite: my-agent
agent: my_module.agent
trials: 10
threshold: 0.85
cases:
  - name: basic-math
    input:
      query: "What is 15 * 37?"
    expected:
      output_contains: ["555"]
      tool_calls:
        - tool: calculate
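For context, the agent: my_module.agent line points at a Python object agentrial can load. The post doesn't spell out the adapter contract, so here is a hypothetical sketch of what a minimal LangGraph module behind that config might look like; the module layout, model choice, and tool are my assumptions, not agentrial's documented API:

# my_module.py: a hypothetical LangGraph agent exposing a `calculate` tool
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

@tool
def calculate(expression: str) -> str:
    """Evaluate a basic arithmetic expression such as '15 * 37'."""
    return str(eval(expression))  # demo only; never eval untrusted input

agent = create_react_agent(ChatOpenAI(model="gpt-4o-mini"), tools=[calculate])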
Running Tests
agentrial run
Output
┌──────────────────────┬────────┬──────────────┬──────────┐
│ Test Case            │ Pass   │ 95% CI       │ Avg Cost │
├──────────────────────┼────────┼──────────────┼──────────┤
│ easy-multiply        │ 100.0% │ 72.2%-100.0% │ $0.0005  │
│ medium-population    │ 90.0%  │ 59.6%-98.2%  │ $0.0006  │
│ hard-multi-step      │ 70.0%  │ 39.7%-89.2%  │ $0.0011  │
└──────────────────────┴────────┴──────────────┴──────────┘
Confidence intervals
The “95% CI” column shows Wilson score intervals. With only 10 trials, an observed 100% pass rate really means the true pass rate is somewhere between 72% and 100% with 95% confidence. Seeing “100% (72-100%)” instead of just “100%” completely changed how I think about agent reliability.
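The math behind that column is just the Wilson score interval. A minimal sketch (not agentrial's internal code, but it reproduces the numbers in the table above):

import math

def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    # 95% Wilson score confidence interval for an observed pass rate.
    if trials == 0:
        return (0.0, 1.0)
    p = passes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, centre - margin), min(1.0, centre + margin))

print(wilson_interval(10, 10))  # ~(0.722, 1.0), i.e. the 72.2%-100.0% row above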
Step‑level failure attribution
When a test fails, agentrial tells you which step diverged:
Failures: medium-population (90% pass rate)
  Step 0 (tool_selection): called 'calculate' instead of 'lookup_country_info'
In my case, the agent was occasionally picking the wrong tool on ambiguous queries—a bug that would have taken hours to discover manually.
Real cost tracking
agentrial pulls token usage from the API response metadata. Running 100 trials across 10 test cases cost me only 6 cents total, and you get precise per-test cost estimates before scaling up.
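If you want to sanity-check those numbers yourself, the accounting is simple once you have the usage block from each response. A sketch assuming an OpenAI-style usage object and placeholder prices (swap in your model's real rates):

# Hypothetical per-trial cost from an OpenAI-style `usage` block.
PRICE_PER_1M = {"input": 0.15, "output": 0.60}  # placeholder $/1M token rates

def trial_cost(prompt_tokens: int, completion_tokens: int) -> float:
    # Dollar cost of one trial, given the token counts reported by the API.
    return (prompt_tokens * PRICE_PER_1M["input"]
            + completion_tokens * PRICE_PER_1M["output"]) / 1_000_000

print(f"${trial_cost(1200, 300):.4f}")  # e.g. $0.0004 for a small trial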
CI Integration
A GitHub Action can enforce reliability thresholds on every PR:
- uses: alepot55/agentrial@v0.1.4
  with:
    trials: 10
    threshold: 0.80
If the pass rate drops below 80%, the PR is blocked. This caught two regressions last week that would have shipped otherwise.
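Deciding whether a drop is a real regression rather than noise is a statistics question too. Here is a rough sketch of the kind of check involved (not agentrial's exact implementation): compare pass counts on main against the PR branch with Fisher's exact test; across many test cases, the resulting p-values would then get a Benjamini-Hochberg correction.

from scipy.stats import fisher_exact

def looks_like_regression(main_pass: int, main_n: int,
                          pr_pass: int, pr_n: int, alpha: float = 0.05) -> bool:
    # 2x2 table of pass/fail counts on main vs. the PR branch.
    table = [[main_pass, main_n - main_pass],
             [pr_pass, pr_n - pr_pass]]
    _, p_value = fisher_exact(table, alternative="greater")  # is main doing better?
    return p_value < alpha

print(looks_like_regression(10, 10, 4, 10))  # True: 100% -> 40% is a significant drop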
Current Limitations
- Supports LangGraph only at the moment; adapters for CrewAI and AutoGen are planned.
- CLI‑only; no graphical UI yet.
- No LLM‑as‑judge for semantic evaluation (coming later).
Open Source
agentrial is MIT-licensed and available at https://github.com/alepot55/agentrial (the same repository the GitHub Action above points to).
Closing Thoughts
I built the whole thing in about a week using Claude Code. The statistical components (Wilson intervals, Fisher's exact test for regression detection, Benjamini-Hochberg correction) were the fun part. If you're building agents and tired of "it works on my machine", give agentrial a try. I'm still figuring out which metrics are most useful, so feel free to share your ideas!