My AI agent cost me $400 overnight, so I built pytest for agents and open-sourced it

Published: December 8, 2025 at 04:52 AM EST
2 min read
Source: Dev.to

Background

At 2 am I was staring at my OpenAI dashboard, wondering how my bill jumped from $80 to $400 in a single day. After six months of running custom AI agents in production, I learned the hard way that agents that work perfectly on a local machine can betray you in production.

Introducing EvalView

I needed something dead simple: write down what the agent is supposed to do, run it, and fail the build if it does something stupid. The approach is almost embarrassingly obvious: describe the expected behavior in a YAML file and let the test framework enforce it.

Example Test

name: order lookup
input:
  query: "What's the status of order 12345?"
expected:
  tools:
    - get_order_status
thresholds:
  max_cost: 0.10

  • If the agent answers without calling get_order_status, the test fails.
  • If it costs more than 10 cents, the test fails.

A red error in CI breaks the build and blocks deployment.

Getting Started

pip install evalview

Quickstart

evalview quickstart

The quickstart spins up a tiny demo agent and runs some tests against it. It takes about fifteen seconds.

evalview run

Add the command to your CI pipeline to enforce guardrails automatically.
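
As a concrete sketch, here is what that could look like as a GitHub Actions job. The workflow layout, job name, and Python version are my own illustrative assumptions; only pip install evalview and evalview run come from the tool itself.

name: agent-tests
on: [push, pull_request]

jobs:
  evalview:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # Install EvalView and run the agent test suite; a failing test fails the job
      - run: pip install evalview
      - run: evalview run

If any test fails, the job goes red and whatever deploy step sits behind it never runs.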

Why It Matters

Before EvalView I averaged two or three angry user reports per deploy, and edge-case failures ate entire evenings of production debugging. EvalView works with LangGraph, CrewAI, OpenAI, Anthropic, and essentially any service reachable over HTTP.

Additional Features

  • LLM as judge – checks output quality beyond exact string matching.
  • Test generation from production logs – turn real failures into regression tests automatically (planned).
  • Comparison mode – test different agent versions or configurations side‑by‑side to see which performs better (planned).

Repository

The source code is available at:

If you’ve ever been embarrassed by an agent in production or felt physical pain opening a cloud bill, give EvalView a try. Even saving one late‑night debugging session is worth a star.

Call for Feedback

I’m curious about what others are doing for agent evaluation. Do you have an elaborate eval setup? Share your thoughts in the comments—I’m still figuring this out as I go.
