My AI agent cost me $400 overnight, so I built pytest for agents and open-sourced it

Published: December 8, 2025 at 04:52 AM EST
2 min read
Source: Dev.to

Background

At 2 am I was staring at my OpenAI dashboard, wondering how my bill jumped from $80 to $400 in a single day. After six months of running custom AI agents in production, I learned the hard way that agents that work perfectly on a local machine can betray you in production.

Introducing EvalView

I needed something dead simple: write down what the agent is supposed to do, run it, and fail the build if it does something stupid. The approach is almost embarrassingly obvious: describe the expected behavior in a YAML file and let the test framework enforce it.

Example Test

name: order lookup
input:
  query: "What's the status of order 12345?"
expected:
  tools:
    - get_order_status
thresholds:
  max_cost: 0.10

  • If the agent answers without calling get_order_status, the test fails.
  • If it costs more than 10 cents, the test fails.

A red error in CI breaks the build and blocks deployment.

Getting Started

pip install evalview

Quickstart

evalview quickstart

The quickstart spins up a tiny demo agent and runs some tests against it. It takes about fifteen seconds.

evalview run

Add the command to your CI pipeline to enforce guardrails automatically.
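
As a concrete sketch, here is what that could look like as a GitHub Actions job. The workflow layout, job name, and Python version are my own illustrative assumptions; only pip install evalview and evalview run come from the tool itself.

name: agent-tests
on: [push, pull_request]

jobs:
  evalview:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # Install EvalView and run the agent test suite; a failing test fails the job
      - run: pip install evalview
      - run: evalview run

If any test fails, the job goes red and whatever deploy step sits behind it never runs.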

Why It Matters

Before EvalView I averaged two or three angry user reports per deploy, and edge-case failures ate entire evenings of production debugging. EvalView works with LangGraph, CrewAI, OpenAI, Anthropic, and essentially any service reachable over HTTP.

Additional Features

  • LLM as judge – checks output quality beyond exact string matching.
  • Test generation from production logs – turn real failures into regression tests automatically (planned).
  • Comparison mode – test different agent versions or configurations side‑by‑side to see which performs better (planned).

Repository

The source code is available at:

If you’ve ever been embarrassed by an agent in production or felt physical pain opening a cloud bill, give EvalView a try. Even saving one late‑night debugging session is worth a star.

Call for Feedback

I’m curious about what others are doing for agent evaluation. Do you have an elaborate eval setup? Share your thoughts in the comments—I’m still figuring this out as I go.
