My AI bot burned through my API budget overnight. So I built an open-source tool to make sure it never happens again.
Source: Dev.to
I run an autonomous AI news engine called El Sapo Cripto. It monitors 25+ RSS feeds, scores articles, generates Spanish-language summaries with Gemini, creates images, and publishes to Telegram and X. All day, every day, zero human intervention. One morning I woke up to a ~$4 bill from Google. Not a lot, right? But my usual daily spend was under $0.25. Something was very wrong.
My app runs on Railway. Railway occasionally restarts containers. My budget tracker lived in memory. Restart = budget reset = the bot thought it had a fresh $0 balance and went wild. Gemini Flash calls piled up: summarizing, re-summarizing, processing articles it had already processed.

I caught it by accident, scrolling through the billing page. There was no alert, no dashboard, no way to see what happened without manually reading through logs. And here's the thing that really bothered me: the app was returning 200 OK on every request. Prometheus would've shown zero errors. Traditional monitoring would've said "everything's fine" while the bot was eating money.

I started looking around for tools to monitor LLM calls. Found a few options: Langfuse, Helicone, OpenLLMetry. All solid projects. But they all shared the same limitation: they show you what your LLM is doing, but they don't tell you if it's doing it well.

I don't just need to see that my bot made 200 Gemini calls. I need to know: Did the summaries get worse after I changed the prompt last Tuesday? Are error rates creeping up because the provider is having issues? Is the cost per article going up because responses are getting longer? Is the model quietly refusing to summarize certain topics?

I'm a Senior SDET by background, with 8+ years building test frameworks and quality infrastructure. In the testing world, we don't just log requests; we assert on behavior, detect regressions, and set quality gates. None of the existing LLM tools did that. So I built one.
toad-eye is an open-source observability toolkit for LLM systems, built on OpenTelemetry. You install it, run three commands, and get full visibility into every LLM call your app makes.
That gives you Grafana with 6 pre-built dashboards, Jaeger for trace inspection, and Prometheus for metrics. All pre-configured, all running locally.

The SDK auto-instruments your LLM calls. No wrappers, no code changes:

```typescript
import { initObservability } from 'toad-eye';

initObservability({
  serviceName: 'my-app',
  instrument: ['openai', 'anthropic'],
});

// every OpenAI and Anthropic call is now traced automatically
```
It tracks latency, token usage, cost, and error rates, broken down by provider and model. If you're using GPT-4o for some things and Claude for others, you see them side by side.

The features I built come directly from problems I hit in production.

**Budget guards.** The thing that would've saved me from the El Sapo incident. Set a daily budget, a per-user budget, or a per-model budget. toad-eye checks before every LLM call. If you're over budget, it can warn, block, or automatically downgrade to a cheaper model.

```typescript
initObservability({
  serviceName: 'my-app',
  budgets: { daily: 50, perModel: { 'gpt-4o': 30 } },
  onBudgetExceeded: 'block',
});
```
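To make the warn/block/downgrade idea concrete, here is a minimal sketch of that pre-call decision, written from scratch for illustration (the function name, the fallback model, and the return shape are my assumptions, not toad-eye's actual internals):

```typescript
type BudgetAction = "warn" | "block" | "downgrade";

interface BudgetDecision {
  allowed: boolean;
  model: string;      // the model the call should actually use
  warning?: string;   // present when the budget was exceeded
}

// Decide what to do with the next LLM call, given today's spend so far.
// "gemini-flash" is just a placeholder for whatever cheaper model you trust.
function enforceBudget(
  spentUsd: number,
  dailyLimitUsd: number,
  action: BudgetAction,
  requestedModel: string,
  fallbackModel = "gemini-flash"
): BudgetDecision {
  if (spentUsd < dailyLimitUsd) {
    return { allowed: true, model: requestedModel };
  }
  switch (action) {
    case "warn":
      return { allowed: true, model: requestedModel, warning: "daily budget exceeded" };
    case "downgrade":
      return { allowed: true, model: fallbackModel, warning: "over budget, downgraded model" };
    case "block":
      return { allowed: false, model: requestedModel, warning: "over budget, call blocked" };
  }
}
```

The key design point, learned the hard way: the `spentUsd` counter has to live somewhere that survives container restarts, not in process memory.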
**Semantic drift monitoring.** This is the one I'm most proud of. LLMs can silently degrade: the model returns 200 OK, but the answers are getting worse. Maybe the provider updated the model weights, maybe your prompt doesn't work well with the new version. Traditional monitoring can't catch this.

toad-eye saves embeddings of your "good" responses as a baseline, then periodically compares new responses against that baseline. If the average distance grows beyond a threshold, you get an alert: "semantic drift detected." Your model is still responding, but it's not responding the same way.

**Shadow guardrails.** You want to add validation rules (no PII in responses, must be valid JSON, etc.), but you're scared they'll block legitimate traffic. Shadow mode runs the validation on every response but doesn't block anything; it just records what would have been blocked. You see a "potential block rate" in Grafana and can tune your thresholds on real production data before flipping the switch.

**Agent tracing.** AI agents (the think-act-observe-repeat kind) are notoriously hard to debug. toad-eye records each step as a nested OpenTelemetry span. You can open Jaeger and see exactly which tools your agent decided to call, and why.
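The semantic drift check described above is conceptually simple. Here's an illustrative sketch (not toad-eye's internals) assuming cosine distance over response embeddings, with a threshold you'd tune on your own data:

```typescript
// Cosine distance: 0 means identical direction, 2 means opposite.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Flag drift when recent responses are, on average, far from every
// known-good baseline response. The 0.15 threshold is an arbitrary example.
function detectDrift(
  baseline: number[][], // embeddings of known-good responses
  recent: number[][],   // embeddings of the latest responses
  threshold = 0.15
): boolean {
  const avgDistance =
    recent
      .map(r => Math.min(...baseline.map(b => cosineDistance(r, b))))
      .reduce((sum, d) => sum + d, 0) / recent.length;
  return avgDistance > threshold;
}
```

In practice you would compute the embeddings with the same embedding model every time; mixing embedding models makes the distances meaningless.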
**Trace-to-test export.** Found a bad trace in production? One CLI command exports it as a test case for your eval suite. A production failure becomes a regression test.

```shell
npx toad-eye export-trace --output ./evals/
```
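The shape of such a conversion might look like this; the field names below are my assumptions for illustration, not toad-eye's actual export format:

```typescript
// A captured production trace (simplified, hypothetical fields).
interface Trace {
  prompt: string;
  response: string;
  model: string;
}

// An eval case derived from it: the bad output is kept for reference,
// and a human fills in assertions describing correct behavior.
interface EvalCase {
  input: string;
  model: string;
  badOutput: string;
  assertions: string[];
}

function traceToEvalCase(trace: Trace): EvalCase {
  return {
    input: trace.prompt,
    model: trace.model,
    badOutput: trace.response,
    assertions: [], // to be written by a human reviewer
  };
}
```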
**FinOps attribution.** Break down costs by team, user, and feature, not just by model. "The checkout team spent $28 yesterday, mostly on GPT-4o for classification. Switching to Flash would save 60%." That's the kind of insight that makes engineering managers pay attention.
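Mechanically, attribution like this boils down to tagging every LLM call with metadata and aggregating spend per tag. A minimal sketch (the record fields are assumed for illustration):

```typescript
// One LLM call with attribution metadata attached.
interface LlmCallRecord {
  team: string;
  feature: string;
  model: string;
  costUsd: number;
}

// Sum cost per team; the same pattern works for feature, user, or model.
function costByTeam(calls: LlmCallRecord[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const c of calls) {
    totals.set(c.team, (totals.get(c.team) ?? 0) + c.costUsd);
  }
  return totals;
}
```

The hard part is not the aggregation but the discipline of tagging every call site; auto-instrumentation helps because the tags ride along on the spans.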
Current state of the project:

- 154 tests passing
- 6 Grafana dashboards
- 13 tracked metrics
- 3 auto-instrumented SDKs (OpenAI, Anthropic, Gemini) with full streaming support
- Published on npm
- Self-hosted or cloud mode
- OTel GenAI semantic conventions compliant

toad-eye is the observability module of TOAD (Testing & Observability for AI Development), an ecosystem of tools for AI quality:

- toad-eye: observability (this article)
- toad-guard: LLM output validation with Zod
- toad-eval: test suites for prompts
- toad-ci: CI/CD quality gates for prompt changes
- toad-mcp: Claude Desktop integration via MCP

The idea is simple: AI systems deserve the same quality engineering rigor we apply to regular software. Observability is where it starts, but testing, validation, and CI gates are where quality actually happens.

```shell
npm install toad-eye
npx toad-eye init
npx toad-eye up
npx toad-eye demo
```
Open localhost:3100. You'll see your dashboards with data in under 2 minutes.

GitHub: https://github.com/vola-trebla/toad-eye
npm: https://www.npmjs.com/package/toad-eye

If you're running LLMs in production without observability, you're flying blind. And trust me, you don't want to find out about your budget problem from a billing email.

This is an early-stage project and I'm actively developing it. If you try it out, I'd love to hear what works, what doesn't, and what features you'd want next. Open an issue on GitHub or just drop a comment here.

The toad is watching. 🐸👁️