You're Flying Blind With Your AI Agents. Here's How to Fix It.

Published: February 28, 2026 at 11:58 AM EST
5 min read
Source: Dev.to

The Story

Last Tuesday at 2 AM I woke up to a $340 bill from OpenAI.
My coding agent had been running all evening. I thought it was just refactoring some tests, but it had hit an infinite‑retry loop on a malformed API response and burned through 8 million tokens.

I had no idea until the bill arrived.

If you’re building with AI agents (coding assistants, autonomous task runners, chatbots), you’re probably flying blind too. Below is the problem and how to fix it.

What You Don’t See

When you spin up a coding agent like Aider, Cursor, or a custom LangChain workflow, you see the final output:

  • The code it wrote
  • The answer it gave
  • The task it completed

What’s hidden:

  1. How many LLM calls it made to get there
  2. Which models it used (did it really need GPT‑4, or would 3.5 have worked?)
  3. The actual prompts and responses
  4. How long each call took
  5. Which calls failed and were retried
  6. What you’re paying per task

You get a monthly bill from OpenAI or Anthropic, but you can’t trace it back to specific tasks or prompts. It’s like running a web service with no logs and no monitoring—you wouldn’t do that. So why do it with AI?

The Consequences of No Visibility

  1. Surprise bills – Your agent may use far more tokens than expected (e.g., rereading the same file 15 times or sending the entire codebase as context on every call). You won’t know until the bill arrives.
  2. Silent performance issues – Is the slowdown due to LLM latency, network issues, or a bad prompt? Without traces you’re guessing.
  3. No way to optimize – You can’t improve what you can’t measure. Could you use a cheaper model for some calls? Are you over‑prompting? Is caching working? No clue.
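"You can't improve what you can't measure" starts with knowing what each call costs. Here's a minimal sketch of per-call cost accounting from token counts; the per-1K-token prices below are illustrative placeholders, not current provider rates:

```python
# Minimal per-call cost accounting from token usage.
# Prices are illustrative placeholders, not real provider rates.

PRICE_PER_1K = {  # model -> (input, output) USD per 1K tokens
    "gpt-4": (0.03, 0.06),
    "gpt-3.5-turbo": (0.0005, 0.0015),
}

def call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Return the dollar cost of one call given its token counts."""
    in_price, out_price = PRICE_PER_1K[model]
    return (prompt_tokens / 1000) * in_price + (completion_tokens / 1000) * out_price

# A task that made 12 calls becomes a simple sum over its call log:
task_calls = [("gpt-4", 3000, 500)] * 8 + [("gpt-3.5-turbo", 1000, 200)] * 4
total = sum(call_cost(m, p, c) for m, p, c in task_calls)
print(f"task cost: ${total:.2f}")
```

Once every call carries its token counts, "what does one task cost?" stops being a mystery that only the monthly invoice can answer.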

Why Existing Observability Platforms Fall Short

Typical advice: use an observability platform (LangSmith, Weights & Biases, Arize, Langfuse, etc.). They’re great, but they have two problems:

  1. Instrumentation overhead – You must instrument every agent, framework, and custom script. If you mix LangChain with raw OpenAI calls and Anthropic SDK calls, getting consistent traces is a nightmare.
  2. Partial coverage – They only see what you send them. Forget to wrap a call and it’s invisible. If a library makes a direct API call, you miss it.
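To make the instrumentation burden concrete, here's a toy tracing decorator of the kind these platforms ask you to apply at every call site (the names `traced` and `ask_model` are hypothetical, not any real SDK). The failure mode is built in: any call site you forget to wrap simply never appears in your traces.

```python
import time
import functools

TRACES = []  # in-process trace log; a real setup would ship these to a platform

def traced(provider: str):
    """Decorator you must remember to wrap around *every* LLM call site."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                TRACES.append({
                    "provider": provider,
                    "fn": fn.__name__,
                    "seconds": time.perf_counter() - start,
                })
        return wrapper
    return decorate

@traced("openai")
def ask_model(prompt: str) -> str:
    return f"echo: {prompt}"  # stand-in for a real API call

def ask_untraced(prompt: str) -> str:
    return f"echo: {prompt}"  # this call site is invisible to your traces

ask_model("hi")
ask_untraced("hi")
print(len(TRACES))  # only 1 of the 2 calls was captured
```

Multiply this by every framework, SDK, and one-off script in your stack and "partial coverage" is the default outcome, not an edge case.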

The Real Solution: A Central Proxy Router

What you actually want is a single chokepoint that sees every LLM call automatically, without having to remember to instrument anything.

Architecture

Your Agent → Router → OpenAI / Anthropic / Local Model

Instead of:

Your Agent → OpenAI API
Your Agent → Anthropic API
Your Agent → Local Model

The router sees every request and response, logs them, tracks timing, calculates costs, and shows you what’s really happening.
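The chokepoint idea fits in a few lines. This is a deliberately simplified in-process sketch (not NadirClaw's actual implementation): because every backend is reached through `route()`, logging happens exactly once, for every call, by construction.

```python
import time

class Router:
    """Single chokepoint: every call goes through route(), so every call is logged."""

    def __init__(self, backends):
        self.backends = backends  # model name -> callable
        self.log = []

    def route(self, model: str, prompt: str) -> str:
        start = time.perf_counter()
        response = self.backends[model](prompt)
        self.log.append({
            "model": model,
            "prompt": prompt,
            "response": response,
            "seconds": time.perf_counter() - start,
        })
        return response

# Stand-ins for real provider clients:
router = Router({
    "gpt-4": lambda p: "gpt-4 says: " + p,
    "local": lambda p: "local says: " + p,
})

router.route("gpt-4", "refactor these tests")
router.route("local", "summarize this diff")
assert len(router.log) == 2  # nothing escapes the chokepoint
```

A real router does the same thing over HTTP, speaking an OpenAI-compatible API so existing clients don't need to change.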

Introducing NadirClaw

This is why we built NadirClaw (full disclosure: I maintain it; it's open source at https://github.com/doramirdor/NadirClaw). It started as a cost-saving tool that routed expensive calls to cheaper models when possible; the observability piece turned out to be far more valuable.

When every LLM call flows through a central point, you automatically get:

  • Full request/response logs – See the exact prompts and raw responses. Debug weird behavior by reading the actual conversation.
  • Cost tracking per task – Tag requests by agent, task, or user. Identify expensive outliers.
  • Latency metrics – p50, p95, p99 latency for each model/provider. Spot slow calls and timeouts early.
  • Error rates & retries – How often do calls fail? Which models have the highest error rates? Are retries intelligent or just burning money?
  • Provider comparison – Compare OpenAI, Anthropic, and local models head‑to‑head on cost, speed, and reliability.
  • Zero instrumentation required – Point your app at the router instead of the API. Everything is logged automatically.
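The latency percentiles above are straightforward to derive from a call log. A minimal sketch using only the standard library (the latency values are made-up sample data):

```python
import statistics

# Call latencies (seconds) pulled from a router's log -- made-up sample data
latencies = [0.8, 0.9, 1.1, 1.2, 1.3, 1.5, 1.8, 2.0, 4.5, 9.7]

def percentile(data, p):
    """p-th percentile via 100 inclusive quantile cut points."""
    return statistics.quantiles(sorted(data), n=100, method="inclusive")[p - 1]

print(f"p50={percentile(latencies, 50):.2f}s "
      f"p95={percentile(latencies, 95):.2f}s "
      f"p99={percentile(latencies, 99):.2f}s")
```

Note how the tail (p95/p99) diverges sharply from the median here; that gap is exactly the kind of signal averages hide.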

Real‑World Example

Last week a coding agent was supposed to write unit tests. It worked, but felt slow.

Dashboard insights:

  • Average task: 12 LLM calls (far more than expected)
  • 8 of those calls hit GPT‑4
  • 6 GPT‑4 calls had identical prompts

Root cause: a caching bug caused the agent to re‑analyze the same file on every iteration.

  • Before fix: ~90 s per task, $0.40 API cost
  • After fix: ~25 s per task, $0.08 API cost

We fixed it in 10 minutes once we could see the actual call pattern.
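Spotting "6 identical prompts" in a call log is a one-liner once the data exists. A sketch of the duplicate-detection step (the prompts below are invented examples, not the actual agent's):

```python
from collections import Counter
import hashlib

def prompt_fingerprint(prompt: str) -> str:
    """Stable short hash so identical prompts collide in the counter."""
    return hashlib.sha256(prompt.encode()).hexdigest()[:12]

# Prompts pulled from one task's call log -- the same analysis prompt 6 times
call_log = (["Analyze utils.py and list its public functions"] * 6
            + ["Write a unit test for parse_config",
               "Fix the failing assertion"])

counts = Counter(prompt_fingerprint(p) for p in call_log)
duplicates = {fp: n for fp, n in counts.items() if n > 1}
print(duplicates)  # one fingerprint repeated 6 times -> likely a caching bug
```

Repeated fingerprints within a single task are a strong hint that a cache layer is being bypassed.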

Security & Integration

  • The router runs locally (or in your VPC). Your prompts and responses never leave your infrastructure.
  • If you already have an observability stack (Datadog, New Relic, etc.), you can export traces via OpenTelemetry.
  • The built‑in dashboard is sufficient for most teams.

Getting Started

  1. Spin up NadirClaw

    • Docker:

      docker run -p 3000:3000 doramirdor/nadirclaw
    • Or install via npm:

      npm i -g nadirclaw
  2. Point your agent to the router instead of the provider:

    export OPENAI_API_BASE=http://localhost:3000
    # or configure your SDK/client accordingly
  3. Add your API keys to the router config (config.yaml or environment variables).

  4. Open the dashboard at http://localhost:3000/dashboard.
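If you use the official OpenAI Python SDK rather than the `OPENAI_API_BASE` environment variable, the same redirection is a one-line client setting. This assumes the router exposes an OpenAI-compatible `/v1` endpoint on port 3000; check NadirClaw's docs for the exact path and how it handles API keys.

```python
from openai import OpenAI

# Point the SDK at the router instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="router-managed",  # real provider keys live in the router config
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "hello"}],
)
```

From the agent's point of view nothing changes; every call just happens to pass through the chokepoint on its way out.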

You’ll immediately see every call, every response, costs, and timing—no instrumentation, no SDK changes, nothing.

Final Thought

You wouldn’t run a production service without logs and metrics. Don’t run AI agents without them either.
A central router gives you observability, cost control, and confidence—all with zero code changes.

Give it a try and start optimizing based on real data instead of guesswork. 🚀

And the next time a surprise bill lands at 2 AM, you'll be able to trace it back to the exact prompts that caused it.

Maintainer

Amir Dor maintains NadirClaw, an open‑source LLM router focused on observability and cost optimization. Find it on GitHub:

github.com/doramirdor/NadirClaw
