What Breaks After Your AI Demo Works
Source: Dev.to

A few weeks ago I built a small AI API. Nothing fancy, just a simple endpoint:

```python
response = llm(prompt)
```
It worked. Requests came in. The model responded. Everything looked good—until the second week.
The First Question
A teammate asked:
“Which request generated this output?”
I checked the logs. There was nothing useful there.
- No request ID
- No trace
- No connection between the prompt and the output
The system worked — but it wasn’t traceable.
The Second Question
Very quickly another question appeared.
“Why did our AI bill jump yesterday?”
We were calling models through an API wrapper, but we weren’t recording:
- Token usage
- Model pricing
- Request‑level cost
We had built an AI system that spent money invisibly.
The Third Question
Then something more subtle happened. A user reported that an output looked wrong. The model had responded successfully, but the answer was clearly not useful. Which raised another question:
“How do we know if a model response is acceptable?”
We didn’t. The API only knew whether the model responded, not whether the result made sense.
The Realization
The model wasn’t the problem. The system around the model was. AI APIs are fundamentally different from traditional APIs. They introduce three operational challenges:
| Challenge | What it means |
|---|---|
| Observability | Can we trace what happened? |
| Economics | How much did this request cost? |
| Output reliability | Was the response acceptable? |
Without solving these, AI systems quickly become hard to operate. So I built a small reference project to explore this problem. I called it Maester.
The Minimal Reliability Architecture
A reliable AI API request should pass through a few structured steps.
```
Client Request
      ↓
API Middleware (request_id + trace_id)
      ↓
Route Handler
      ↓
Model Gateway
      ↓
Cost Metering
      ↓
Evaluation
      ↓
Structured Logs
      ↓
Response
```
Each step adds operational clarity.
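The flow above can be sketched end to end in a few lines. Everything here is illustrative, not Maester's actual API: `fake_gateway` and `fake_cost` are stand-ins for the real gateway and meter, and the evaluation step is reduced to a single non-empty check.

```python
import uuid

def fake_gateway(prompt: str) -> tuple[str, dict]:
    """Stand-in for a model call: returns a response and token usage."""
    response = f"echo: {prompt}"
    return response, {"input_tokens": len(prompt.split()), "output_tokens": 2}

def fake_cost(usage: dict,
              in_per_1k: float = 0.00015,
              out_per_1k: float = 0.00060) -> float:
    """Token usage -> dollars, using per-1k-token prices."""
    return (usage["input_tokens"] / 1000) * in_per_1k \
         + (usage["output_tokens"] / 1000) * out_per_1k

def handle_request(prompt: str) -> dict:
    request_id = str(uuid.uuid4())          # middleware: attach identifiers
    trace_id = str(uuid.uuid4())
    response, usage = fake_gateway(prompt)  # model gateway
    cost = fake_cost(usage)                 # cost metering
    passed = bool(response.strip())         # evaluation: non-empty check
    return {                                # structured, queryable record
        "request_id": request_id,
        "trace_id": trace_id,
        "response": response,
        "cost_usd": round(cost, 8),
        "passed": passed,
    }

record = handle_request("summarize this document")
```

The point is not the stubs but the shape: every request leaves behind one record that links identity, cost, and validity.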
1. Observability: Making AI Requests Traceable
The first primitive is observability. Every request should be traceable.
In Maester, middleware attaches request_id and trace_id to the request context:
```python
request_id = new_id()
trace_id = start_trace()
```
These identifiers propagate through the entire request lifecycle. Operations are wrapped in spans:
```python
with span("model_generate", model=model_name) as sp:
    response = gateway.generate(prompt)
```
The span records:
- Operation name
- Duration
- Attributes
Example log output:
```json
{
  "event": "span_end",
  "span": "model_generate",
  "duration_ms": 412,
  "model": "gpt-4o-mini"
}
```
This gives immediate insight into where time is spent.
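A span like this needs very little machinery. The sketch below is a minimal version built on `contextlib`, not Maester's implementation; it appends events to an in-memory list where a real system would ship them to a log pipeline, but the field names mirror the log line above.

```python
import json
import time
from contextlib import contextmanager

LOG_LINES = []  # stand-in for stdout or a log shipper

@contextmanager
def span(name: str, **attributes):
    """Minimal span: measure duration, then emit one structured log event."""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = round((time.perf_counter() - start) * 1000)
        event = {"event": "span_end", "span": name,
                 "duration_ms": duration_ms, **attributes}
        LOG_LINES.append(json.dumps(event))

with span("model_generate", model="gpt-4o-mini"):
    time.sleep(0.01)  # stand-in for the actual model call
```

Because the `try/finally` runs even on exceptions, failed model calls still produce a span, which is exactly when you need one.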
2. Cost Metering: AI Systems Spend Money Per Request
Unlike traditional APIs, AI requests have direct monetary cost. Token usage translates into real spend, so every request should produce a cost record.
Example:
```python
cost_record = meter.record(
    model=model_name,
    input_tokens=usage.input_tokens,
    output_tokens=usage.output_tokens,
)
```
The meter uses a pricing catalog:
```python
MODEL_PRICING = {
    "gpt-4o-mini": {
        "input_per_1k": 0.00015,
        "output_per_1k": 0.00060,
    }
}
```
The request returns:
```json
{
  "input_tokens": 1200,
  "output_tokens": 350,
  "total_cost_usd": 0.00039
}
```
Now the API can answer the critical question: “What did this request cost?”
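The arithmetic behind that record is simple: tokens divided by 1000, times the per-1k price, summed across input and output. A sketch of such a meter, using the pricing numbers from the catalog above (the class name and method shape are hypothetical, not Maester's API):

```python
MODEL_PRICING = {
    "gpt-4o-mini": {"input_per_1k": 0.00015, "output_per_1k": 0.00060},
}

class CostMeter:
    """Turns per-request token usage into a cost record via a pricing catalog."""

    def __init__(self, pricing: dict):
        self.pricing = pricing

    def record(self, model: str, input_tokens: int, output_tokens: int) -> dict:
        price = self.pricing[model]
        cost = (input_tokens / 1000) * price["input_per_1k"] \
             + (output_tokens / 1000) * price["output_per_1k"]
        return {
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "total_cost_usd": round(cost, 8),
        }

meter = CostMeter(MODEL_PRICING)
record = meter.record("gpt-4o-mini", input_tokens=1200, output_tokens=350)
# 1.2k * $0.00015 + 0.35k * $0.00060 = $0.00039
```

Keeping pricing in a catalog rather than hard-coding it per call site matters: model prices change, and a single table is the only place that needs updating.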
3. Evaluation: Successful Calls Aren’t Always Correct
Even if a model responds successfully, the output may still be unusable. That is where evaluation comes in.
In Maester, responses pass through a simple evaluator:
```python
result = evaluator.evaluate(prompt, response)
```
Current checks include:
- Non‑empty response
- Required term presence
- Maximum length
Example evaluation result:
```json
{
  "passed": true,
  "checks": {
    "non_empty": true,
    "required_terms": true,
    "max_length": true
  }
}
```
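Each of these checks is a one-liner. A minimal evaluator along those lines (function signature and defaults are assumptions for illustration, not Maester's evaluator):

```python
def evaluate(prompt: str, response: str,
             required_terms: tuple[str, ...] = (),
             max_length: int = 4000) -> dict:
    """Run cheap, deterministic checks on a model response."""
    checks = {
        "non_empty": bool(response.strip()),
        "required_terms": all(t.lower() in response.lower()
                              for t in required_terms),
        "max_length": len(response) <= max_length,
    }
    return {"passed": all(checks.values()), "checks": checks}

result = evaluate("Explain TLS",
                  "TLS encrypts traffic between client and server.",
                  required_terms=("TLS",))
```

Deterministic checks like these are cheap enough to run on every request, which is what makes them a useful first gate before heavier validation.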
This pattern becomes more important as systems grow. Evaluation can evolve into:
- Structured output validation
- Hallucination detection
- Policy enforcement
- Safety filters
Why Not Just Use OpenTelemetry?
I thought about adopting OpenTelemetry at the very beginning of this project, but decided to roll a home‑made solution instead because … (the original content cuts off here).
OpenTelemetry vs. Maester
OpenTelemetry solves a different problem. It provides:
- Distributed tracing
- Metrics exporters
- Telemetry pipelines
Maester (GitHub) focuses on application‑level reliability primitives. Think of it as the layer that answers:
| Question | Answered by Maester |
|---|---|
| What happened in this AI request? | ✅ |
| What model was called? | ✅ |
| What did it cost? | ✅ |
| Did the result pass validation? | ✅ |
These signals can later be exported to full observability stacks.
The Worker Path
AI systems rarely run only inside HTTP requests. Background jobs often run:
- Batch inference
- Evaluation pipelines
- Data‑enrichment tasks
Maester includes a worker example to demonstrate that the same reliability primitives apply there. Worker execution uses the same tools:
- Tracing
- Cost metering
- Evaluation
- Structured logs
Reliability should not depend on the entrypoint.
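As a sketch of that idea, a batch worker is the HTTP path minus the HTTP: the loop below (illustrative names, with evaluation reduced to a non-empty check) produces the same per-item record a request handler would.

```python
import uuid

def run_batch(prompts, model_fn):
    """Batch inference that emits the same per-item record as the HTTP path."""
    records = []
    for prompt in prompts:
        response = model_fn(prompt)              # model gateway
        records.append({
            "request_id": str(uuid.uuid4()),     # tracing
            "prompt": prompt,
            "response": response,
            "passed": bool(response.strip()),    # evaluation
        })
    return records

records = run_batch(["first prompt", "second prompt"],
                    lambda p: p.upper())  # stand-in model
```

Because the record shape is shared, dashboards and cost reports built for the API work unchanged for background jobs.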
What This Architecture Achieves
With only a few modules, the system can answer the following questions:
| Question | Component |
|---|---|
| What request generated this output? | tracing |
| How long did the model call take? | spans |
| How many tokens were used? | cost meter |
| What did it cost? | pricing model |
| Was the output valid? | evaluator |
These signals turn a black‑box AI API into a traceable system.
Final Thought
Most reliability discussions around AI focus on models, but reliability often comes from system design, not model quality.
A simple architecture that records:
- What happened
- What it costs
- Whether the result was acceptable
can dramatically improve how AI systems are operated.
The earlier these ideas are introduced into a system, the easier that system will be to maintain.
Note: This article was originally published on my engineering blog where I document the design of Maester, an AI SaaS infrastructure system built in public.
