What Breaks After Your AI Demo Works
Source: Dev.to

A few weeks ago I built a small AI API. Nothing fancy, just a simple endpoint:

```python
response = llm(prompt)
```
It worked. Requests came in. The model responded. Everything looked good—until the second week.
The First Question
A teammate asked:
“Which request generated this output?”
I checked the logs. There was nothing useful there.
- No request ID
- No trace
- No connection between the prompt and the output
The system worked — but it wasn’t traceable.
The Second Question
Very quickly another question appeared.
“Why did our AI bill jump yesterday?”
We were calling models through an API wrapper, but we weren’t recording:
- Token usage
- Model pricing
- Request‑level cost
We had built an AI system that spent money invisibly.
The Third Question
Then something more subtle happened. A user reported that an output looked wrong. The model had responded successfully, but the answer was clearly not useful. Which raised another question:
“How do we know if a model response is acceptable?”
We didn’t. The API only knew whether the model responded, not whether the result made sense.
The Realization
The model wasn’t the problem. The system around the model was. AI APIs are fundamentally different from traditional APIs. They introduce three operational challenges:
| Challenge | What it means |
|---|---|
| Observability | Can we trace what happened? |
| Economics | How much did this request cost? |
| Output reliability | Was the response acceptable? |
Without solving these, AI systems quickly become hard to operate. So I built a small reference project to explore this problem. I called it Maester.
The Minimal Reliability Architecture
A reliable AI API request should pass through a few structured steps.
```
Client Request
      ↓
API Middleware (request_id + trace_id)
      ↓
Route Handler
      ↓
Model Gateway
      ↓
Cost Metering
      ↓
Evaluation
      ↓
Structured Logs
      ↓
Response
```
Each step adds operational clarity.
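The flow above can be sketched end to end in a few lines. Everything here is illustrative, not Maester's actual API: `fake_gateway` and `fake_cost` are stand-ins for the real gateway and meter, and the evaluation step is reduced to a single non-empty check.

```python
import uuid

def fake_gateway(prompt: str) -> tuple[str, dict]:
    """Stand-in for a model call: returns a response and token usage."""
    response = f"echo: {prompt}"
    return response, {"input_tokens": len(prompt.split()), "output_tokens": 2}

def fake_cost(usage: dict,
              in_per_1k: float = 0.00015,
              out_per_1k: float = 0.00060) -> float:
    """Token usage -> dollars, using per-1k-token prices."""
    return (usage["input_tokens"] / 1000) * in_per_1k \
         + (usage["output_tokens"] / 1000) * out_per_1k

def handle_request(prompt: str) -> dict:
    request_id = str(uuid.uuid4())          # middleware: attach identifiers
    trace_id = str(uuid.uuid4())
    response, usage = fake_gateway(prompt)  # model gateway
    cost = fake_cost(usage)                 # cost metering
    passed = bool(response.strip())         # evaluation: non-empty check
    return {                                # structured, queryable record
        "request_id": request_id,
        "trace_id": trace_id,
        "response": response,
        "cost_usd": round(cost, 8),
        "passed": passed,
    }

record = handle_request("summarize this document")
```

The point is not the stubs but the shape: every request leaves behind one record that links identity, cost, and validity.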
1. Observability: Making AI Requests Traceable
The first primitive is observability. Every request should be traceable.
In Maester, middleware attaches request_id and trace_id to the request context:
```python
request_id = new_id()
trace_id = start_trace()
```
These identifiers propagate through the entire request lifecycle. Operations are wrapped in spans:
```python
with span("model_generate", model=model_name) as sp:
    response = gateway.generate(prompt)
```
The span records:
- Operation name
- Duration
- Attributes
Example log output:
```json
{
  "event": "span_end",
  "span": "model_generate",
  "duration_ms": 412,
  "model": "gpt-4o-mini"
}
```
This gives immediate insight into where time is spent.
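A span like this needs very little machinery. The sketch below is a minimal version built on `contextlib`, not Maester's implementation; it appends events to an in-memory list where a real system would ship them to a log pipeline, but the field names mirror the log line above.

```python
import json
import time
from contextlib import contextmanager

LOG_LINES = []  # stand-in for stdout or a log shipper

@contextmanager
def span(name: str, **attributes):
    """Minimal span: measure duration, then emit one structured log event."""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = round((time.perf_counter() - start) * 1000)
        event = {"event": "span_end", "span": name,
                 "duration_ms": duration_ms, **attributes}
        LOG_LINES.append(json.dumps(event))

with span("model_generate", model="gpt-4o-mini"):
    time.sleep(0.01)  # stand-in for the actual model call
```

Because the `try/finally` runs even on exceptions, failed model calls still produce a span, which is exactly when you need one.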
2. Cost Metering: AI Systems Spend Money Per Request
Unlike traditional APIs, AI requests have direct monetary cost. Token usage translates into real spend, so every request should produce a cost record.
Example:
```python
cost_record = meter.record(
    model=model_name,
    input_tokens=usage.input_tokens,
    output_tokens=usage.output_tokens,
)
```
The meter uses a pricing catalog:
```python
MODEL_PRICING = {
    "gpt-4o-mini": {
        "input_per_1k": 0.00015,
        "output_per_1k": 0.00060,
    }
}
```
The request returns:
```json
{
  "input_tokens": 1200,
  "output_tokens": 350,
  "total_cost_usd": 0.00039
}
```
Now the API can answer the critical question: “What did this request cost?”
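The arithmetic behind that record is simple: tokens divided by 1000, times the per-1k price, summed across input and output. A sketch of such a meter, using the pricing numbers from the catalog above (the class name and method shape are hypothetical, not Maester's API):

```python
MODEL_PRICING = {
    "gpt-4o-mini": {"input_per_1k": 0.00015, "output_per_1k": 0.00060},
}

class CostMeter:
    """Turns per-request token usage into a cost record via a pricing catalog."""

    def __init__(self, pricing: dict):
        self.pricing = pricing

    def record(self, model: str, input_tokens: int, output_tokens: int) -> dict:
        price = self.pricing[model]
        cost = (input_tokens / 1000) * price["input_per_1k"] \
             + (output_tokens / 1000) * price["output_per_1k"]
        return {
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "total_cost_usd": round(cost, 8),
        }

meter = CostMeter(MODEL_PRICING)
record = meter.record("gpt-4o-mini", input_tokens=1200, output_tokens=350)
# 1.2k * $0.00015 + 0.35k * $0.00060 = $0.00039
```

Keeping pricing in a catalog rather than hard-coding it per call site matters: model prices change, and a single table is the only place that needs updating.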
3. Evaluation: Successful Calls Aren’t Always Correct
Even if a model responds successfully, the output may still be unusable. That is where evaluation comes in.
In Maester, responses pass through a simple evaluator:
```python
result = evaluator.evaluate(prompt, response)
```
Current checks include:
- Non‑empty response
- Required term presence
- Maximum length
Example evaluation result:
```json
{
  "passed": true,
  "checks": {
    "non_empty": true,
    "required_terms": true,
    "max_length": true
  }
}
```
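Each of these checks is a one-liner. A minimal evaluator along those lines (function signature and defaults are assumptions for illustration, not Maester's evaluator):

```python
def evaluate(prompt: str, response: str,
             required_terms: tuple[str, ...] = (),
             max_length: int = 4000) -> dict:
    """Run cheap, deterministic checks on a model response."""
    checks = {
        "non_empty": bool(response.strip()),
        "required_terms": all(t.lower() in response.lower()
                              for t in required_terms),
        "max_length": len(response) <= max_length,
    }
    return {"passed": all(checks.values()), "checks": checks}

result = evaluate("Explain TLS",
                  "TLS encrypts traffic between client and server.",
                  required_terms=("TLS",))
```

Deterministic checks like these are cheap enough to run on every request, which is what makes them a useful first gate before heavier validation.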
This pattern becomes more important as systems grow. Evaluation can evolve into:
- Structured output validation
- Hallucination detection
- Policy enforcement
- Safety filters
Why Not Just Use OpenTelemetry?
I thought about adopting OpenTelemetry at the very beginning of this project, but decided to roll a home‑made solution instead because … (the original content cuts off here).
OpenTelemetry vs. Maester
OpenTelemetry solves a different problem. It provides:
- Distributed tracing
- Metrics exporters
- Telemetry pipelines
Maester (GitHub) focuses on application‑level reliability primitives. Think of it as the layer that answers:
| Question | Answered by Maester |
|---|---|
| What happened in this AI request? | ✅ |
| What model was called? | ✅ |
| What did it cost? | ✅ |
| Did the result pass validation? | ✅ |
These signals can later be exported to full observability stacks.
The Worker Path
AI systems rarely run only inside HTTP requests. Background jobs often run:
- Batch inference
- Evaluation pipelines
- Data‑enrichment tasks
Maester includes a worker example to demonstrate that the same reliability primitives apply there. Worker execution uses the same tools:
- Tracing
- Cost metering
- Evaluation
- Structured logs
Reliability should not depend on the entrypoint.
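As a sketch of that idea, a batch worker is the HTTP path minus the HTTP: the loop below (illustrative names, with evaluation reduced to a non-empty check) produces the same per-item record a request handler would.

```python
import uuid

def run_batch(prompts, model_fn):
    """Batch inference that emits the same per-item record as the HTTP path."""
    records = []
    for prompt in prompts:
        response = model_fn(prompt)              # model gateway
        records.append({
            "request_id": str(uuid.uuid4()),     # tracing
            "prompt": prompt,
            "response": response,
            "passed": bool(response.strip()),    # evaluation
        })
    return records

records = run_batch(["first prompt", "second prompt"],
                    lambda p: p.upper())  # stand-in model
```

Because the record shape is shared, dashboards and cost reports built for the API work unchanged for background jobs.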
What This Architecture Achieves
With only a few modules, the system can answer the following questions:
| Question | Component |
|---|---|
| What request generated this output? | tracing |
| How long did the model call take? | spans |
| How many tokens were used? | cost meter |
| What did it cost? | pricing model |
| Was the output valid? | evaluator |
These signals turn a black‑box AI API into a traceable system.
Final Thought
Most reliability discussions around AI focus on models, but reliability often comes from system design, not model quality.
A simple architecture that records:
- What happened
- What it costs
- Whether the result was acceptable
can dramatically improve how AI systems are operated.
The earlier these ideas are introduced into a system, the easier that system will be to maintain.
Note: This article was originally published on my engineering blog where I document the design of Maester, an AI SaaS infrastructure system built in public.
