Your AI Agent Works. That's Why Finance Is About to Kill It.

Published: May 10, 2026 at 03:51 PM EDT
8 min read
Source: Dev.to

Overview

Two teams deployed the same multi‑agent workflow last quarter.

  • Team A: $0.12 per run.
  • Team B: $1.40 per run.

Same model, same task, same outcome quality. The $1.40 team delivered a polished POC, a demo that crushed, and a board deck full of green checkmarks. Six weeks into production, finance pulled the plug. The $0.12 team is now serving ten times the volume on a smaller infrastructure budget than the original pilot.

The gap does not come from model choice, prompt quality, or engineering talent. It comes from a single discipline that almost nobody in the agentic‑AI conversation is talking about out loud: tokenomics.

We talk endlessly about evals, context engineering, orchestration patterns, RAG pipelines. We do not talk about the unit economics of a single agent run — even though that number is the only thing that decides whether a system gets to live past the pilot phase.

This post explains why and outlines the four token cost surfaces and three architecture decisions that separate the $0.12 systems from the $1.40 ones.

Token Cost Surfaces

Every token a model processes falls into one of four buckets. Most teams only consciously think about one or two.

┌─────────────────────────────────────────────────────┐
│              TOKEN COST SURFACES                     │
│                                                     │
│  1. PROMPT TOKENS                                   │
│     System prompts, instructions, user input,       │
│     retrieved docs, tool schemas                    │
│     → Tax paid on every single call, forever        │
│                                                     │
│  2. CONTEXT TOKENS                                  │
│     Conversation history, agent scratchpad,         │
│     accumulated inter‑agent state                   │
│     → Grows fast in agent loops                     │
│                                                     │
│  3. REASONING TOKENS   ← most engineers miss this   │
│     Chain‑of‑thought thinking, internal planning    │
│     Invisible to the user, very visible on invoice  │
│     → Extended thinking models (o3, Claude 3.7)       │
│                                                     │
│  4. OUTPUT TOKENS                                   │
│     What the model writes back                      │
│     → Usually smallest bucket, easiest to control  │
└─────────────────────────────────────────────────────┘
  • Prompt tokens are the most underestimated. A 2,000‑token system prompt prepended to every call is a tax you pay on every interaction for the entire life of the system. At 100,000 calls/day, that’s 200 M tokens of overhead every day before the model does any useful work.
  • Context tokens are the most dangerous in agent systems because agents maintain state across turns, and that state compounds.
  • Reasoning tokens are the newest blind spot. Extended-thinking models (e.g., o3, Claude 3.7 Sonnet) consume tokens for internal planning that never appears in the response text but is billed all the same. A complex planning task can generate 10,000+ reasoning tokens before producing a single word of output.
  • Output tokens are the easiest win. They’re usually the smallest bucket and the most controllable—format instructions, response‑length caps, and structured output schemas all help here.
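
To make the four buckets concrete, here is a minimal sketch of a per-run cost estimate. `PriceSheet`, `RunUsage`, and `daily_prompt_overhead` are hypothetical names, and the prices are placeholders rather than any provider's actual rates. It also assumes reasoning tokens are billed at the output rate, which you should confirm against your provider's pricing page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceSheet:
    """Hypothetical prices in dollars per million tokens."""
    input_per_m: float
    output_per_m: float


@dataclass(frozen=True)
class RunUsage:
    """Token counts for one agent run, split by cost surface."""
    prompt: int
    context: int
    reasoning: int
    output: int

    def cost(self, prices: PriceSheet) -> float:
        # Prompt and context tokens are both billed as input.
        input_tokens = self.prompt + self.context
        # Assumption: reasoning tokens are billed at the output rate.
        output_tokens = self.reasoning + self.output
        return (input_tokens * prices.input_per_m
                + output_tokens * prices.output_per_m) / 1_000_000


def daily_prompt_overhead(system_prompt_tokens: int, calls_per_day: int) -> int:
    """Tokens spent per day just re-sending the system prompt."""
    return system_prompt_tokens * calls_per_day


prices = PriceSheet(input_per_m=2.0, output_per_m=10.0)  # placeholder values
usage = RunUsage(prompt=2_000, context=6_000, reasoning=10_000, output=300)
print(f"cost per run: ${usage.cost(prices):.4f}")
print(f"daily prompt overhead: {daily_prompt_overhead(2_000, 100_000):,} tokens")
```

With these placeholder numbers the reasoning bucket accounts for most of the bill, even though none of it reaches the user.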

Chatbot vs. Naïve Agent Loop

A chatbot’s token usage is predictable:

CHATBOT (1 call)
  User Input [~200 tokens]

  System Prompt + Context [~1,500 tokens]

  Model Response [~300 tokens]

  Total: ~2,000 tokens per interaction ✓

A naïve 5‑step agent loop quickly balloons:

5‑STEP AGENT LOOP (naïve implementation)

Turn 1: Planner reads full context → decides tool → 3,000 tokens

Tool A executes → returns 800‑token output

Turn 2: Executor reads context + tool output → 4,200 tokens

Turn 3: Sub‑agent reads accumulated history → 5,100 tokens

Turn 4: Verifier reads everything above → 6,800 tokens

Turn 5: Formatter reads accumulated context → 7,400 tokens

Total: ~27,000 tokens per run ← 13.5× the chatbot estimate

Every hop re-reads the full history. In practice, a nominal five-step loop can also involve eight, twelve, or even twenty model calls, each paying the full context cost.
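
The compounding is easy to reproduce. The sketch below rebuilds the per-turn figures from the loop above; `naive_loop_tokens` is a hypothetical helper, and the per-turn increments are simply the differences between consecutive turns in that example. The per-turn figures sum to 26,500, which the example rounds to ~27,000.

```python
CHATBOT_TOKENS = 2_000  # single-call baseline from the chatbot example


def naive_loop_tokens(first_turn: int, added_per_turn: list[int]) -> list[int]:
    """Tokens read at each turn when every turn re-reads the full history."""
    reads = [first_turn]
    for added in added_per_turn:
        # Each new turn pays for everything the previous turn read, plus what was added.
        reads.append(reads[-1] + added)
    return reads


# Increments derived from the example: 3,000 -> 4,200 -> 5,100 -> 6,800 -> 7,400
reads = naive_loop_tokens(3_000, [1_200, 900, 1_700, 600])
total = sum(reads)
ratio = total / CHATBOT_TOKENS

print(f"tokens read per turn: {reads}")
print(f"total tokens per run: {total:,}")
print(f"multiple of chatbot baseline: {ratio:.2f}x")
```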

Tokenomics at Scale

Users/day   Tokens/run (naïve)   Tokens/run (optimized)   Monthly delta
1,000       25,000               5,000                    600 M tokens
10,000      25,000               5,000                    6 B tokens
100,000     25,000               5,000                    60 B tokens

At enterprise volume, the difference between a thoughtful architecture and a naïve one isn't a few percent. It's a multiple of the entire bill: 5× in the table above, and more than 10× in the $0.12 versus $1.40 comparison.
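
The monthly delta figures can be reproduced with a few lines of arithmetic. This sketch assumes one run per user per day and a 30-day month; the post does not state either, but they are the values that match its numbers. `monthly_delta` is a hypothetical helper.

```python
def monthly_delta(users_per_day: int, naive: int, optimized: int,
                  runs_per_user: int = 1, days: int = 30) -> int:
    """Tokens saved per month by the optimized architecture."""
    return users_per_day * runs_per_user * (naive - optimized) * days


for users in (1_000, 10_000, 100_000):
    saved = monthly_delta(users, naive=25_000, optimized=5_000)
    print(f"{users:>7,} users/day -> {saved:,} tokens/month")
```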

Agent Architecture Map

┌─────────────────────────────────────────────────────────────┐
│                   AGENT ARCHITECTURE MAP                     │
│             [amber = where cost is decided]                  │
└─────────────────────────────────────────────────────────────┘

         USER REQUEST
               │
               ▼
    ┌─────────────────────┐
    │   ROUTING LAYER  🟡 │  ← Cost decided here: small vs large model
    │  (Intent classifier)│     GPT‑4o Mini vs GPT‑4o: 10‑30× price diff
    └──────────┬──────────┘
               │
               ▼
    ┌─────────────────────┐
    │  TOKEN BUDGET    🟡 │  ← Hard cap per hop, per run
    │  CONTROLLER         │     Rejects or truncates before it's too late
    └──────────┬──────────┘
               │
               ▼
    ┌─────────────────────────────────────────────────┐
    │                 AGENT LOOP                       │
    │                                                  │
    │   ┌─────────────┐      ┌─────────────────┐      │
    │   │  CONTEXT  🟡│      │  TOOL OUTPUTS 🟡│      │
    │   │  INPUTS     │      │  (RAG, APIs,    │      │
    │   │  (history,  │      │  sub‑agents)    │      │
    │   │  scratchpad)│      └────────┬────────┘      │
    │   └──────┬──────┘               │               │
    │          └──────────┬───────────┘               │
    │                     ▼                           │
    │           ┌─────────────────┐                   │
    │           │  SUPERVISOR  🟡 │                   │
    │           │  (orchestrator)│                   │
    │           └────────┬────────┘                   │
    │                    │ (handoff carries            │
    │                    │  full context payload)      │
    │                    ▼                             │
    │           ┌─────────────────┐                   │
    │           │  SUB‑AGENTS  🟡 │                   │
    │           └─────────────────┘                   │
    └──────────────────┬──────────────────────────────┘
                       │
                       ▼
    ┌─────────────────────┐
    │  CACHING LAYER   🟡 │  ← Prompt cache hits can cut cost 60‑90%
    │  (semantic cache)   │
    └──────────┬──────────┘
               │
               ▼
    ┌─────────────────────┐
    │  TOKEN TELEMETRY 🟡 │  ← Per‑hop visibility: where is cost going?
    │  + COST METER       │
    └─────────────────────┘

The amber boxes are where token cost is either compounded or controlled.

  • Top (routing + budget controller): cost decided before expensive work starts.
  • Middle (context inputs + agent loop): cost compounded — the primary bleed point.
  • Bottom (caching + telemetry): cost controlled and made visible.

The survival question is simple: how much of your amber is working for you versus against you?

The Three Architecture Decisions That Matter

Decision 1: Route Before You Reason

Not every task needs your most powerful model. This is the single highest‑leverage decision in your cost architecture.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# user_input holds the incoming request text

# Naïve: all tasks go to the same model
response = client.chat.completions.create(
    model="gpt-4o",   # $15/M output tokens
    messages=[{"role": "user", "content": user_input}]
)

# Optimized: route by complexity first
def route_to_model(task: str) -> str:
    """Intent classifier determines which model handles this request."""
    # classify_task_complexity() is your own classifier: a small model or a heuristic
    complexity = classify_task_complexity(task)

    if complexity == "simple":    # FAQ, format, classify
        return "gpt-4o-mini"      # $0.60/M output tokens (25× cheaper)
    elif complexity == "medium":  # Summarize, draft, analyze
        return "gpt-4o"           # $15/M output tokens
    else:                         # Multi-step reasoning, code generation
        return "o3"               # Premium reasoning: use sparingly

model = route_to_model(user_input)
response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": user_input}],
)

The routing classifier itself is a cheap call—a small model or even a regex‑based heuristic. Routing 70 % of traffic to a lightweight model while reserving the expensive reasoning model for genuinely complex tasks can drop total cost by 60–80 %.
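
Two pieces of this are worth sketching: the regex-based heuristic, and the arithmetic behind the savings claim. The keyword patterns below are hypothetical and would need tuning on real traffic. The cost model assumes every request uses the same number of tokens and that there are only two tiers, with the 25× price gap taken from the $0.60 versus $15 figures above.

```python
import re

# Hypothetical keyword patterns; a real system would tune these on logged requests.
COMPLEX_PATTERN = re.compile(
    r"\b(debug|refactor|prove|plan|step[- ]by[- ]step|write code)\b", re.IGNORECASE
)
MEDIUM_PATTERN = re.compile(
    r"\b(summari[sz]e|draft|analy[sz]e|compare|explain)\b", re.IGNORECASE
)


def classify_task_complexity(task: str) -> str:
    """Cheap heuristic router: returns 'simple', 'medium', or 'complex'."""
    if COMPLEX_PATTERN.search(task):
        return "complex"
    if MEDIUM_PATTERN.search(task) or len(task.split()) > 150:
        return "medium"
    return "simple"


def blended_cost(traffic_share: dict[str, float],
                 relative_price: dict[str, float]) -> float:
    """Average cost per request, relative to sending everything to the large model."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * relative_price[tier] for tier, share in traffic_share.items())


share = {"small": 0.70, "large": 0.30}
price = {"small": 1 / 25, "large": 1.0}  # 25x gap, per the figures above
savings = 1 - blended_cost(share, price)
print(f"estimated savings: {savings:.1%}")
```

Under these assumptions a 70/30 split saves about two thirds of the bill, which sits inside the 60–80 % range quoted above. The remaining cost is dominated by the 30 % of traffic that still reaches the large model.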

Token Budget Controller (example)

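class RunBudgetExceeded(Exception):
    """Raised when the next hop would push the run past its total token budget."""


class TokenBudgetController:
    """Hard token caps per agent hop: rejects or truncates before overspend."""

    def __init__(self, per_hop_limit: int = 4000, total_run_limit: int = 20000):
        self.per_hop_limit = per_hop_limit
        self.total_run_limit = total_run_limit
        self.tokens_spent = 0

    def check_and_trim(self, context: str, model: str) -> str:
        """Trim context to stay within budget before it hits the model."""
        # count_tokens() and trim_to_budget() are helpers you supply,
        # e.g. built on the tokenizer for your model.
        token_count = count_tokens(context, model)

        if token_count > self.per_hop_limit:
            # Drop the oldest material first; keep the system prompt and recent history
            context = trim_to_budget(context, self.per_hop_limit,
                                     strategy="recent_first")
            # Re-count so the run is charged for what is actually sent
            token_count = count_tokens(context, model)

        if self.tokens_spent + token_count > self.total_run_limit:
            raise RunBudgetExceeded(
                f"Run budget exhausted: {self.tokens_spent} spent, "
                f"{token_count} requested, limit {self.total_run_limit}"
            )

        self.tokens_spent += token_count
        return context

    def record_output(self, output_tokens: int) -> None:
        """Add the model's output tokens to the running total."""
        self.tokens_spent += output_tokens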

Budget controllers prevent runaway loops and force you to prune context to only what truly matters at each step.
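
The controller relies on two helpers, `count_tokens` and `trim_to_budget`, that the post does not define. Below is one possible sketch of them, not the author's implementation. It assumes a rough rule of thumb of about four characters per token, which is fine for budgeting but not for billing, and it trims whole lines from the oldest end. Pinning the system prompt is left out for brevity.

```python
def count_tokens(text: str, model: str = "") -> int:
    """Rough estimate: ~4 characters per token. Swap in a real tokenizer for accuracy."""
    return (len(text) + 3) // 4  # integer ceiling of len / 4


def trim_to_budget(context: str, limit: int, strategy: str = "recent_first",
                   model: str = "") -> str:
    """Keep whole lines, newest first, until the token limit would be exceeded."""
    if strategy != "recent_first":
        raise ValueError(f"unsupported strategy: {strategy}")
    kept: list[str] = []
    used = 0
    for line in reversed(context.splitlines()):
        cost = count_tokens(line + "\n", model)
        if used + cost > limit:
            break  # everything older than this line is dropped
        kept.append(line)
        used += cost
    return "\n".join(reversed(kept))  # restore chronological order


history = "\n".join(f"turn {i}: " + "x" * 40 for i in range(100))
trimmed = trim_to_budget(history, limit=50)
print(count_tokens(history), "->", count_tokens(trimmed), "tokens")
print(trimmed.splitlines()[0][:8], "is the oldest turn kept")
```

Trimming on line boundaries keeps each surviving turn intact, at the price of sometimes leaving a little budget unused.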

Prompt Caching

Prompt caching is one of the most under‑used optimizations in production AI systems. Anthropic, OpenAI, and Google all support it.

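# Without caching: the full system prompt is billed at the normal input rate on every call
# Cost: 2,000 tokens × N calls

# With caching: the prompt prefix is written to the cache once, then read at a discount
# Note: Anthropic's native Messages API takes these content blocks in its top-level
# `system` parameter rather than as a "system" message; the shape below matches
# OpenAI-style chat wrappers that pass cache_control through.
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,  # 2,000 tokens
                "cache_control": {"type": "ephemeral"}  # ← cache this
            }
        ]
    },
    {"role": "user", "content": user_message}
]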

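Anthropic bills cache reads at roughly 10 % of the normal input price, so a cached system prompt can be up to 90 % cheaper. At 10,000 calls/day on a 2,000-token system prompt, ignoring the cache-write premium and cache misses:

  • Without cache: 2,000 × 10,000 = 20 M tokens/day at full price
  • With cache: the equivalent of 200 × 10,000 = 2 M full-price tokens/day → 90 % reduction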
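
The 90 % figure is a best case. The sketch below adds the two things it leaves out: cache writes cost more than a normal read, and entries expire, so some calls pay the write price again. The multipliers (0.10× for reads, 1.25× for writes) are assumptions based on Anthropic's published short-lived cache pricing and should be checked against the current price list. `effective_input_tokens` is a hypothetical helper.

```python
def effective_input_tokens(prefix_tokens: int, calls: int, cache_writes: int,
                           read_multiplier: float = 0.10,
                           write_multiplier: float = 1.25) -> float:
    """Full-price-equivalent tokens billed for the cached prefix."""
    if not 0 <= cache_writes <= calls:
        raise ValueError("cache_writes must be between 0 and calls")
    cache_reads = calls - cache_writes
    return prefix_tokens * (cache_writes * write_multiplier
                            + cache_reads * read_multiplier)


uncached = 2_000 * 10_000
best_case = effective_input_tokens(2_000, 10_000, cache_writes=1)
# Assumption for illustration: the cache expires and is rewritten every 5 minutes all day.
sparse = effective_input_tokens(2_000, 10_000, cache_writes=288)

for label, tokens in (("best case", best_case), ("rewrite every 5 min", sparse)):
    print(f"{label}: {tokens:,.0f} tokens, {1 - tokens / uncached:.1%} reduction")
```

Even with a rewrite every five minutes the reduction stays in the high 80s, so the headline number is a fair approximation for steady traffic.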

Beyond prompt caching, semantic caching—reusing responses for similar queries—can eliminate entire classes of redundant agent runs. For workloads with many structurally similar questions, hit rates above 30 % are routinely achievable.
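
A semantic cache can be sketched in a few dozen lines. The version below is a toy: `embed` is a bag-of-words stand-in for a real embedding model, the 0.8 threshold is arbitrary, and lookup is a linear scan rather than a vector index. It is meant only to show the shape of the technique.

```python
import math
import re
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a stored response when a new query is similar enough to an old one."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def get(self, query: str) -> Optional[str]:
        q = embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))


cache = SemanticCache()
cache.put("How do I reset my password?", "Use the reset link on the login page.")
print(cache.get("How can I reset my password?"))  # similar wording: cache hit
print(cache.get("What is the refund policy?"))    # unrelated: cache miss
```

The threshold is the main design choice: set it too low and users get answers to questions they did not ask, set it too high and the hit rate collapses.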

Decisions 2 & 3

Tokenomics is an architecture constraint, not an after-the-fact optimization. The remaining two high-impact decisions are:

  1. Enforce hard token budgets per hop (as shown above).
  2. Instrument full‑stack token telemetry so you can see cost per hop, per run, and per user segment. Token telemetry is to AI systems what APM is to distributed services.
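
Per-hop telemetry does not need heavy tooling to get started. Here is a minimal sketch, assuming you can read input and output token counts from each API response's usage data. `TokenTelemetry` and the prices are hypothetical placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TokenTelemetry:
    """Accumulates token usage per hop for a single run."""
    input_price_per_m: float   # hypothetical $/M input tokens
    output_price_per_m: float  # hypothetical $/M output tokens
    # hop name -> [input_tokens, output_tokens]
    hops: dict = field(default_factory=lambda: defaultdict(lambda: [0, 0]))

    def record(self, hop: str, input_tokens: int, output_tokens: int) -> None:
        self.hops[hop][0] += input_tokens
        self.hops[hop][1] += output_tokens

    def cost_by_hop(self) -> dict:
        return {
            hop: (i * self.input_price_per_m + o * self.output_price_per_m) / 1e6
            for hop, (i, o) in self.hops.items()
        }

    def total_cost(self) -> float:
        return sum(self.cost_by_hop().values())

    def most_expensive_hop(self) -> str:
        costs = self.cost_by_hop()
        return max(costs, key=costs.get)


telemetry = TokenTelemetry(input_price_per_m=2.0, output_price_per_m=10.0)
telemetry.record("planner", input_tokens=3_000, output_tokens=200)
telemetry.record("verifier", input_tokens=6_800, output_tokens=150)
telemetry.record("planner", input_tokens=1_000, output_tokens=0)  # a retry

for hop, cost in telemetry.cost_by_hop().items():
    print(f"{hop:>10}: ${cost:.4f}")
print("most expensive hop:", telemetry.most_expensive_hop())
```

Tagging each record with a run ID and user segment would extend this to the per-run and per-segment views described above.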

Takeaways

  • Four cost surfaces: Prompt, Context, Reasoning, and Output tokens each require distinct control strategies.
  • Route before you reason: A routing layer that sends the majority of traffic to a lightweight model yields the highest ROI.
  • Telemetry is mandatory: Without per‑hop visibility you cannot manage token spend.
  • Cost‑aware architecture wins: Teams that win in production AI are those that embed tokenomics into design from day one, not those that simply pick the biggest model.

If you’ve shipped a production agent system—whether you’ve solved the economics or are still fighting it—I’d genuinely like to know what moved the needle for you. Drop it in the comments.
