How I cut my multi-turn LLM API costs by 90% (O(N²) → O(N))
If you build multi‑turn AI agents, you know the pain: API costs don’t grow linearly, they grow quadratically.
Every turn in a standard agent loop replays the full conversation history, so the input-token cost of turn N is proportional to N and the total cost across N turns is Θ(N²). I hit a wall where a single heavy day of coding consumed 97% of my weekly Anthropic quota.
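To see where the Θ(N²) comes from, here's a back-of-envelope sketch; the token counts are assumptions for illustration, not measurements:

```python
# Why replaying full history makes total input cost quadratic in N.
SYSTEM_PROMPT_TOKENS = 20_000   # persistent prefix re-sent on every call
TOKENS_PER_TURN = 500           # assumed average tokens a turn adds

def naive_input_tokens(n_turns: int) -> int:
    # Turn n re-sends the prefix plus all prior turns, so its cost
    # grows with n and the total over N turns grows like N**2.
    return sum(SYSTEM_PROMPT_TOKENS + turn * TOKENS_PER_TURN
               for turn in range(1, n_turns + 1))

for n in (10, 50, 100):
    print(n, naive_input_tokens(n))
# 10 -> 227,500; doubling N roughly quadruples the history term.
```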
Burnless – an open protocol and orchestration layer
Burnless flips the cost curve from O(N²) to O(N). The result: a real‑world API consumption reduction of ~16×, and benchmarks showing a 90.3% cost saving against a naïve Claude Opus call.
Burnless – Intent‑compressed intelligence orchestration.
A maestro that orchestrates any LLM from any vendor. Multi‑turn agent loops normally cost O(N²); Burnless makes them O(N).
- Vendor‑agnostic orchestration for multi‑agent workflows.
- Choose the model that conducts the orchestra (the Maestro/Brain) – Claude, GPT, Gemini, Mistral, a local Llama, etc. – and the models that execute each task (the Workers).
- Tiers are quality/cost bands (gold/silver/bronze), not tied to a specific provider.
- Run encoder and decoder on a local Ollama model for zero marginal cost on cheap stages.
How the cost curve is flattened
The math relies on two mechanisms working together:
- Shared Prefix Cache – The persistent system prompt (often 20k+ tokens) is cached using Anthropic’s prompt‑caching feature (TTL = 1 h). Switching between models from the same provider mid‑session does not invalidate this cache as long as the prefix is byte‑identical.
- Capsule History – Instead of storing raw transcripts, the Maestro model keeps ~80‑character compressed “capsules” of prior turns. A sketch of both mechanisms follows this list.
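Here's a minimal sketch of both mechanisms with the Anthropic Python SDK. The `cache_control` field is Anthropic's real prompt-caching API; the capsule strings and how Burnless actually formats them are my assumptions:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The big persistent prefix: must stay byte-identical across calls
# (and across same-provider model switches) for the cache to hit.
SYSTEM_PROMPT = open("system_prompt.txt").read()

# Hypothetical capsule store: ~80-char summaries instead of raw transcripts.
capsules = [
    "T1: user asked for auth refactor; split middleware into 3 modules",
    "T2: tests failed on token expiry; fixed clock skew in jwt_utils",
]

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        # Marks the prefix as cacheable. The default ephemeral TTL is
        # 5 minutes; the 1 h TTL mentioned above uses Anthropic's
        # extended-TTL option.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{
        "role": "user",
        "content": "History capsules:\n" + "\n".join(capsules)
                   + "\n\nNext task: add rate limiting to the login route.",
    }],
)
print(response.usage)  # cache_read_input_tokens > 0 on a warm cache
```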
Together, the quadratic history term collapses into a tiny linear one, while the massive system prompt is billed at cache‑read prices (≈10× cheaper than fresh input on Anthropic).
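Plugging in rough numbers (all assumed: a 20k‑token prefix, 500‑token raw turns, 20‑token capsules, cache reads billed at 10% of fresh input) shows the per‑turn effect:

```python
# Per-turn input cost on turn n, in fresh-input-equivalent tokens.
P, t, c = 20_000, 500, 20          # prefix, raw turn, ~80-char capsule
n = 10

naive    = P + n * t               # replay the full raw history every turn
burnless = 0.1 * P + n * c + t     # cached prefix at ~10% + tiny capsules

print(naive, round(burnless))      # 25000 vs 2700 -> roughly a 9x cut
```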
If you want the formal derivation, I published a reproducible benchmark that uses the Anthropic SDK directly and reads response.usage.
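For reference, here's the kind of tally such a benchmark performs. The `usage` attribute names are real fields on Anthropic SDK responses; the per‑million‑token prices are the published Claude 3 Opus rates at the time of writing, so double‑check current pricing:

```python
# Tally billed tokens from an Anthropic SDK response object.
usage = response.usage
fresh       = usage.input_tokens
cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
cache_read  = getattr(usage, "cache_read_input_tokens", 0) or 0
out         = usage.output_tokens

# Claude 3 Opus per-million-token rates: $15 in, $75 out,
# cache writes at 1.25x input, cache reads at 0.1x input.
cost = (fresh * 15 + cache_write * 18.75 + cache_read * 1.5 + out * 75) / 1e6
print(f"fresh={fresh} cached={cache_read} -> ${cost:.4f}")
```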
Benchmark (Claude 3 Opus, 10‑turn session)
| Configuration | Cost |
|---|---|
| Standalone (no cache) | $4.66 |
| Standalone (+ cache) | $0.65 |
| Burnless Maestro | $0.45 (−90.3%) |
The same math applies to any provider that exposes prompt caching and charges per input token.
Configuration example
```yaml
agents:
  gold:   { command: "claude --model claude-sonnet-4-6 -p" }   # The Brain
  silver: { command: "codex exec --sandbox workspace-write" }  # Execution
  bronze: { command: "ollama run qwen2.5-coder" }              # Local, zero marginal cost
```
Installation & quick start
```bash
pip install burnless
burnless setup
```
If you’re building agents and burning through tokens, give Burnless a try. I’d love to hear your thoughts on the architecture, especially if you’re working on local encoding/decoding for privacy!