How I Cut My AI API Bill by 90% With a Multi-Model Routing System

Published: 1 day ago (May 10, 2026 at 02:04 PM EDT)

3 min read

Source: Dev.to


Last month my Claude API bill was $847. This month it’s $73. Same output quality. Here’s the system I built.

I run multiple AI-powered services: content generation, email classification, SEO optimization, data extraction. Every call was going to Claude Sonnet because “it works.” But most of those calls didn’t need Sonnet-level intelligence.

Classifying an email as spam? That’s a Haiku job. Generating embeddings? Ollama handles that for free. Writing a full article? OK, that’s Sonnet. But only 15% of my calls actually needed the expensive model.

I built a routing layer that sits between my application code and the LLM providers. Every request gets classified by complexity, then routed to the cheapest model that can handle it.

    from empire_router import router

    # Auto-routed based on task complexity
    response = router.complete(
        prompt="Classify this email: …",
        task="classify",  # Routes to Haiku ($0.80/M tokens)
    )

    response = router.complete(
        prompt="Write a 1500-word article about…",
        task="generate",  # Routes to Sonnet ($3/M tokens)
    )

    embedding = router.embed("text to embed")  # Routes to Ollama (FREE)

Task classification:
├── Binary/classification → Haiku ($0.80/$4 per M tokens)
├── Embeddings → Ollama on VPS (FREE)
├── Simple extraction → DeepSeek ($0.27/M) or Groq (FREE)
├── Content generation → Sonnet ($3/$15 per M tokens)
└── Complex reasoning → Opus ($15/$75 per M tokens), used for <2% of calls
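The tree above can be read as a plain lookup from task type to model. Here is a minimal sketch of that idea, not the actual `empire_router` code: the task labels (`classify`, `embed`, `extract`, `generate`, `reason`), the `Route` type, the short model labels, and the choice to send unknown tasks to the `generate` route are all assumptions made for illustration. The prices are the input prices quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str              # hypothetical label, not a real provider model ID
    input_usd_per_m: float  # input price per million tokens, as quoted in the article


# Assumed task labels; the prices mirror the tree above (input side only).
ROUTES = {
    "classify": Route("haiku", 0.80),
    "embed": Route("ollama/nomic-embed-text", 0.0),
    "extract": Route("deepseek", 0.27),
    "generate": Route("sonnet", 3.00),
    "reason": Route("opus", 15.00),
}


def pick_route(task: str) -> Route:
    """Look up the route by task type; unknown tasks use the 'generate' route (an assumption)."""
    return ROUTES.get(task, ROUTES["generate"])


def input_cost_usd(task: str, input_tokens: int) -> float:
    """Estimate the input-side cost only; output tokens are billed separately."""
    return pick_route(task).input_usd_per_m * input_tokens / 1_000_000
```

Keying the lookup on the task label rather than on anything derived from the prompt text keeps the routing decision explicit at the call site, which matches the `task="classify"` style of the calls shown earlier.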

1. Task-type routing, not content-length routing

My first attempt routed by prompt length. Terrible idea. A short prompt like “Is this email spam?” only needs a cheap model. An equally short prompt like “Design the architecture for a distributed cache” needs an expensive one. The task type is what determines model selection, not the token count.

2. Fallback chains, not single-model assignments

    ROUTING_CHAINS = {
        "classify": ["ollama/llama3.1:8b", "groq/llama3", "haiku", "sonnet"],
        "generate": ["sonnet", "opus"],
        "embed": ["ollama/nomic-embed-text", "voyage-3"],
    }

If the primary model is down or rate-limited, the request cascades to the next option. Requests almost never fail outright (0.3/day in the results below); a fallback just costs slightly more.

3. Quality gates on cheap models

The router doesn’t blindly trust cheap model output. For tasks where accuracy matters, it runs a quality check:

- Send to cheap model first
- Score the response (confidence, format validity, coherence)
- If score < threshold → retry on next model in chain
- Log the escalation for future routing optimization

In practice, Haiku handles 94% of classification tasks without escalation.

4. Prompt caching for repeated patterns

System prompts that exceed 500 characters get cached. For classification tasks that run the same system prompt thousands of times, this cuts the cost of those cached input tokens by about 90% after the first call.
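The fallback chain and the quality gate can share one loop: walk the chain in order, skip models that error out, and escalate when a response scores below the threshold. What follows is a minimal sketch under stated assumptions, not the author's implementation. `ProviderError`, `complete_with_fallback`, the `call_model` and `score` callables, the 0.7 threshold, and the shape of the escalation log are all hypothetical. Only `ROUTING_CHAINS` comes from the article.

```python
from typing import Callable, Optional

# Chains copied from the article; everything else in this block is illustrative.
ROUTING_CHAINS = {
    "classify": ["ollama/llama3.1:8b", "groq/llama3", "haiku", "sonnet"],
    "generate": ["sonnet", "opus"],
    "embed": ["ollama/nomic-embed-text", "voyage-3"],
}


class ProviderError(Exception):
    """Hypothetical error standing in for an outage or a rate limit."""


def complete_with_fallback(
    task: str,
    prompt: str,
    call_model: Callable[[str, str], str],  # (model, prompt) -> response text
    score: Callable[[str], float],          # response text -> quality score in [0, 1]
    threshold: float = 0.7,                 # assumed cutoff, not from the article
    escalations: Optional[list] = None,     # caller-supplied log of skipped models
) -> tuple[str, str]:
    """Return (model, response) for the first model that answers well enough."""
    log = escalations if escalations is not None else []
    for model in ROUTING_CHAINS[task]:
        try:
            response = call_model(model, prompt)
        except ProviderError as exc:
            # Provider down or rate-limited: record it and try the next model.
            log.append((task, model, f"error: {exc}"))
            continue
        quality = score(response)
        if quality >= threshold:
            return model, response
        # Quality gate failed: record the escalation and try the next model.
        log.append((task, model, f"score {quality:.2f} below {threshold}"))
    raise RuntimeError(f"every model in the '{task}' chain failed or scored too low")
```

Passing `call_model` and `score` in as plain callables keeps the loop independent of any provider SDK, so it can be exercised with stubs. The escalation log is the piece that would feed later routing decisions, for example moving a task's starting point further down the chain if the first model keeps failing the gate.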

Metric                 Before    After     Change
Monthly cost           $847      $73       -91%
Avg latency            2.1s      0.8s      -62%
Failed requests        12/day    0.3/day   -97%
Quality (human eval)   4.2/5     4.1/5     -2%

The quality dip is within noise. The latency improvement comes from Haiku being faster than Sonnet, plus Ollama embeddings having no network round-trip.

I run Ollama on a Contabo VPS (CPU-only, $15/mo). It handles:

- All embeddings (nomic-embed-text)
- Simple classification fallback (llama3.1:8b)
- Data extraction on non-sensitive content

Everything that needs quality or handles sensitive data goes to API providers. The VPS pays for itself in 2 days of avoided API calls.

The routing pattern works with any LLM provider combination. The key insight: treat model selection as a runtime decision, not a deployment decision.

I write about practical AI cost optimization and infrastructure at wealthfromai.com. The full router is open for anyone building similar multi-model systems. DM me if you want the architecture details.

