The Price Per Million Tokens Is Lying to You

Published: March 4, 2026 at 08:57 PM EST
5 min read
Source: Dev.to


Introduction

About 9 months ago, I was building a RAG system (for those who don’t know, it’s a kind of enhanced memory system for AI agents). One of the agentic flows needed semantic similarity, and I had GPT‑4o running it because, well, it was OpenAI’s flagship model. Best model, best results, right?

I decided to actually test that assumption. After a few days of systematic testing, I found that a model costing roughly 10× less (GPT‑4.1‑mini at the time) gave me equal or better results on that specific task. Not marginally better, but noticeably better, on a task I had assumed required the most recent, most expensive option.

That experience broke something in how I thought about AI model selection, and I’ve spent the months since digging into why this happens and how widespread it is.

The pricing page tells you almost nothing

Every AI provider publishes a price per million tokens (input tokens, output tokens, maybe a cached rate). Simple enough, but this number is close to meaningless in production because it ignores two things that completely change the math.

  1. Tokenization – Different models tokenize the same input differently. GPT‑5, Claude Sonnet 4.5, Gemini 3.0 Flash, etc., will produce different token counts for the exact same prompt. Sometimes the difference is 10‑15 %; sometimes it’s more. So “price per million tokens” is comparing apples to oranges from the start, because a million tokens from one model does not represent the same amount of work as a million tokens from another.

  2. Output volume – This is the bigger factor. Reasoning‑heavy, chain‑of‑thought models generate a lot of tokens. A model like DeepSeek Reasoner, gpt‑5.2‑pro, or Claude Opus 4.6 will think through a problem step‑by‑step, producing many tokens. You ask two models the same question: one gives a 200‑token answer, the other gives 3,000 tokens of reasoning plus a 200‑token answer. The second model might be cheaper per million tokens and still cost you 5× more on the actual task.

I’ve seen this repeatedly: a model that is “10× cheaper” on the pricing page ends up being more expensive in practice because of how it handles the workload. Conversely, a model that looks expensive on paper can be cheaper per task because it’s efficient with its tokens.
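The arithmetic behind that inversion fits in a few lines. A minimal sketch, where the model names, prices, and token counts are entirely hypothetical, chosen only to illustrate the effect:

```python
def cost_per_task(input_tokens, output_tokens, price_in, price_out):
    """Dollar cost of one call, given per-million-token prices."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# Hypothetical "Model A": pricier per token, but terse.
# 1,000 input tokens, 200-token answer, $10/M in, $30/M out.
a = cost_per_task(1_000, 200, price_in=10.0, price_out=30.0)

# Hypothetical "Model B": 4x cheaper per token, but emits
# 3,000 reasoning tokens before its 200-token answer.
b = cost_per_task(1_000, 3_200, price_in=2.5, price_out=7.5)

print(f"Model A: ${a:.4f} per task")
print(f"Model B: ${b:.4f} per task ({b / a:.2f}x the cost of A)")
```

Despite a 4× price advantage on paper, the verbose model comes out more expensive per task, because the 16× blowup in billed output tokens swamps the per-token discount.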

Why generic benchmarks don’t help here

The instinct when choosing a model is to check the leaderboards: MMLU, HumanEval, LMArena, LiveBench. These are useful for understanding general capability, but they tell you nothing about your specific use case.

  • I’m not being contrarian; this is the reality of how these models work.
  • Variables are incredibly subtle: the way you phrase a prompt, the structure of your input, even the position of a comma can change which model performs best.
  • A model that scores 92 % on MMLU might score 60 % on your classification task, while a model that scores 85 % on MMLU nails it at 95 %.

And none of these benchmarks account for cost. You could be using the “best” model on the leaderboard and spending 10× what you need to, because a model three tiers below it handles your specific workload just as well—if not better.

What actually matters in production

If you’re running AI in production, or even just evaluating which model to use for a project, the metrics that matter are:

  • Accuracy on your task – Not a generic benchmark. Use your actual prompts, data, and expected outputs.
  • Real token cost – Not “price per million,” but what the model actually costs per task, per call, per pipeline run. This includes input tokens (which vary by tokenizer), output tokens (which vary wildly by model behavior), and any reasoning tokens that get billed.
  • Latency – Time to first token and total completion time. For agentic workflows or user‑facing features, this matters as much as cost.
  • Consistency – Some models give brilliant output 70 % of the time and garbage the other 30 %; others are boringly reliable. For production, boring and reliable wins every time.

Getting these numbers requires actually running your workload across multiple models—not once, not with a single prompt, but systematically, on a schedule, with enough variation to get statistically meaningful results. Most teams don’t do this because it’s tedious and time‑consuming. They pick the model that “feels right” based on what seems to work and leaderboard rankings, ship it, and never look back.
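A systematic comparison doesn't have to be elaborate. Here is a minimal sketch of such a harness; `evaluate` and `dummy_model` are illustrative names, and the stand-in model, prices, and test cases should be replaced with real API clients, real rates, and your own labeled data:

```python
import statistics
import time

def evaluate(model_fn, cases, price_in, price_out):
    """Run (prompt, expected) pairs through model_fn and report
    accuracy, mean latency, and total cost for the run.
    model_fn must return (answer, input_tokens, output_tokens)."""
    correct, latencies, cost = 0, [], 0.0
    for prompt, expected in cases:
        start = time.perf_counter()
        answer, tok_in, tok_out = model_fn(prompt)
        latencies.append(time.perf_counter() - start)
        correct += (answer.strip() == expected)
        cost += tok_in / 1e6 * price_in + tok_out / 1e6 * price_out
    return {
        "accuracy": correct / len(cases),
        "mean_latency_s": statistics.mean(latencies),
        "cost_per_run": cost,
    }

# Stand-in for a real API call: always answers "positive"
# and reports rough token counts.
def dummy_model(prompt):
    return "positive", len(prompt) // 4, 1

report = evaluate(
    dummy_model,
    [("Great product!", "positive"), ("Terrible.", "negative")],
    price_in=2.50,
    price_out=10.00,
)
print(report)
```

Run the same harness against each candidate model, on the same cases, on a schedule, and the accuracy/cost/latency trade-off stops being a guess.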

That’s how you end up spending $10k/month on API calls when $2k would give you the same output quality.

The real lesson

The AI model market is moving fast: new models every few weeks, price cuts, capability jumps, new providers entering. The model that was optimal for your use case three months ago might not be optimal today.

The only way to actually know what works best for you is to test it—on your data, with your prompts, measuring the things that matter for your specific situation. Everything else is guessing.

I learned this the hard way when I discovered I was overpaying by 10× on a pipeline I assumed needed a flagship model. Since then, I’ve made it a practice to re‑evaluate model selection whenever a significant new release drops. The cost savings and performance improvements make it worth it every single time.


Marc Kean Paker is the founder of OpenMark, an AI model benchmarking platform designed to move teams away from leaderboard guessing and toward deterministic, cost‑aware model selection.
