I Spent $12 on 4 AI Questions. Then Linux Foundation Made MCP Official.
Source: Dev.to
Why I Chose Assistants API (And Why You Probably Did Too)
Let me be honest: Assistants API is genuinely impressive. The developer experience is incredible. Here’s what pulled me in:
The Promise
- Built‑in RAG out of the box
- Persistent conversation threads
- Automatic tool calling
- File upload and instant querying
- “Just works” in 2 hours
The Appeal
As someone running FPL Hub (2,000+ users, 500 K+ daily API calls), I know the value of managed infrastructure. Assistants API felt like the right abstraction layer. Why manage chunking strategies, vector stores, and context windows when OpenAI handles it all?
I uploaded a PDF, asked my questions, and got accurate responses. The prototype worked beautifully—until I checked my bill.
The Hidden Cost Structure Nobody Warns You About
OpenAI’s pricing page lists:
- GPT‑4o: $5 input / $15 output per 1 M tokens
- Code Interpreter: $0.03 per session
- File Search: $0.10 / GB / day
That looks reasonable on paper. What the pricing page doesn't show is how many tokens a single question actually consumes.
The Real Math for My “Simple” Query
PDF (10 pages, ~5K tokens)
↓
Vector Store automatic chunking → 50,000 tokens
↓
Retrieval augmentation per query → 20,000 tokens
↓
Context window (conversation history) → 8,000 tokens
↓
Tool call overhead → 3,000 tokens
↓
Your actual query + response → 250 tokens
────────────────────────────────────
Total per question: ~81,000 tokens = $0.81
Four questions broke down like this:
- Model costs: $3.24 (324 K tokens)
- Code Interpreter sessions: $0.06
- File Search storage (3 days): $0.30
- Hidden retrieval costs: $8.87
Total: $12.47
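To sanity-check the arithmetic, here is a small sketch that reproduces the per-question figure and the four-question bill. The blended price of roughly $10 per 1 M tokens is an assumption inferred from the $0.81 ≈ 81,000-token figure above, not an official rate:

// Sanity check of the estimates above (assumed blended price: ~$10 per 1M tokens,
// inferred from $0.81 ≈ 81K tokens; GPT-4o itself is billed $5 in / $15 out)
const tokensPerQuestion = {
  chunking: 50_000,       // vector store automatic chunking
  retrieval: 20_000,      // retrieval augmentation per query
  context: 8_000,         // conversation history
  toolOverhead: 3_000,    // tool call overhead
  queryAndResponse: 250,  // the actual question and answer
};
const blendedPricePerToken = 10 / 1_000_000;
const totalTokens = Object.values(tokensPerQuestion).reduce((a, b) => a + b, 0);
console.log(totalTokens, totalTokens * blendedPricePerToken); // ~81,250 tokens ≈ $0.81

// The four-question bill, line by line
const bill = { model: 3.24, codeInterpreter: 0.06, storage: 0.30, hiddenRetrieval: 8.87 };
console.log(Object.values(bill).reduce((a, b) => a + b, 0)); // 12.47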
Why Costs Spiral
- Token multiplication you can’t control – Assistants API automatically chunks documents for vector search. A 5 K‑token PDF becomes ~50 K tokens in storage, and each retrieval multiplies that further.
- Context window bloat – Every follow‑up question reloads the entire conversation history. Question 1 costs $0.81; by question 4 the cost rises to $3.50 because of accumulated context (the sketch after this list shows the mechanism).
- Storage fees compound daily – $0.10 / GB / day adds up quickly:
  - 1 GB document ≈ $3 /month
  - 10 GB knowledge base ≈ $30 /month
- Hidden retrieval costs – The File Search tool doesn't just retrieve chunks; it injects them into every query, so you pay embedding, similarity-search, and prompt-token costs on each turn, and those costs grow with the accumulated conversation history.
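Here is that sketch of the context-bloat mechanism. The token counts are purely illustrative; the point is that an Assistants thread re-sends every earlier message with each new question:

// Illustrative only: each question re-sends the entire accumulated thread,
// so the tokens billed per question keep climbing even for identical questions.
type Message = { role: "user" | "assistant"; tokens: number };
const thread: Message[] = [];

function askQuestion(questionTokens: number, answerTokens: number): number {
  const historyTokens = thread.reduce((sum, m) => sum + m.tokens, 0); // re-sent as context
  thread.push({ role: "user", tokens: questionTokens });
  thread.push({ role: "assistant", tokens: answerTokens });
  return historyTokens + questionTokens + answerTokens; // tokens billed for this turn
}

for (let i = 1; i <= 4; i++) {
  console.log(`Question ${i}: ~${askQuestion(150, 4_000)} tokens billed`);
}
// Question 1: ~4150, Question 2: ~8300, Question 3: ~12450, Question 4: ~16600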
Real‑World Cost Projections
Customer support bot (1 K conversations/day)
- 5 messages per conversation
- 2 knowledge‑base documents (≈500 pages)
- Storage: $6 /day → $180 /month
- Queries: ~30 M tokens/day → ≈ $300 /day
Total: ≈ $9,180 /month
Document analysis app
- User uploads 5 PDFs (≈250 pages)
- 10 questions per document, 3 follow‑ups each
Cost per user session: $45
100 users: $4,500 /month
My actual use case
- 4 test questions, 1 small PDF (10 pages), 2 conversation threads
Cost: $12.47 → projected $3,100 /month at 1 K users.
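If you want to run these projections against your own workload, a rough helper like the hypothetical one below makes the scaling obvious. Every input is an assumption to replace with measured values; the example call uses figures that reproduce the support-bot estimate above (≈30 M tokens and $6 of storage per day):

// Hypothetical back-of-the-envelope projector; every input is an assumption
// to be replaced with your own measured numbers.
interface UsageProfile {
  conversationsPerDay: number;
  messagesPerConversation: number;
  tokensPerMessage: number;      // retrieval + context + query, measured
  pricePerMillionTokens: number; // blended input/output price
  storageGb: number;
  storagePricePerGbDay: number;  // $0.10 for Assistants File Search
}

function projectMonthlyCost(p: UsageProfile): number {
  const dailyTokens =
    p.conversationsPerDay * p.messagesPerConversation * p.tokensPerMessage;
  const dailyModelCost = (dailyTokens / 1_000_000) * p.pricePerMillionTokens;
  const dailyStorageCost = p.storageGb * p.storagePricePerGbDay;
  return (dailyModelCost + dailyStorageCost) * 30;
}

// Support-bot scenario: 1K conversations x 5 messages, ~6K tokens per message (assumed)
console.log(projectMonthlyCost({
  conversationsPerDay: 1_000,
  messagesPerConversation: 5,
  tokensPerMessage: 6_000,
  pricePerMillionTokens: 10,
  storageGb: 60,
  storagePricePerGbDay: 0.10,
})); // ≈ 9180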
The MCP Alternative: Same Features, 99 % Cost Reduction
What is MCP?
Model Context Protocol (MCP) is an open standard for connecting AI models to data sources and tools—think USB‑C for AI. As of December 9, 2025, it’s an official Linux Foundation project.
Founding members include Anthropic, OpenAI, Google, Microsoft, AWS, Cloudflare, Bloomberg, and Block.
Architecture Comparison
Traditional Assistants API flow
flowchart LR
A[User] --> B[OpenAI API]
B --> C[Thread Storage]
B --> D[Vector Store]
B --> E[GPT‑4]
E --> F[Response]
style C fill:#f9f,stroke:#333,stroke-width:2px
style D fill:#f9f,stroke:#333,stroke-width:2px
Metered components: thread storage ($0.10 / GB / day), vector store retrieval, token usage.
MCP flow
flowchart LR
A[User] --> B[MCP Client]
B --> C[Your MCP Server]
C --> D[Cloudflare Workers]
D --> E[Any Model]
E --> F[Response]
You control storage and retrieval; Cloudflare Workers provide 10 M free requests/month.
Key Architectural Differences
- Client‑side memory – Conversation state is stored on the client, eliminating daily storage fees.
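A minimal sketch of what this looks like in practice, assuming a hypothetical chat tool on the MCP server; the history is just an array the client owns and sends with each call:

// The conversation lives in a plain array on the client, not in billed thread storage.
const history: { role: "user" | "assistant"; content: string }[] = [];

async function ask(question: string) {
  history.push({ role: "user", content: question });
  const answer = await mcp.callTool("chat", { messages: history }); // hypothetical tool name
  history.push({ role: "assistant", content: String(answer) });
  return answer;
}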
- Multi‑model support – One MCP server can route requests to any model:

// Switch models per request
const response = await mcp.callTool("search_documents", {
  query: userQuery,
  model: "groq/llama-3.3-70b-versatile" // Free tier
});
- Edge deployment on Cloudflare Workers – Deploy globally in minutes with no cold starts:

export default {
  async fetch(request, env) {
    const mcp = new MCPServer(env);
    return mcp.handle(request);
  }
};
- Complete cost control – You decide chunk limits, caching, and model pricing before sending a request:

const searchConfig = {
  maxChunks: 3,
  chunkSize: 500,
  cacheStrategy: "lru",
  model: "groq-free"
};
const estimatedCost = calculateTokens(chunks) * modelPrice;
if (estimatedCost > threshold) {
  // fallback to cheaper model or reduce chunks
}
My MCP Implementation
// MCP Server on Cloudflare Workers
import { MCPServer } from "@modelcontextprotocol/sdk";
// Tool surface (payload types omitted for brevity)
interface MCPTools {
  search_documents: (query: string, maxChunks?: number) => Promise<unknown>;
  analyze_pdf: (fileId: string) => Promise<unknown>;
  summarize_conversation: () => Promise<unknown>;
}
// Cost breakdown for the same 4 questions:
const costs = {
  workersAI_embeddings: 0.011 / 1000, // $0.011 per 1 K tokens (example)
  vectorize_storage: 0,               // Included in free tier
  // ...additional cost items as needed
};
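And here is a sketch of what the search_documents tool can look like on the server side, which is where the cost control actually happens. The Workers AI embedding model and the Vectorize binding names are assumptions from my setup; the key point is that maxChunks caps how much retrieved text ever reaches the model:

// Sketch of a cost-capped search_documents handler (the env.AI and env.VECTORIZE
// bindings and the embedding model are assumptions; adjust to your own setup).
async function searchDocuments(env: any, query: string, maxChunks = 3): Promise<string[]> {
  // Embed the query with Workers AI
  const embedding = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [query] });

  // Ask Vectorize for at most maxChunks matches
  const results = await env.VECTORIZE.query(embedding.data[0], {
    topK: maxChunks,
    returnMetadata: "all",
  });

  // Only these chunks are ever added to the prompt, so token spend stays bounded
  // (assumes chunk text was stored in match metadata at index time)
  return results.matches.map((m: any) => String(m.metadata?.text ?? ""));
}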
Using MCP, the same four‑question workflow costs a fraction of the $12.47 the Assistants API charged. With client‑side memory, capped retrieval, and free‑tier models, the only metered pieces are the ones you choose to pay for: that is the roughly 99% cost reduction promised above, and it shows how an open protocol can dramatically reduce the cost of AI‑driven applications.