I Spent $12 on 4 AI Questions. Then Linux Foundation Made MCP Official.

Published: December 15, 2025 at 08:10 AM EST
4 min read
Source: Dev.to

Why I Chose Assistants API (And Why You Probably Did Too)

Let me be honest: Assistants API is genuinely impressive. The developer experience is incredible. Here’s what pulled me in:

The Promise

  • Built‑in RAG out of the box
  • Persistent conversation threads
  • Automatic tool calling
  • File upload and instant querying
  • “Just works” in 2 hours

The Appeal

As someone running FPL Hub (2,000+ users, 500 K+ daily API calls), I know the value of managed infrastructure. Assistants API felt like the right abstraction layer. Why manage chunking strategies, vector stores, and context windows when OpenAI handles it all?

I uploaded a PDF, asked my questions, and got accurate responses. The prototype worked beautifully—until I checked my bill.
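
For context, the prototype flow was minimal. The sketch below assumes the openai Node SDK's beta Assistants endpoints; the file name and question are placeholders, not my actual data.

// Prototype sketch: upload a PDF, enable File Search, ask a question in a thread
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Upload the PDF for retrieval
const file = await client.files.create({
  file: fs.createReadStream("report.pdf"),
  purpose: "assistants",
});

// Assistant with the File Search tool enabled
const assistant = await client.beta.assistants.create({
  model: "gpt-4o",
  tools: [{ type: "file_search" }],
});

// Each conversation is a thread; history accumulates server-side
const thread = await client.beta.threads.create();
await client.beta.threads.messages.create(thread.id, {
  role: "user",
  content: "Summarize the key points in this document.",
  attachments: [{ file_id: file.id, tools: [{ type: "file_search" }] }],
});

// Run the assistant and wait for the answer
const run = await client.beta.threads.runs.createAndPoll(thread.id, {
  assistant_id: assistant.id,
});
const answer = await client.beta.threads.messages.list(thread.id, { run_id: run.id });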

The Hidden Cost Structure Nobody Warns You About

OpenAI’s pricing page lists:

  • GPT‑4o: $5 input / $15 output per 1 M tokens
  • Code Interpreter: $0.03 per session
  • File Search: $0.10 / GB / day

That looks reasonable, but the actual charges can be surprising.

The Real Math for My “Simple” Query

The PDF was 10 pages (~5K tokens). A single question against it consumed roughly:

  • Vector Store automatic chunking → 50,000 tokens
  • Retrieval augmentation per query → 20,000 tokens
  • Context window (conversation history) → 8,000 tokens
  • Tool call overhead → 3,000 tokens
  • Your actual query + response → 250 tokens

Total per question: ~81,000 tokens = $0.81
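
In code form, the same back-of-the-envelope math looks like this; the blended ~$10 per 1 M tokens is an assumption on my part that happens to reproduce the observed charge.

// Per-question cost estimate using the token counts from the breakdown above
const perQuestionTokens = {
  vectorStoreChunking: 50_000,
  retrievalAugmentation: 20_000,
  conversationContext: 8_000,
  toolCallOverhead: 3_000,
  queryAndResponse: 250,
};

const totalTokens = Object.values(perQuestionTokens).reduce((sum, t) => sum + t, 0);
const blendedRatePerMillion = 10; // assumed USD per 1M tokens (input/output blend)
const costPerQuestion = (totalTokens / 1_000_000) * blendedRatePerMillion;

console.log(totalTokens.toLocaleString(), costPerQuestion.toFixed(2)); // 81,250  0.81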

Four questions broke down like this

  • Model costs: $3.24 (324 K tokens)
  • Code Interpreter sessions: $0.06
  • File Search storage (3 days): $0.30
  • Hidden retrieval costs: $8.87

Total: $12.47

Why Costs Spiral

  1. Token multiplication you can’t control – Assistants API automatically chunks documents for vector search. A 5 K‑token PDF becomes ~50 K tokens in storage, and each retrieval multiplies that further.
  2. Context window bloat – Every follow‑up question reloads the entire conversation history. Question 1 costs $0.81; by question 4 the cost rises to $3.50 because of accumulated context (a toy model of this follows the list).
  3. Storage fees compound daily – $0.10 / GB / day adds up quickly:
    • 1 GB document ≈ $3 /month
    • 10 GB knowledge base ≈ $30 /month
  4. Hidden retrieval costs – The File Search tool not only retrieves chunks; it also augments each query with those chunks, incurring embedding, similarity search, and prompt token costs multiplied by conversation history.
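
Point 2 is the one that surprised me most, so here is a toy model of it. The carried-context figure is an illustrative assumption chosen to mirror the $0.81 → ~$3.50 growth described above, not a measured value.

// Toy model of context bloat: each follow-up re-sends the accumulated history,
// so tokens (and cost) per question grow roughly linearly.
const firstQuestionTokens = 81_250;   // total from the per-question breakdown
const carriedContextPerTurn = 90_000; // assumed history re-sent on each follow-up
const blendedRatePerMillion = 10;     // assumed USD per 1M tokens

for (let q = 1; q <= 4; q++) {
  const tokens = firstQuestionTokens + (q - 1) * carriedContextPerTurn;
  const cost = (tokens / 1_000_000) * blendedRatePerMillion;
  console.log(`Question ${q}: ~${tokens.toLocaleString()} tokens ≈ $${cost.toFixed(2)}`);
}
// Question 1: ~81,250 tokens ≈ $0.81 ... Question 4: ~351,250 tokens ≈ $3.51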

Real‑World Cost Projections

Customer support bot (1 K conversations/day)

  • 5 messages per conversation
  • 2 knowledge‑base documents (≈500 pages)
  • Storage: $6 /day → $180 /month
  • Queries: ~300 K tokens/day → $300 /day

Total: ≈ $9,180/month
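
That total is just the per-day figures over a 30-day month, as a quick sanity check shows:

// Sanity check: support-bot projection from the per-day figures above
const storagePerDay = 6;   // USD/day for the knowledge-base documents
const queriesPerDay = 300; // USD/day for query traffic
const monthly = 30 * (storagePerDay + queriesPerDay);
console.log(`$${monthly.toLocaleString()}/month`); // $9,180/month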

Document analysis app

  • User uploads 5 PDFs (≈250 pages)
  • 10 questions per document, 3 follow‑ups each

Cost per user session: $45
100 users: $4,500/month

My actual use case

  • 4 test questions, 1 small PDF (10 pages), 2 conversation threads

Cost: $12.47 → projected $3,100/month at 1 K users.

The MCP Alternative: Same Features, 99 % Cost Reduction

What is MCP?

Model Context Protocol (MCP) is an open standard for connecting AI models to data sources and tools—think USB‑C for AI. As of December 9, 2025, it’s an official Linux Foundation project.

Founding members include Anthropic, OpenAI, Google, Microsoft, AWS, Cloudflare, Bloomberg, and Block.

Architecture Comparison

Traditional Assistants API flow

flowchart LR
    A[User] --> B[OpenAI API]
    B --> C[Thread Storage]
    B --> D[Vector Store]
    B --> E[GPT‑4]
    E --> F[Response]
    style C fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#f9f,stroke:#333,stroke-width:2px

Metered components: vector store storage ($0.10 / GB / day), retrieval augmentation, and token usage (including re-sent thread history).

MCP flow

flowchart LR
    A[User] --> B[MCP Client]
    B --> C[Your MCP Server]
    C --> D[Cloudflare Workers]
    D --> E[Any Model]
    E --> F[Response]

You control storage and retrieval; Cloudflare Workers provide 10 M free requests/month.

Key Architectural Differences

  1. Client‑side memory – Conversation state is stored on the client, eliminating daily storage fees (see the sketch after this list).

  2. Multi‑model support – One MCP server can route requests to any model:

    // Switch models per request
    const response = await mcp.callTool("search_documents", {
      query: userQuery,
      model: "groq/llama-3.3-70b-versatile" // Free tier
    });
  3. Edge deployment on Cloudflare Workers – Deploy globally in minutes with no cold starts:

    export default {
      async fetch(request, env) {
        const mcp = new MCPServer(env);
        return mcp.handle(request);
      }
    };
  4. Complete cost control – You decide chunk limits, caching, and model pricing before sending a request:

    const searchConfig = {
      maxChunks: 3,
      chunkSize: 500,
      cacheStrategy: "lru",
      model: "groq-free"
    };
    
    const estimatedCost = calculateTokens(chunks) * modelPrice;
    if (estimatedCost > threshold) {
      // fallback to cheaper model or reduce chunks
    }
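
To make point 1 concrete, here is a minimal sketch of client-side memory. The ConversationMemory class and the mcp stub are illustrative, not part of the MCP SDK: the point is that the client owns the history and decides how much of it to send, so nothing is billed per GB per day.

// Client-side conversation memory (illustrative sketch, not an MCP SDK API)
interface Turn {
  role: "user" | "assistant";
  content: string;
}

class ConversationMemory {
  private turns: Turn[] = [];

  add(turn: Turn) {
    this.turns.push(turn);
  }

  // Only the most recent turns travel with each request
  recent(maxTurns = 4): Turn[] {
    return this.turns.slice(-maxTurns);
  }
}

// Stand-in for the MCP client used in the earlier snippets (hypothetical shape)
const mcp = {
  async callTool(name: string, args: Record<string, unknown>): Promise<string> {
    return `result of ${name}`; // placeholder implementation
  },
};

const memory = new ConversationMemory();
memory.add({ role: "user", content: "Which documents mention pricing?" });

const response = await mcp.callTool("search_documents", {
  query: "pricing",
  history: memory.recent(), // trimmed context, chosen by the client
});
memory.add({ role: "assistant", content: response });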

My MCP Implementation

// MCP Server on Cloudflare Workers
import { MCPServer } from "@modelcontextprotocol/sdk";

interface MCPTools {
  search_documents: (query: string, maxChunks?: number) => Promise<string>;
  analyze_pdf: (fileId: string) => Promise<string>;
  summarize_conversation: () => Promise<string>;
}

// Cost breakdown for the same 4 questions:
const costs = {
  workersAI_embeddings: 0.011 / 1000, // ≈ $0.011 per 1 K tokens (example rate)
  vectorize_storage: 0,               // Included in free tier
  // ...additional cost items as needed
};

Using MCP, the same four‑question workflow costs roughly 99 % less than the $12.47 spent with the Assistants API, demonstrating how an open protocol can dramatically reduce the cost of AI‑driven applications.
