Your LLM Wrapper is Leaking Money: The Architecture of Semantic Caching
Source: Dev.to
The Problem: Wallet‑Burning LLM Calls
In the rush to deploy GenAI features, most engineering teams hit three common hurdles: the 504 Gateway Timeout, the Hallucination Loop, and, the most painful one, the Wallet Burner.
Production logs show startups spending thousands of dollars per month on OpenAI bills because they treat LLM APIs like standard REST endpoints and implement caching incorrectly.
When working with Large Language Models, simple key‑value caching is ineffective. You need Semantic Caching to stop paying for the same query twice, even when users phrase it differently.
The Bad Pattern: Exact‑Match Caching
Most backend engineers start by wrapping the API call in a simple Redis check that hashes the user’s prompt and returns a cached response if the exact string matches.
# The Naive Approach
def get_ai_response(user_query, mock_llm, cache):
    # PROBLEM: Only checks exact string matches.
    if user_query in cache:
        return cache[user_query]
    # Cache miss: pay for a fresh completion and remember it verbatim.
    response = mock_llm.generate(user_query)
    cache[user_query] = response
    return response
Why This Fails in Production
Human language is inconsistent:
- User A: “What is your pricing?”
- User B: “How much does it cost?”
- User C: “Price list please”
A key‑value store treats these as three distinct keys, resulting in three separate LLM calls. In high‑traffic apps, this redundancy can account for 40‑60 % of total token usage.
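To make that concrete, here is a minimal sketch (MockLLM is a hypothetical stub standing in for a real, billable LLM client) showing that only an exact repeat hits the naive cache:

# Minimal sketch – MockLLM is a hypothetical stand-in for a paid LLM client.
class MockLLM:
    def __init__(self):
        self.calls = 0

    def generate(self, prompt):
        self.calls += 1  # every call here represents a billable API request
        return f"Answer to: {prompt}"

llm = MockLLM()
cache = {}

get_ai_response("What is your pricing?", llm, cache)    # miss -> billable call
get_ai_response("What is your pricing?", llm, cache)    # exact repeat -> cache hit
get_ai_response("How much does it cost?", llm, cache)   # same intent -> miss
get_ai_response("Price list please", llm, cache)        # same intent -> miss

print(llm.calls)  # 3 billable calls for what is really one question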
Example Cost Breakdown
| Requests / day | Without semantic caching | With 50 % cache hit rate |
|---|---|---|
| 1,000 | $150 / month* | $75 / month |
*Assumes 1,000 input tokens per request at $0.005 per 1K tokens over a 30‑day month (≈ $5/day).
Embedding cost overhead: ≈ $2 / month (text-embedding-3-small at $0.00002 per 1K tokens).
Net savings: $73 / month → $876 / year.
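For transparency, the arithmetic behind those figures can be reproduced in a few lines (a back-of-envelope sketch assuming a 30-day month and the prices quoted above):

# Back-of-envelope check of the table above (assumes a 30-day month).
REQUESTS_PER_DAY = 1_000
TOKENS_PER_REQUEST = 1_000          # input tokens, per the footnote
LLM_PRICE_PER_1K = 0.005            # $ per 1K input tokens
EMBED_PRICE_PER_1K = 0.00002        # text-embedding-3-small, $ per 1K tokens
DAYS = 30

monthly_tokens = REQUESTS_PER_DAY * TOKENS_PER_REQUEST * DAYS
no_cache = monthly_tokens / 1_000 * LLM_PRICE_PER_1K      # $150.00 / month
with_cache = no_cache * 0.5                               # $75.00 at a 50% hit rate
embeddings = monthly_tokens / 1_000 * EMBED_PRICE_PER_1K  # ~$0.60; the ≈ $2 above is a conservative round-up
net_savings = no_cache - with_cache - embeddings          # ~$74 / month

print(f"No cache: ${no_cache:.2f}  With cache: ${with_cache:.2f}  "
      f"Embeddings: ${embeddings:.2f}  Net savings: ~${net_savings:.2f}/month")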
The Good Pattern: Semantic Caching
To move from lexical equality to semantic similarity, we use vector embeddings.
Architecture
- Embed – Convert the incoming user query into a vector using a cheap model (e.g., text-embedding-3-small).
- Search – Compare this vector against stored vectors from previous queries.
- Threshold – Compute Cosine Similarity; if the score exceeds a chosen threshold (e.g., 0.9), return the cached response.
Implementation
import math

from openai import OpenAI  # the caller constructs the client, e.g. client = OpenAI()

# 1. Cosine similarity
def cosine_similarity(v1, v2):
    dot_product = sum(a * b for a, b in zip(v1, v2))
    norm_a = math.sqrt(sum(a * a for a in v1))
    norm_b = math.sqrt(sum(b * b for b in v2))
    return dot_product / (norm_a * norm_b)

def get_ai_response_semantic(user_query, llm, cache, client):
    # 2. Embed the current query
    embed_resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=user_query
    )
    query_embedding = embed_resp.data[0].embedding

    # 3. Threshold (tune as needed)
    threshold = 0.9
    best_sim = -1
    best_response = None

    # 4. Linear search (for learning; replace with ANN DB in prod)
    for cached_query, data in cache.items():
        sim = cosine_similarity(query_embedding, data["embedding"])
        if sim > best_sim:
            best_sim = sim
            best_response = data["response"]

    # 5. Decision
    if best_sim > threshold:
        print(f"Cache Hit! Similarity: {best_sim:.4f}")
        return best_response

    # 6. Cache miss – pay the token tax
    response = llm.generate(user_query)
    # Store both response and embedding for future matches
    cache[user_query] = {
        "response": response,
        "embedding": query_embedding
    }
    return response
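Wiring it together could look like the following sketch (MockLLM is a hypothetical stub; a valid OPENAI_API_KEY is assumed for the embedding calls):

# Usage sketch – assumes OPENAI_API_KEY is set; MockLLM is a hypothetical stub.
class MockLLM:
    def generate(self, prompt):
        return f"(expensive LLM answer for: {prompt})"

client = OpenAI()   # real client, used only for embeddings here
llm = MockLLM()
cache = {}          # in-memory store: {query: {"response": ..., "embedding": ...}}

print(get_ai_response_semantic("What is your pricing?", llm, cache, client))   # miss -> LLM call
print(get_ai_response_semantic("How much does it cost?", llm, cache, client))  # served from cache if similarity > 0.9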
Note: The linear search above is only for illustration. When you have more than ~100 cached queries, switch to a vector database with Approximate Nearest Neighbor (ANN) indexing (e.g., pgvector, Pinecone, Weaviate, Qdrant).
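As one concrete direction, the same nearest-neighbour lookup against Postgres with pgvector might look roughly like the sketch below. The semantic_cache table, its schema, and the helper function name are illustrative assumptions, not a fixed API; pgvector's <=> operator returns cosine distance, so similarity is 1 minus that distance.

# Rough pgvector sketch. Assumes psycopg 3 and a table created roughly as:
#   CREATE TABLE semantic_cache (query text, response text, embedding vector(1536));
#   CREATE INDEX ON semantic_cache USING hnsw (embedding vector_cosine_ops);
import psycopg

def lookup_semantic_cache(conn, query_embedding, threshold=0.9):
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"  # pgvector text format
    row = conn.execute(
        """
        SELECT response, 1 - (embedding <=> %s::vector) AS similarity
        FROM semantic_cache
        ORDER BY embedding <=> %s::vector
        LIMIT 1
        """,
        (vec, vec),
    ).fetchone()
    if row is not None and row[1] >= threshold:
        return row[0]   # cache hit
    return None         # miss – fall through to the LLM call and insert the new row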
The Danger Zone: False Positives
Setting the similarity threshold too low (e.g., 0.7) can cause False Positive Cache Hits.
- Query: “Can I delete my account?”
- Cached: “Can I delete my post?”
- Similarity: 0.85
Returning instructions for deleting a post when the user wants to delete their account creates a serious UX issue.
Production Tip: For sensitive actions, add a re‑ranker step. After finding a potential cache hit, run a quick cross‑encoder model to verify that the two queries truly entail the same output.
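One way to add that verification step is with an off-the-shelf cross-encoder, for example via the sentence-transformers library. Below is a minimal sketch; the model name and cut-off score are illustrative choices, not recommendations from the original post:

# Re-ranker sketch – model choice and cut-off are illustrative assumptions.
from sentence_transformers import CrossEncoder

# An STS cross-encoder reads both sentences together and scores semantic equivalence.
reranker = CrossEncoder("cross-encoder/stsb-roberta-base")

def confirm_cache_hit(user_query, cached_query, min_score=0.8):
    score = reranker.predict([(user_query, cached_query)])[0]
    return score >= min_score

# Serve the cached answer only when BOTH the embedding similarity and the
# cross-encoder score clear their thresholds; otherwise call the LLM.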
Summary
- Exact‑Match Caching: Easy to implement, but expensive at scale.
- Semantic Caching: Requires more engineering effort, but can cut API bills by ~40 %.
Building profitable AI applications hinges on solid systems engineering, not just model selection.
Where to Practice
Understanding semantic caching conceptually is one thing; debugging it under production constraints—balancing threshold tuning, false‑positive rates, and embedding costs—is what separates theory from mastery.
TENTROPY provides a hands‑on challenge that simulates a “Wallet Burner” scenario. You’ll work with a live codebase, see a burning API bill, and implement the vector logic to stop the bleed.