Your LLM Wrapper is Leaking Money: The Architecture of Semantic Caching
Source: Dev.to
The Problem: Wallet‑Burning LLM Calls
In the rush to deploy GenAI features, most engineering teams hit three common hurdles: the 504 Gateway Timeout, the Hallucination Loop, and, the most painful one, the Wallet Burner.
Production logs show startups spending thousands of dollars per month on OpenAI bills because they treat LLM APIs like standard REST endpoints and implement caching incorrectly.
When working with Large Language Models, simple key‑value caching is ineffective. You need Semantic Caching to stop paying for the same query twice, even when users phrase it differently.
The Bad Pattern: Exact‑Match Caching
Most backend engineers start by wrapping the API call in a simple Redis check that hashes the user’s prompt and returns a cached response if the exact string matches.
# The Naive Approach
def get_ai_response(user_query, mock_llm, cache):
    # PROBLEM: Only checks exact string matches.
    if user_query in cache:
        return cache[user_query]
    # Cache miss: pay for a fresh completion and remember it verbatim.
    response = mock_llm.generate(user_query)
    cache[user_query] = response
    return response
Why This Fails in Production
Human language is inconsistent:
- User A: “What is your pricing?”
- User B: “How much does it cost?”
- User C: “Price list please”
A key‑value store treats these as three distinct keys, resulting in three separate LLM calls. In high‑traffic apps, this redundancy can account for 40‑60 % of total token usage.
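To make that concrete, here is a minimal sketch (MockLLM is a hypothetical stub standing in for a real, billable LLM client) showing that only an exact repeat hits the naive cache:

# Minimal sketch – MockLLM is a hypothetical stand-in for a paid LLM client.
class MockLLM:
    def __init__(self):
        self.calls = 0

    def generate(self, prompt):
        self.calls += 1  # every call here represents a billable API request
        return f"Answer to: {prompt}"

llm = MockLLM()
cache = {}

get_ai_response("What is your pricing?", llm, cache)    # miss -> billable call
get_ai_response("What is your pricing?", llm, cache)    # exact repeat -> cache hit
get_ai_response("How much does it cost?", llm, cache)   # same intent -> miss
get_ai_response("Price list please", llm, cache)        # same intent -> miss

print(llm.calls)  # 3 billable calls for what is really one question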
Example Cost Breakdown
| Requests / day | Without semantic caching | With 50 % cache hit rate |
|---|---|---|
| 1,000 | $150 / month* | $75 / month |
*Assumes 1,000 input tokens per request at $0.005 per 1K tokens over a 30‑day month (≈ $5/day).
Embedding cost overhead: ≈ $2 / month (text-embedding-3-small at $0.00002 per 1K tokens).
Net savings: $73 / month → $876 / year.
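For transparency, the arithmetic behind those figures can be reproduced in a few lines (a back-of-envelope sketch assuming a 30-day month and the prices quoted above):

# Back-of-envelope check of the table above (assumes a 30-day month).
REQUESTS_PER_DAY = 1_000
TOKENS_PER_REQUEST = 1_000          # input tokens, per the footnote
LLM_PRICE_PER_1K = 0.005            # $ per 1K input tokens
EMBED_PRICE_PER_1K = 0.00002        # text-embedding-3-small, $ per 1K tokens
DAYS = 30

monthly_tokens = REQUESTS_PER_DAY * TOKENS_PER_REQUEST * DAYS
no_cache = monthly_tokens / 1_000 * LLM_PRICE_PER_1K      # $150.00 / month
with_cache = no_cache * 0.5                               # $75.00 at a 50% hit rate
embeddings = monthly_tokens / 1_000 * EMBED_PRICE_PER_1K  # ~$0.60; the ≈ $2 above is a conservative round-up
net_savings = no_cache - with_cache - embeddings          # ~$74 / month

print(f"No cache: ${no_cache:.2f}  With cache: ${with_cache:.2f}  "
      f"Embeddings: ${embeddings:.2f}  Net savings: ~${net_savings:.2f}/month")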
The Good Pattern: Semantic Caching
To move from lexical equality to semantic similarity, we use vector embeddings.
Architecture
- Embed – Convert the incoming user query into a vector using a cheap model (e.g., text-embedding-3-small).
- Search – Compare this vector against stored vectors from previous queries.
- Threshold – Compute Cosine Similarity; if the score exceeds a chosen threshold (e.g., 0.9), return the cached response.
Implementation
import math

from openai import OpenAI  # the caller constructs the client, e.g. client = OpenAI()

# 1. Cosine similarity
def cosine_similarity(v1, v2):
    dot_product = sum(a * b for a, b in zip(v1, v2))
    norm_a = math.sqrt(sum(a * a for a in v1))
    norm_b = math.sqrt(sum(b * b for b in v2))
    return dot_product / (norm_a * norm_b)

def get_ai_response_semantic(user_query, llm, cache, client):
    # 2. Embed the current query
    embed_resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=user_query
    )
    query_embedding = embed_resp.data[0].embedding

    # 3. Threshold (tune as needed)
    threshold = 0.9
    best_sim = -1
    best_response = None

    # 4. Linear search (for learning; replace with ANN DB in prod)
    for cached_query, data in cache.items():
        sim = cosine_similarity(query_embedding, data["embedding"])
        if sim > best_sim:
            best_sim = sim
            best_response = data["response"]

    # 5. Decision
    if best_sim > threshold:
        print(f"Cache Hit! Similarity: {best_sim:.4f}")
        return best_response

    # 6. Cache miss – pay the token tax
    response = llm.generate(user_query)
    # Store both response and embedding for future matches
    cache[user_query] = {
        "response": response,
        "embedding": query_embedding
    }
    return response
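Wiring it together could look like the following sketch (MockLLM is a hypothetical stub; a valid OPENAI_API_KEY is assumed for the embedding calls):

# Usage sketch – assumes OPENAI_API_KEY is set; MockLLM is a hypothetical stub.
class MockLLM:
    def generate(self, prompt):
        return f"(expensive LLM answer for: {prompt})"

client = OpenAI()   # real client, used only for embeddings here
llm = MockLLM()
cache = {}          # in-memory store: {query: {"response": ..., "embedding": ...}}

print(get_ai_response_semantic("What is your pricing?", llm, cache, client))   # miss -> LLM call
print(get_ai_response_semantic("How much does it cost?", llm, cache, client))  # served from cache if similarity > 0.9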
Note: The linear search above is only for illustration. When you have more than ~100 cached queries, switch to a vector database with Approximate Nearest Neighbor (ANN) indexing (e.g., pgvector, Pinecone, Weaviate, Qdrant).
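As one concrete direction, the same nearest-neighbour lookup against Postgres with pgvector might look roughly like the sketch below. The semantic_cache table, its schema, and the helper function name are illustrative assumptions, not a fixed API; pgvector's <=> operator returns cosine distance, so similarity is 1 minus that distance.

# Rough pgvector sketch. Assumes psycopg 3 and a table created roughly as:
#   CREATE TABLE semantic_cache (query text, response text, embedding vector(1536));
#   CREATE INDEX ON semantic_cache USING hnsw (embedding vector_cosine_ops);
import psycopg

def lookup_semantic_cache(conn, query_embedding, threshold=0.9):
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"  # pgvector text format
    row = conn.execute(
        """
        SELECT response, 1 - (embedding <=> %s::vector) AS similarity
        FROM semantic_cache
        ORDER BY embedding <=> %s::vector
        LIMIT 1
        """,
        (vec, vec),
    ).fetchone()
    if row is not None and row[1] >= threshold:
        return row[0]   # cache hit
    return None         # miss – fall through to the LLM call and insert the new row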
The Danger Zone: False Positives
Setting the similarity threshold too low (e.g., 0.7) can cause False Positive Cache Hits.
- Query: “Can I delete my account?”
- Cached: “Can I delete my post?”
- Similarity: 0.85
Returning instructions for deleting a post when the user wants to delete their account creates a serious UX issue.
Production Tip: For sensitive actions, add a re‑ranker step. After finding a potential cache hit, run a quick cross‑encoder model to verify that the two queries truly entail the same output.
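One way to add that verification step is with an off-the-shelf cross-encoder, for example via the sentence-transformers library. Below is a minimal sketch; the model name and cut-off score are illustrative choices, not recommendations from the original post:

# Re-ranker sketch – model choice and cut-off are illustrative assumptions.
from sentence_transformers import CrossEncoder

# An STS cross-encoder reads both sentences together and scores semantic equivalence.
reranker = CrossEncoder("cross-encoder/stsb-roberta-base")

def confirm_cache_hit(user_query, cached_query, min_score=0.8):
    score = reranker.predict([(user_query, cached_query)])[0]
    return score >= min_score

# Serve the cached answer only when BOTH the embedding similarity and the
# cross-encoder score clear their thresholds; otherwise call the LLM.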
Summary
- Exact‑Match Caching: Easy to implement, but expensive at scale.
- Semantic Caching: Requires more engineering effort, but can cut API bills by ~40 %.
Building profitable AI applications hinges on solid systems engineering, not just model selection.
Where to Practice
Understanding semantic caching conceptually is one thing; debugging it under production constraints—balancing threshold tuning, false‑positive rates, and embedding costs—is what separates theory from mastery.
TENTROPY provides a hands‑on challenge that simulates a “Wallet Burner” scenario. You’ll work with a live codebase, see a burning API bill, and implement the vector logic to stop the bleed.