PromptCache Part I: Stop Paying Twice for the Same LLM Answer
Source: Dev.to

The Invisible Cost Leak in LLM Systems
If you’re running an LLM in production, you are almost certainly paying for this:
- “How do I reset my password?”
- “I forgot my password, what do I do?”
- “Steps to reset account password?”
- “Help me change password”
Different strings, same intent, same answer, different billable request.
Traditional caching doesn’t help because exact‑match fails:
"How do I reset my password?" != "Steps to reset account password?"
The meaning hasn’t changed – that’s where semantic caching comes in.
The Theory: Why This Works
Embedding models convert text into vectors (embeddings). Two sentences with similar meaning produce vectors that are close together in high‑dimensional space.
Example (simplified):
"Reset my password"
↓
[0.12, -0.87, 0.44, ...]
"How do I change my password?"
↓
[0.11, -0.89, 0.41, ...]
Because the vectors are very close, we can ask:
“Have I seen something semantically similar before?”
If the similarity is high enough, we reuse the cached answer – that’s semantic caching.
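The "close together" test is usually cosine similarity. A minimal sketch of that comparison, using made-up 3‑dimensional vectors for illustration (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (values invented for illustration).
reset_pw  = [0.12, -0.87, 0.44]
change_pw = [0.11, -0.89, 0.41]
unrelated = [0.90, 0.30, -0.10]

print(cosine_similarity(reset_pw, change_pw))   # close to 1.0 -> same intent
print(cosine_similarity(reset_pw, unrelated))   # much lower -> different intent
```

A threshold on this score (e.g. 0.9+) is what separates "reuse the cached answer" from "this is a new question".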
How It Works in Practice
When a request comes in:
User Prompt
↓
Embedding
↓
Vector search in Redis
↓
High similarity?
↓
Yes → Return cached response
No → Call LLM and store result
You’re adding a semantic memoization layer in front of your LLM.
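The flow above can be sketched in a few lines. This is a hypothetical in-memory version with brute-force search; all names are illustrative, and a production setup would swap in a real embedding model and a vector store such as Redis:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SemanticMemo:
    """Illustrative semantic memoization layer (not the promptcache API)."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed          # callable: text -> vector
        self.threshold = threshold  # minimum similarity for a cache hit
        self.entries = []           # list of (vector, response)

    def get_or_set(self, prompt, llm_call):
        vec = self.embed(prompt)
        # Vector search: find the most similar cached prompt.
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best and cosine(vec, best[0]) >= self.threshold:
            return best[1]                       # high similarity -> reuse
        response = llm_call(prompt)              # miss -> call the LLM...
        self.entries.append((vec, response))     # ...and store the result
        return response
```

The second semantically similar prompt never reaches the LLM; it is answered from the cache.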
Real Results
In a support‑heavy workload with repetitive queries:
- ~60% cache hit rate
- ~50% reduction in token usage
- ~40% lower API spend
Results vary by workload density and repetition patterns, but in structured environments the impact is immediate.
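Back-of-the-envelope arithmetic shows why a high hit rate translates into large savings. The prices below are invented for illustration, not benchmarks; real deployments see lower net savings from storage, lookup overhead, and misses on long prompts:

```python
# Illustrative cost model: a cache hit replaces a full LLM call with a
# much cheaper embedding lookup (paid on every request either way).
hit_rate = 0.60          # fraction of requests answered from cache
llm_cost = 0.0020        # $ per LLM call (example figure)
embed_cost = 0.0001      # $ per embedding call (example figure)

requests = 100_000
baseline = requests * llm_cost
with_cache = requests * embed_cost + requests * (1 - hit_rate) * llm_cost

savings = 1 - with_cache / baseline
print(f"API spend reduction: {savings:.0%}")
```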
Example Implementation
A simplified example using Redis vector search:
```python
from promptcache import SemanticCache
from promptcache.backends.redis_vector import RedisVectorBackend
from promptcache.embedders.openai import OpenAIEmbedder
from promptcache.types import CacheMeta

embedder = OpenAIEmbedder(model="text-embedding-3-small")

backend = RedisVectorBackend(
    url="redis://localhost:6379/0",
    dim=embedder.dim,
)

cache = SemanticCache(
    backend=backend,
    embedder=embedder,
    namespace="support-bot",
    threshold=0.92,  # minimum similarity score for a cache hit
)

meta = CacheMeta(
    model="gpt-4.1-mini",
    system_prompt="You are a helpful support assistant.",
)

# my_llm_call is your own function that calls the model; it is only
# invoked on a cache miss.
result = cache.get_or_set(
    prompt="How can I change my password?",
    llm_call=my_llm_call,
    extract_text=lambda r: r.output_text,
    meta=meta,
)

print(result.cache_hit)
```
That’s all – no orchestration framework required.
GitHub:
PyPI:
Install
```shell
pip install promptcache-ai
```
When This Works Best
Semantic caching shines when:
- Prompts are repetitive
- Temperature is low
- Answers are stable
- Volume is high
It’s less useful for:
- Highly personalized prompts
- Creative writing
- Rapidly changing context
In those cases, novelty dominates repetition, and caching provides diminishing returns.
The Bigger Insight
Most LLM systems are fundamentally stateless; they recompute answers even when nothing meaningful has changed. Semantic caching introduces selective memory, reusing intelligence only when it is economically justified.
Instead of endlessly tweaking prompts, sometimes the smarter move is optimizing infrastructure. If you’re building LLM systems in production, semantic caching is one of the highest‑leverage optimizations you can add.
Intelligence is expensive.
Memory is cheap.
Use both wisely.