PromptCache Part I: Stop Paying Twice for the Same LLM Answer
Source: Dev.to

The Invisible Cost Leak in LLM Systems
If you’re running an LLM in production, you are almost certainly paying for this:
- “How do I reset my password?”
- “I forgot my password, what do I do?”
- “Steps to reset account password?”
- “Help me change password”
Different strings, same intent, same answer, different billable request.
Traditional caching doesn’t help because exact‑match fails:
"How do I reset my password?" != "Steps to reset account password?"
The meaning hasn’t changed – that’s where semantic caching comes in.
The Theory: Why This Works
Embedding models convert text into vectors (embeddings). Two sentences with similar meaning produce vectors that are close together in high‑dimensional space.
Example (simplified):
"Reset my password"
↓
[0.12, -0.87, 0.44, ...]
"How do I change my password?"
↓
[0.11, -0.89, 0.41, ...]
Because the vectors are very close, we can ask:
“Have I seen something semantically similar before?”
If the similarity is high enough, we reuse the cached answer – that’s semantic caching.
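The "close together" test is usually cosine similarity. A minimal sketch of that comparison, using made-up 3‑dimensional vectors for illustration (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (values invented for illustration).
reset_pw  = [0.12, -0.87, 0.44]
change_pw = [0.11, -0.89, 0.41]
unrelated = [0.90, 0.30, -0.10]

print(cosine_similarity(reset_pw, change_pw))   # close to 1.0 -> same intent
print(cosine_similarity(reset_pw, unrelated))   # much lower -> different intent
```

A threshold on this score (e.g. 0.9+) is what separates "reuse the cached answer" from "this is a new question".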
How It Works in Practice
When a request comes in:
User Prompt
↓
Embedding
↓
Vector search in Redis
↓
High similarity?
↓
Yes → Return cached response
No → Call LLM and store result
You’re adding a semantic memoization layer in front of your LLM.
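The flow above can be sketched in a few lines. This is a hypothetical in-memory version with brute-force search; all names are illustrative, and a production setup would swap in a real embedding model and a vector store such as Redis:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SemanticMemo:
    """Illustrative semantic memoization layer (not the promptcache API)."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed          # callable: text -> vector
        self.threshold = threshold  # minimum similarity for a cache hit
        self.entries = []           # list of (vector, response)

    def get_or_set(self, prompt, llm_call):
        vec = self.embed(prompt)
        # Vector search: find the most similar cached prompt.
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best and cosine(vec, best[0]) >= self.threshold:
            return best[1]                       # high similarity -> reuse
        response = llm_call(prompt)              # miss -> call the LLM...
        self.entries.append((vec, response))     # ...and store the result
        return response
```

The second semantically similar prompt never reaches the LLM; it is answered from the cache.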
Real Results
In a support‑heavy workload with repetitive queries:
- ~60% cache hit rate
- ~50% reduction in token usage
- ~40% lower API spend
Results vary by workload density and repetition patterns, but in structured environments the impact is immediate.
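Back-of-the-envelope arithmetic shows why a high hit rate translates into large savings. The prices below are invented for illustration, not benchmarks; real deployments see lower net savings from storage, lookup overhead, and misses on long prompts:

```python
# Illustrative cost model: a cache hit replaces a full LLM call with a
# much cheaper embedding lookup (paid on every request either way).
hit_rate = 0.60          # fraction of requests answered from cache
llm_cost = 0.0020        # $ per LLM call (example figure)
embed_cost = 0.0001      # $ per embedding call (example figure)

requests = 100_000
baseline = requests * llm_cost
with_cache = requests * embed_cost + requests * (1 - hit_rate) * llm_cost

savings = 1 - with_cache / baseline
print(f"API spend reduction: {savings:.0%}")
```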
Example Implementation
A simplified example using Redis vector search:
```python
from promptcache import SemanticCache
from promptcache.backends.redis_vector import RedisVectorBackend
from promptcache.embedders.openai import OpenAIEmbedder
from promptcache.types import CacheMeta

embedder = OpenAIEmbedder(model="text-embedding-3-small")

backend = RedisVectorBackend(
    url="redis://localhost:6379/0",
    dim=embedder.dim,
)

cache = SemanticCache(
    backend=backend,
    embedder=embedder,
    namespace="support-bot",
    threshold=0.92,  # minimum similarity score for a cache hit
)

meta = CacheMeta(
    model="gpt-4.1-mini",
    system_prompt="You are a helpful support assistant.",
)

# my_llm_call is your own function that calls the model; it is only
# invoked on a cache miss.
result = cache.get_or_set(
    prompt="How can I change my password?",
    llm_call=my_llm_call,
    extract_text=lambda r: r.output_text,
    meta=meta,
)

print(result.cache_hit)
```
That’s all – no orchestration framework required.
GitHub:
PyPI:
Install
```shell
pip install promptcache-ai
```
When This Works Best
Semantic caching shines when:
- Prompts are repetitive
- Temperature is low
- Answers are stable
- Volume is high
It’s less useful for:
- Highly personalized prompts
- Creative writing
- Rapidly changing context
In those cases, novelty dominates repetition, and caching provides diminishing returns.
The Bigger Insight
Most LLM systems are fundamentally stateless; they recompute answers even when nothing meaningful has changed. Semantic caching introduces selective memory, reusing intelligence only when it is economically justified.
Instead of endlessly tweaking prompts, sometimes the smarter move is optimizing infrastructure. If you’re building LLM systems in production, semantic caching is one of the highest‑leverage optimizations you can add.
Intelligence is expensive.
Memory is cheap.
Use both wisely.