Caching Strategies for LLM Systems: Exact-Match & Semantic Caching
Source: Dev.to

LLM calls are expensive in latency, tokens, and compute. Caching is one of the most effective levers to reduce cost and speed up responses. This post explains two foundational caching techniques you can implement today: Exact‑match (key‑value) caching and Semantic (embedding) caching. We cover how each works, typical implementations, pros/cons, and common pitfalls.
Why caching matters for LLM systems
Every LLM call carries three primary costs:
- Network latency – round‑trip time to the API or inference cluster.
- Token cost – many APIs charge per input + output tokens.
- Compute overhead – CPU/GPU time spent running the model.
In production applications many queries repeat (exactly or semantically). A cache allows the system to return prior results without re‑running the model, producing immediate wins in latency, throughput, and cost.
Key benefits
- Lower response time for end users.
- Reduced API bills and compute consumption.
- Higher throughput and better user experience at scale.
A thoughtful caching layer is often one of the highest‑ROI engineering efforts for LLM products.
Exact‑match (Key‑Value) caching
How it works
Exact‑match caching stores an LLM response under a deterministic key derived from the prompt (and any contextual state). When the same key is seen again, the cache returns the stored response.
Input prompt → Normalization → Hash/key → Lookup in KV store → Return stored response
Implementation notes
- Normalization (optional but recommended): trim whitespace, canonicalize newlines, remove ephemeral metadata, and ensure consistent parameter ordering.
- Key generation: use a stable hashing function (e.g., SHA‑256) over the normalized prompt plus any relevant metadata (system prompt, temperature, model name, conversation ID, schema version).
- Storage: a simple in‑memory dict for prototypes; Redis/KeyDB for production; or a persistent object store for large responses.
- Validation: store metadata with the response — model version, temperature, timestamp, source prompt — so you can safely decide whether a cached result is still valid or should be invalidated.
Simple Python example (conceptual)
import hashlib
import json

def make_key(
    prompt: str,
    system_prompt: str = "",
    model: str = "gpt-x",
    schema_version: str = "v1",
) -> str:
    # Normalize whitespace
    normalized = "\n".join(line.strip() for line in prompt.strip().splitlines())
    payload = json.dumps(
        {
            "system": system_prompt,
            "prompt": normalized,
            "model": model,
            "schema": schema_version,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# Example usage:
# key = make_key(user_prompt, system_prompt, model_name)
# if key in kv_store:
#     return kv_store[key]
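Building on make_key, the get‑or‑compute pattern is what ties the cache into the request path. The sketch below uses a plain in‑memory dict as the KV store and a hypothetical call_llm helper standing in for your LLM client; with Redis, only the get/set calls change.

# Get-or-compute wrapper around the exact-match cache.
# kv_store is a plain dict here; call_llm is a hypothetical stand-in
# for your actual LLM client.
kv_store = {}

def cached_completion(prompt: str, system_prompt: str = "", model: str = "gpt-x") -> str:
    key = make_key(prompt, system_prompt, model)
    cached = kv_store.get(key)
    if cached is not None:
        return cached                       # exact hit: no LLM call at all
    response = call_llm(prompt, system_prompt=system_prompt, model=model)
    kv_store[key] = response                # store for future identical prompts
    return response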
When to use exact caching
- Deterministic workflows (e.g., agent step outputs).
- Repeated system prompts and templates.
- Situations where correctness requires exact reuse (no hallucination risk from mismatched context).
Advantages: simple, deterministic, zero false‑positive risk.
Limitations: low hit rate for free‑form natural language; brittle to minor prompt changes.
Semantic caching
How it works
Semantic caching stores an embedding for each prompt together with the response. For a new prompt, compute its embedding, perform a nearest‑neighbor search among cached vectors, and reuse the cached response if similarity exceeds a threshold.
Prompt → Embedding → Similarity search in vector store → if max_sim ≥ threshold → reuse cached response; else → call LLM and insert the new (embedding, response) pair
Implementation notes
- Embeddings: choose a consistent embedding model. Store the normalized prompt text, the embedding vector, response, and metadata (model, generation parameters, timestamp, schema version).
- Vector store: FAISS, Milvus, Pinecone, Weaviate, or Redis Vector are common options depending on scale and latency needs.
- Similarity metric: cosine similarity is standard for text embeddings. Use the same metric in indexing and querying.
- Thresholding: set a threshold that balances reuse vs. safety. Typical cosine thresholds vary by embedding model — tune on your dataset (often starting conservatively around 0.85–0.90).
Conceptual example (pseudo‑Python)
# Compute embedding for new prompt
q_vec = embed(prompt)
# Nearest‑neighbor search → returns (id, sim_score)
nearest_id, sim = vector_store.search(q_vec, k=1)
if sim >= SIM_THRESHOLD:
    response = cache_lookup(nearest_id)
else:
    response = call_llm(prompt)
    store_embedding_and_response(q_vec, prompt, response)
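To make that flow concrete, here is a small self‑contained sketch that replaces the vector store with a brute‑force cosine‑similarity scan over NumPy arrays; embed() and call_llm() are hypothetical stand‑ins for your embedding model and LLM client, and the 0.88 threshold is only a starting point to tune.

import numpy as np

SIM_THRESHOLD = 0.88               # illustrative; tune on your own data

_cached_vecs = []                  # embeddings of cached prompts
_cached_responses = []             # responses aligned with _cached_vecs

def _cosine(a, b) -> float:
    a, b = np.asarray(a, dtype=np.float32), np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_cached_completion(prompt: str) -> str:
    q_vec = embed(prompt)                        # hypothetical embedding call
    if _cached_vecs:
        sims = [_cosine(q_vec, v) for v in _cached_vecs]
        best = int(np.argmax(sims))
        if sims[best] >= SIM_THRESHOLD:
            return _cached_responses[best]       # semantic hit: reuse answer
    response = call_llm(prompt)                  # hypothetical LLM call
    _cached_vecs.append(q_vec)
    _cached_responses.append(response)
    return response

A real deployment would swap the linear scan for FAISS or one of the other vector stores listed above; the thresholding logic stays the same.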
Tuning similarity and safety
- Calibration: evaluate the similarity threshold on a held‑out set of paraphrases and unrelated prompts to estimate false‑positive reuse (see the sketch after this list).
- Hybrid checks: for high‑risk outputs, combine semantic match with lightweight heuristics (e.g., entity overlap, output‑shape checks) or a fast reranker before returning cached content.
- Metadata gating: ensure model version, schema version, and other relevant parameters match before reusing a cached response.
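For the calibration step, a rough sketch: given a labeled set of prompt pairs (paraphrase or unrelated), measure how often unrelated pairs would still clear a candidate threshold. The labeled_pairs dataset is assumed, and _cosine/embed come from the sketch above.

def false_positive_rate(labeled_pairs, threshold: float) -> float:
    # labeled_pairs: iterable of (prompt_a, prompt_b, is_paraphrase) tuples.
    unrelated = [(a, b) for a, b, is_para in labeled_pairs if not is_para]
    if not unrelated:
        return 0.0
    would_reuse = sum(
        1 for a, b in unrelated
        if _cosine(embed(a), embed(b)) >= threshold
    )
    return would_reuse / len(unrelated)

# Sweep candidate thresholds and pick the lowest one with an acceptable FP rate:
# for t in (0.80, 0.85, 0.90, 0.95):
#     print(t, false_positive_rate(labeled_pairs, t))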
Advantages
- Handles paraphrases
- Higher effective cache‑hit rate for conversational queries
Limitations
- Requires embeddings, vector storage, and careful tuning to avoid incorrect reuse
Choosing Between Exact‑Match and Semantic Caching
- Exact‑match caching – use when correctness and determinism matter and prompts are highly templated.
- Semantic caching – use when queries are natural language, paraphrases are common, and some approximation is acceptable in exchange for higher hit rates.
Hybrid approach
An effective production design usually combines both (a sketch follows the list):
- Try exact‑match first.
- If it misses, fall back to semantic search.
- Store both kinds of keys and de‑duplicate on insertion.
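Putting the two layers together, a minimal sketch (reusing make_key, kv_store, and semantic_cached_completion from the earlier sketches) might look like this:

def hybrid_cached_completion(prompt: str, system_prompt: str = "", model: str = "gpt-x") -> str:
    # 1. Exact-match first: cheap hash lookup, no false-positive risk.
    key = make_key(prompt, system_prompt, model)
    cached = kv_store.get(key)
    if cached is not None:
        return cached

    # 2. Miss: fall back to the semantic layer (which itself calls the LLM on a miss).
    response = semantic_cached_completion(prompt)

    # 3. De-duplicate on insertion: also record the exact key so the next
    #    identical prompt is served from the cheaper exact path.
    kv_store[key] = response
    return response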
Metrics, Monitoring, and Operational Concerns
Key metrics to track
- Cache hit rate (exact / semantic)
- End‑to‑end latency for cache hits vs. misses
- Cost saved (tokens / compute avoided)
- False‑reuse incidents (semantic false positives) and user impact
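A lightweight way to start tracking the hit‑rate metrics above is a plain counter that the cache path increments; the sketch below is illustrative and not tied to any particular monitoring stack.

from collections import Counter

# Increment stats["exact_hit"], stats["semantic_hit"], or stats["miss"] in the cache path.
stats = Counter()

def cache_report(stats: Counter) -> dict:
    total = sum(stats.values()) or 1
    hits = stats["exact_hit"] + stats["semantic_hit"]
    return {
        "hit_rate": hits / total,
        "exact_hit_rate": stats["exact_hit"] / total,
        "semantic_hit_rate": stats["semantic_hit"] / total,
        "miss_rate": stats["miss"] / total,
    }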
Operational concerns
- Eviction policy & TTL – balance storage costs and freshness (see the sketch after this list).
- Model upgrades – invalidate or tag cache entries produced by older model versions (or bump schema version).
- Privacy & sensitivity – avoid caching PII or sensitive outputs unless encrypted and access‑controlled.
- Auditability – log when responses were served from cache and the matched key/score.
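As a sketch of the first two points, assuming redis-py; the key layout and the 24‑hour TTL are illustrative choices, not recommendations from this post.

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
CACHE_TTL_SECONDS = 24 * 60 * 60      # illustrative freshness window

def put_cached(key: str, response: str) -> None:
    # SETEX stores the value with an expiry, so stale entries age out on their own.
    r.setex(key, CACHE_TTL_SECONDS, response)

def get_cached(key: str):
    return r.get(key)                 # None if missing or expired

# Model upgrades: bumping schema_version (or model) in make_key changes every key,
# so entries produced by an older model version simply stop being hit.
# key = make_key(user_prompt, system_prompt, model="new-model", schema_version="v2")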
Implementation & Code
Want to see working examples? Check out the implementation with code:
Repository: VaibhavAhluwalia / llm-caching-systems
Practical implementations and experiments for building fast, scalable, and cost‑efficient Large Language Model (LLM) applications using caching techniques.
The repository includes:
- Interactive notebooks demonstrating both caching strategies
- A requirements.txt file for easy setup
Conclusion and What’s Next (Part 2)
Exact‑match and semantic caching are foundational. Together they allow LLM systems to be faster and cheaper while retaining the benefits of large models.
In Part 2 of this series we’ll cover other techniques.
What caching strategies have worked best in your LLM projects? Share your experiences in the comments below!
Connect with Me
- GitHub: @VaibhavAhluwalia
- LinkedIn: Vaibhav Ahluwalia
