Caching Strategies for LLM Systems: Exact-Match & Semantic Caching
Source: Dev.to

LLM calls are expensive in latency, tokens, and compute. Caching is one of the most effective levers to reduce cost and speed up responses. This post explains two foundational caching techniques you can implement today: Exact‑match (key‑value) caching and Semantic (embedding) caching. We cover how each works, typical implementations, pros/cons, and common pitfalls.
Why caching matters for LLM systems
Every LLM call carries three primary costs:
- Network latency – round‑trip time to the API or inference cluster.
- Token cost – many APIs charge per input + output tokens.
- Compute overhead – CPU/GPU time spent running the model.
In production applications many queries repeat (exactly or semantically). A cache allows the system to return prior results without re‑running the model, producing immediate wins in latency, throughput, and cost.
Key benefits
- Lower response time for end users.
- Reduced API bills and compute consumption.
- Higher throughput and better user experience at scale.
A thoughtful caching layer is often one of the highest‑ROI engineering efforts for LLM products.
Exact‑match (Key‑Value) caching
How it works
Exact‑match caching stores an LLM response under a deterministic key derived from the prompt (and any contextual state). When the same key is seen again, the cache returns the stored response.
Input prompt → Normalization → Hash/key → Lookup in KV store → Return stored response
Implementation notes
- Normalization (optional but recommended): trim whitespace, canonicalize newlines, remove ephemeral metadata, and ensure consistent parameter ordering.
- Key generation: use a stable hashing function (e.g., SHA‑256) over the normalized prompt plus any relevant metadata (system prompt, temperature, model name, conversation ID, schema version).
- Storage: a simple in‑memory dict for prototypes; Redis/KeyDB for production; or a persistent object store for large responses.
- Validation: store metadata with the response — model version, temperature, timestamp, source prompt — so you can safely decide whether a cached result is still valid or should be invalidated.
Simple Python example (conceptual)
import hashlib
import json

def make_key(
    prompt: str,
    system_prompt: str = "",
    model: str = "gpt-x",
    schema_version: str = "v1",
) -> str:
    # Normalize whitespace
    normalized = "\n".join(line.strip() for line in prompt.strip().splitlines())
    payload = json.dumps(
        {
            "system": system_prompt,
            "prompt": normalized,
            "model": model,
            "schema": schema_version,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# Example usage:
# key = make_key(user_prompt, system_prompt, model_name)
# if key in kv_store:
#     return kv_store[key]
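Building on make_key, the get‑or‑compute pattern is what ties the cache into the request path. The sketch below uses a plain in‑memory dict as the KV store and a hypothetical call_llm helper standing in for your LLM client; with Redis, only the get/set calls change.

# Get-or-compute wrapper around the exact-match cache.
# kv_store is a plain dict here; call_llm is a hypothetical stand-in
# for your actual LLM client.
kv_store = {}

def cached_completion(prompt: str, system_prompt: str = "", model: str = "gpt-x") -> str:
    key = make_key(prompt, system_prompt, model)
    cached = kv_store.get(key)
    if cached is not None:
        return cached                       # exact hit: no LLM call at all
    response = call_llm(prompt, system_prompt=system_prompt, model=model)
    kv_store[key] = response                # store for future identical prompts
    return response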
When to use exact caching
- Deterministic workflows (e.g., agent step outputs).
- Repeated system prompts and templates.
- Situations where correctness requires exact reuse (no hallucination risk from mismatched context).
Advantages: simple, deterministic, zero false‑positive risk.
Limitations: low hit rate for free‑form natural language; brittle to minor prompt changes.
Semantic caching
How it works
Semantic caching stores an embedding for each prompt together with the response. For a new prompt, compute its embedding, perform a nearest‑neighbor search among cached vectors, and reuse the cached response if similarity exceeds a threshold.
Prompt → Embedding → Similarity search in vector store → if max_sim ≥ threshold → reuse cached response; else → call LLM and insert the new (embedding, response) pair
Implementation notes
- Embeddings: choose a consistent embedding model. Store the normalized prompt text, the embedding vector, response, and metadata (model, generation parameters, timestamp, schema version).
- Vector store: FAISS, Milvus, Pinecone, Weaviate, or Redis Vector are common options depending on scale and latency needs.
- Similarity metric: cosine similarity is standard for text embeddings. Use the same metric in indexing and querying.
- Thresholding: set a threshold that balances reuse vs. safety. Typical cosine thresholds vary by embedding model — tune on your dataset (often starting conservatively around 0.85–0.90).
Conceptual example (pseudo‑Python)
# Compute embedding for new prompt
q_vec = embed(prompt)
# Nearest‑neighbor search → returns (id, sim_score)
nearest_id, sim = vector_store.search(q_vec, k=1)
if sim >= SIM_THRESHOLD:
    response = cache_lookup(nearest_id)
else:
    response = call_llm(prompt)
    store_embedding_and_response(q_vec, prompt, response)
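To make that flow concrete, here is a small self‑contained sketch that replaces the vector store with a brute‑force cosine‑similarity scan over NumPy arrays; embed() and call_llm() are hypothetical stand‑ins for your embedding model and LLM client, and the 0.88 threshold is only a starting point to tune.

import numpy as np

SIM_THRESHOLD = 0.88               # illustrative; tune on your own data

_cached_vecs = []                  # embeddings of cached prompts
_cached_responses = []             # responses aligned with _cached_vecs

def _cosine(a, b) -> float:
    a, b = np.asarray(a, dtype=np.float32), np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_cached_completion(prompt: str) -> str:
    q_vec = embed(prompt)                        # hypothetical embedding call
    if _cached_vecs:
        sims = [_cosine(q_vec, v) for v in _cached_vecs]
        best = int(np.argmax(sims))
        if sims[best] >= SIM_THRESHOLD:
            return _cached_responses[best]       # semantic hit: reuse answer
    response = call_llm(prompt)                  # hypothetical LLM call
    _cached_vecs.append(q_vec)
    _cached_responses.append(response)
    return response

A real deployment would swap the linear scan for FAISS or one of the other vector stores listed above; the thresholding logic stays the same.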
Tuning similarity and safety
- Calibration: evaluate the similarity threshold on a held‑out set of paraphrases and unrelated prompts to estimate false‑positive reuse (see the sketch after this list).
- Hybrid checks: for high‑risk outputs, combine semantic match with lightweight heuristics (e.g., entity overlap, output‑shape checks) or a fast reranker before returning cached content.
- Metadata gating: ensure model version, schema version, and other relevant parameters match before reusing a cached response.
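For the calibration step, a rough sketch: given a labeled set of prompt pairs (paraphrase or unrelated), measure how often unrelated pairs would still clear a candidate threshold. The labeled_pairs dataset is assumed, and _cosine/embed come from the sketch above.

def false_positive_rate(labeled_pairs, threshold: float) -> float:
    # labeled_pairs: iterable of (prompt_a, prompt_b, is_paraphrase) tuples.
    unrelated = [(a, b) for a, b, is_para in labeled_pairs if not is_para]
    if not unrelated:
        return 0.0
    would_reuse = sum(
        1 for a, b in unrelated
        if _cosine(embed(a), embed(b)) >= threshold
    )
    return would_reuse / len(unrelated)

# Sweep candidate thresholds and pick the lowest one with an acceptable FP rate:
# for t in (0.80, 0.85, 0.90, 0.95):
#     print(t, false_positive_rate(labeled_pairs, t))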
Advantages
- Handles paraphrases
- Higher effective cache‑hit rate for conversational queries
Limitations
- Requires embeddings, vector storage, and careful tuning to avoid incorrect reuse
Choosing Between Exact‑Match and Semantic Caching
- Exact‑match caching – use when correctness and determinism matter and prompts are highly templated.
- Semantic caching – use when queries are natural language, paraphrases are common, and some approximation is acceptable in exchange for higher hit rates.
Hybrid approach
An effective production design usually combines both (a sketch follows the list):
- Try exact‑match first.
- If it misses, fall back to semantic search.
- Store both kinds of keys and de‑duplicate on insertion.
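Putting the two layers together, a minimal sketch (reusing make_key, kv_store, and semantic_cached_completion from the earlier sketches) might look like this:

def hybrid_cached_completion(prompt: str, system_prompt: str = "", model: str = "gpt-x") -> str:
    # 1. Exact-match first: cheap hash lookup, no false-positive risk.
    key = make_key(prompt, system_prompt, model)
    cached = kv_store.get(key)
    if cached is not None:
        return cached

    # 2. Miss: fall back to the semantic layer (which itself calls the LLM on a miss).
    response = semantic_cached_completion(prompt)

    # 3. De-duplicate on insertion: also record the exact key so the next
    #    identical prompt is served from the cheaper exact path.
    kv_store[key] = response
    return response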
Metrics, Monitoring, and Operational Concerns
Key metrics to track
- Cache hit rate (exact / semantic)
- End‑to‑end latency for cache hits vs. misses
- Cost saved (tokens / compute avoided)
- False‑reuse incidents (semantic false positives) and user impact
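A lightweight way to start tracking the hit‑rate metrics above is a plain counter that the cache path increments; the sketch below is illustrative and not tied to any particular monitoring stack.

from collections import Counter

# Increment stats["exact_hit"], stats["semantic_hit"], or stats["miss"] in the cache path.
stats = Counter()

def cache_report(stats: Counter) -> dict:
    total = sum(stats.values()) or 1
    hits = stats["exact_hit"] + stats["semantic_hit"]
    return {
        "hit_rate": hits / total,
        "exact_hit_rate": stats["exact_hit"] / total,
        "semantic_hit_rate": stats["semantic_hit"] / total,
        "miss_rate": stats["miss"] / total,
    }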
Operational concerns
- Eviction policy & TTL – balance storage costs and freshness (see the sketch after this list).
- Model upgrades – invalidate or tag cache entries produced by older model versions (or bump schema version).
- Privacy & sensitivity – avoid caching PII or sensitive outputs unless encrypted and access‑controlled.
- Auditability – log when responses were served from cache and the matched key/score.
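As a sketch of the first two points, assuming redis-py; the key layout and the 24‑hour TTL are illustrative choices, not recommendations from this post.

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
CACHE_TTL_SECONDS = 24 * 60 * 60      # illustrative freshness window

def put_cached(key: str, response: str) -> None:
    # SETEX stores the value with an expiry, so stale entries age out on their own.
    r.setex(key, CACHE_TTL_SECONDS, response)

def get_cached(key: str):
    return r.get(key)                 # None if missing or expired

# Model upgrades: bumping schema_version (or model) in make_key changes every key,
# so entries produced by an older model version simply stop being hit.
# key = make_key(user_prompt, system_prompt, model="new-model", schema_version="v2")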
Implementation & Code
Want to see working examples? Check out the implementation with code:
Repository: VaibhavAhluwalia / llm-caching-systems
Practical implementations and experiments for building fast, scalable, and cost‑efficient Large Language Model (LLM) applications using caching techniques.
The repository includes:
- Interactive notebooks demonstrating both caching strategies
- A requirements.txt file for easy setup
Conclusion and What’s Next (Part 2)
Exact‑match and semantic caching are foundational. Together they allow LLM systems to be faster and cheaper while retaining the benefits of large models.
In Part 2 of this series we’ll cover other techniques.
What caching strategies have worked best in your LLM projects? Share your experiences in the comments below!
Connect with Me
- GitHub: @VaibhavAhluwalia
- LinkedIn: Vaibhav Ahluwalia
