# Redis Caching in RAG: Normalized Queries, Semantic Traps & What Actually Worked
Source: Dev.to
## Why Redis Caching Works for RAG
RAG pipelines are expensive because they repeatedly perform:
- Embedding generation
- Vector retrieval
- Context assembly
- LLM inference
For many user questions, especially in internal tools, the answer doesn’t change between requests. Redis provides:
- Sub‑millisecond reads
- TTL‑based eviction
- Simple operational model
- Predictable cost
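A minimal sketch of that read-through pattern, assuming a redis-py-style client that exposes `get` and `setex` (the function name `get_or_compute` is ours, not from the article):

```python
import json


def get_or_compute(client, key: str, ttl_seconds: int, compute):
    """Look up a cached RAG answer; on a miss, compute it and store it with a TTL."""
    cached = client.get(key)
    if cached is not None:  # sub-millisecond hit path
        return json.loads(cached)
    answer = compute()  # expensive path: retrieval + context assembly + LLM inference
    client.setex(key, ttl_seconds, json.dumps(answer))  # TTL-based eviction
    return answer
```

With a real `redis.Redis()` client this wraps the existing pipeline in a few lines, and TTL is the only eviction policy needed to start.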
## What a Normalized Query Really Means

### The Problem
Different phrasings of the same intent generate different cache keys:
```
"Explain docker networking"
"Can you explain Docker networking?"
"docker networking explained"
```
If we hash the raw query, Redis treats each as a distinct key, resulting in low hit rates.
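The mismatch is easy to demonstrate by hashing the three raw phrasings directly:

```python
import hashlib

queries = [
    "Explain docker networking",
    "Can you explain Docker networking?",
    "docker networking explained",
]
# One intent, but hashing raw strings yields three distinct cache keys
raw_keys = {hashlib.sha256(q.encode("utf-8")).hexdigest() for q in queries}
# len(raw_keys) == 3 -> every phrasing misses the others' cache entries
```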
### Goal
- Improve cache‑hit rate
- Avoid returning wrong answers
### Safe Normalizations
- Lowercasing
- Trimming whitespace
- Removing punctuation
- Collapsing filler phrases
### Dangerous Normalizations
- Removing numbers
- Collapsing version strings
- Replacing domain terms
- Synonym substitution
- Semantic guessing
In RAG, a wrong cache hit is far worse than a miss.
## Text Normalization Example (Python)
```python
import re

FILLER_PHRASES = ["can you", "please", "tell me", "explain"]


def normalize_query(query: str) -> str:
    q = query.lower().strip()
    for phrase in FILLER_PHRASES:
        # Match whole words only, so "explained" is not mangled into "ed"
        q = re.sub(rf"\b{re.escape(phrase)}\b", "", q)
    q = re.sub(r"[^\w\s]", "", q)  # remove punctuation
    q = re.sub(r"\s+", " ", q)     # collapse whitespace
    return q.strip()
```
What this deliberately avoids:
- NLP stop‑word lists
- Embeddings
- Synonym expansion
Result: predictable and correct normalization.
## Building a Robust Cache Key
Beyond the normalized text, include model and retrieval configuration:
```python
import hashlib
import json


def cache_key(model_name: str, normalized_query: str, retrieval_config: dict) -> str:
    payload = json.dumps([model_name, normalized_query, retrieval_config], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```
This prevents:
- Reusing answers across different models
- Mixing retrieval strategies
- Silent correctness bugs
## Semantic Caching: When It’s Acceptable
Semantic caching can be used when:
- Questions are FAQs
- Answers are generic
- Correctness tolerance is high
- An exact‑cache fallback exists
### Safe pattern: two-tier caching

- **Exact cache** – uses the normalized query (authoritative)
- **Semantic cache** – optional, guarded, never authoritative
## Intent-Level Normalization for Structured Queries
When RAG involves non‑text queries (SQL, Athena, APIs, logs, metrics), the “query” is an intent plus constraints. Cache a canonical representation instead of raw text.
```json
{
  "source": "athena",
  "table": "deployments",
  "metrics": ["count"],
  "filters": {
    "status": "FAILED",
    "time_range": "LAST_7_DAYS"
  }
}
```
Hash the canonical JSON (e.g., after sorting keys) to obtain a deterministic cache key.
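A sketch of that hashing step (the function name is ours): `sort_keys` plus fixed separators make the serialization canonical, so two intents that differ only in key order produce the same key.

```python
import hashlib
import json


def intent_cache_key(intent: dict) -> str:
    # Canonical serialization: sorted keys, no incidental whitespace
    canonical = json.dumps(intent, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```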
## Final Setup
- Redis for fast cache storage
- Conservative text normalization for free‑form queries
- Intent‑level normalization for structured queries
- No semantic caching for critical paths
- TTL aligned with data freshness
## Results

- ~40% cost reduction
- Lower latency
- Zero correctness regressions
- Predictable behavior
Most importantly, the system regained trust.
## Takeaways
- Normalizing form, not meaning, is key
- Over‑normalization silently breaks RAG
- Semantic caching should be optional, never default
- Structured queries need intent‑level normalization
- Determinism beats cleverness
Caching in RAG isn’t just about saving tokens; it’s about preserving correctness. When normalization is done right, Redis becomes a superpower.
P.S. This is a deceptively hard problem with no one‑size‑fits‑all solution. Different RAG setups demand different normalization strategies based on how context is retrieved, structured, and validated. The approach described here is a conceptual guide, not a drop‑in implementation.