# Redis Caching in RAG: Normalized Queries, Semantic Traps & What Actually Worked
Source: Dev.to
## Why Redis Caching Works for RAG
RAG pipelines are expensive because they repeatedly perform:
- Embedding generation
- Vector retrieval
- Context assembly
- LLM inference
For many user questions, especially in internal tools, the answer doesn’t change between requests. Redis provides:
- Sub‑millisecond reads
- TTL‑based eviction
- Simple operational model
- Predictable cost
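A minimal sketch of that read-through pattern, assuming a redis-py-style client that exposes `get` and `setex` (the function name `get_or_compute` is ours, not from the article):

```python
import json


def get_or_compute(client, key: str, ttl_seconds: int, compute):
    """Look up a cached RAG answer; on a miss, compute it and store it with a TTL."""
    cached = client.get(key)
    if cached is not None:  # sub-millisecond hit path
        return json.loads(cached)
    answer = compute()  # expensive path: retrieval + context assembly + LLM inference
    client.setex(key, ttl_seconds, json.dumps(answer))  # TTL-based eviction
    return answer
```

With a real `redis.Redis()` client this wraps the existing pipeline in a few lines, and TTL is the only eviction policy needed to start.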
## What a Normalized Query Really Means

### The Problem
Different phrasings of the same intent generate different cache keys:
```
"Explain docker networking"
"Can you explain Docker networking?"
"docker networking explained"
```
If we hash the raw query, Redis treats each as a distinct key, resulting in low hit rates.
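The mismatch is easy to demonstrate by hashing the three raw phrasings directly:

```python
import hashlib

queries = [
    "Explain docker networking",
    "Can you explain Docker networking?",
    "docker networking explained",
]
# One intent, but hashing raw strings yields three distinct cache keys
raw_keys = {hashlib.sha256(q.encode("utf-8")).hexdigest() for q in queries}
# len(raw_keys) == 3 -> every phrasing misses the others' cache entries
```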
### Goal
- Improve cache‑hit rate
- Avoid returning wrong answers
### Safe Normalizations
- Lowercasing
- Trimming whitespace
- Removing punctuation
- Collapsing filler phrases
### Dangerous Normalizations
- Removing numbers
- Collapsing version strings
- Replacing domain terms
- Synonym substitution
- Semantic guessing
In RAG, a wrong cache hit is far worse than a miss.
## Text Normalization Example (Python)
```python
import re

FILLER_PHRASES = ["can you", "please", "tell me", "explain"]


def normalize_query(query: str) -> str:
    q = query.lower().strip()
    for phrase in FILLER_PHRASES:
        # Match whole words only, so "explained" is not mangled into "ed"
        q = re.sub(rf"\b{re.escape(phrase)}\b", "", q)
    q = re.sub(r"[^\w\s]", "", q)  # remove punctuation
    q = re.sub(r"\s+", " ", q)     # collapse whitespace
    return q.strip()
```
What this deliberately avoids:
- NLP stop‑word lists
- Embeddings
- Synonym expansion
Result: predictable and correct normalization.
## Building a Robust Cache Key
Beyond the normalized text, include model and retrieval configuration:
```python
import hashlib
import json


def cache_key(model_name: str, normalized_query: str, retrieval_config: dict) -> str:
    payload = json.dumps([model_name, normalized_query, retrieval_config], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```
This prevents:
- Reusing answers across different models
- Mixing retrieval strategies
- Silent correctness bugs
## Semantic Caching: When It’s Acceptable
Semantic caching can be used when:
- Questions are FAQs
- Answers are generic
- Correctness tolerance is high
- An exact‑cache fallback exists
### Safe pattern: two-tier caching

- **Exact cache** – uses the normalized query (authoritative)
- **Semantic cache** – optional, guarded, never authoritative
## Intent-Level Normalization for Structured Queries
When RAG involves non‑text queries (SQL, Athena, APIs, logs, metrics), the “query” is an intent plus constraints. Cache a canonical representation instead of raw text.
```json
{
  "source": "athena",
  "table": "deployments",
  "metrics": ["count"],
  "filters": {
    "status": "FAILED",
    "time_range": "LAST_7_DAYS"
  }
}
```
Hash the canonical JSON (e.g., after sorting keys) to obtain a deterministic cache key.
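A sketch of that hashing step (the function name is ours): `sort_keys` plus fixed separators make the serialization canonical, so two intents that differ only in key order produce the same key.

```python
import hashlib
import json


def intent_cache_key(intent: dict) -> str:
    # Canonical serialization: sorted keys, no incidental whitespace
    canonical = json.dumps(intent, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```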
## Final Setup
- Redis for fast cache storage
- Conservative text normalization for free‑form queries
- Intent‑level normalization for structured queries
- No semantic caching for critical paths
- TTL aligned with data freshness
## Results

- ~40% cost reduction
- Lower latency
- Zero correctness regressions
- Predictable behavior
Most importantly, the system regained trust.
## Takeaways
- Normalizing form, not meaning, is key
- Over‑normalization silently breaks RAG
- Semantic caching should be optional, never default
- Structured queries need intent‑level normalization
- Determinism beats cleverness
Caching in RAG isn’t just about saving tokens; it’s about preserving correctness. When normalization is done right, Redis becomes a superpower.
P.S. This is a deceptively hard problem with no one‑size‑fits‑all solution. Different RAG setups demand different normalization strategies based on how context is retrieved, structured, and validated. The approach described here is a conceptual guide, not a drop‑in implementation.