Redis Caching in RAG: Normalized Queries, Semantic Traps & What Actually Worked

Published: December 28, 2025 at 01:34 AM EST
2 min read
Source: Dev.to

Why Redis Caching Works for RAG

RAG pipelines are expensive because they repeatedly perform:

  • Embedding generation
  • Vector retrieval
  • Context assembly
  • LLM inference

For many user questions—especially in internal tools—the answer doesn’t change between requests. Redis provides:

  • Sub‑millisecond reads
  • TTL‑based eviction
  • Simple operational model
  • Predictable cost

What a Normalized Query Really Means

The Problem

Different phrasings of the same intent generate different cache keys:

"Explain docker networking"
"Can you explain Docker networking?"
"docker networking explained"

If we hash the raw query, Redis treats each as a distinct key, resulting in low hit rates.
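To see the problem concretely, hashing the three raw phrasings above yields three unrelated keys. This is a minimal sketch using SHA-256; any stable hash shows the same effect:

```python
import hashlib

# Three phrasings of the same intent produce three unrelated cache keys
queries = [
    "Explain docker networking",
    "Can you explain Docker networking?",
    "docker networking explained",
]
keys = {hashlib.sha256(q.encode()).hexdigest() for q in queries}
print(len(keys))  # 3 distinct keys -> every request is a cache miss
```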

Goal

  • Improve cache‑hit rate
  • Avoid returning wrong answers

Safe Normalizations

  • Lowercasing
  • Trimming whitespace
  • Removing punctuation
  • Collapsing filler phrases

Dangerous Normalizations

  • Removing numbers
  • Collapsing version strings
  • Replacing domain terms
  • Synonym substitution
  • Semantic guessing

In RAG, a wrong cache hit is far worse than a miss.

Text Normalization Example (Python)

import re

# Conservative filler phrases, matched as whole words so that words
# merely containing them (e.g. "explained") are left untouched
FILLER_PHRASES = ["can you", "please", "tell me", "explain"]

def normalize_query(query: str) -> str:
    q = query.lower().strip()

    for phrase in FILLER_PHRASES:
        q = re.sub(rf"\b{re.escape(phrase)}\b", "", q)

    q = re.sub(r"[^\w\s]", "", q)   # remove punctuation
    q = re.sub(r"\s+", " ", q)      # collapse whitespace

    return q.strip()

# normalize_query("Can you explain Docker networking?") -> "docker networking"

What this deliberately avoids:

  • NLP stop‑word lists
  • Embeddings
  • Synonym expansion

Result: predictable and correct normalization.

Building a Robust Cache Key

Beyond the normalized text, include model and retrieval configuration:

import hashlib
import json

def make_cache_key(model_name: str, normalized_query: str, retrieval_config: dict) -> str:
    payload = json.dumps(
        {"model": model_name, "query": normalized_query, "retrieval": retrieval_config},
        sort_keys=True,  # deterministic across processes, unlike built-in hash()
    )
    return hashlib.sha256(payload.encode()).hexdigest()

This prevents:

  • Reusing answers across different models
  • Mixing retrieval strategies
  • Silent correctness bugs

Semantic Caching: When It’s Acceptable

Semantic caching can be used when:

  • Questions are FAQs
  • Answers are generic
  • Correctness tolerance is high
  • An exact‑cache fallback exists

Safe pattern: two‑tier caching

  1. Exact cache – uses normalized query (authoritative)
  2. Semantic cache – optional, guarded, never authoritative
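The two-tier lookup can be sketched as follows. Here `exact_cache` is a plain dict standing in for a Redis GET, and `semantic_lookup` is a hypothetical hook for an optional vector-similarity lookup; both names are illustrative, not part of any library:

```python
from typing import Callable, Optional

def lookup(
    exact_cache: dict,
    key: str,
    semantic_lookup: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    # Tier 1: exact cache, keyed by the normalized query (authoritative)
    hit = exact_cache.get(key)
    if hit is not None:
        return hit
    # Tier 2: semantic cache, optional and guarded; its answers are
    # never written back into the exact tier
    if semantic_lookup is not None:
        return semantic_lookup(key)
    return None
```

Keeping the semantic tier behind an optional callback makes it easy to disable on critical paths without touching the exact cache.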

Intent‑Level Normalization for Structured Queries

When RAG involves non‑text queries (SQL, Athena, APIs, logs, metrics), the “query” is an intent plus constraints. Cache a canonical representation instead of raw text.

{
  "source": "athena",
  "table": "deployments",
  "metrics": ["count"],
  "filters": {
    "status": "FAILED",
    "time_range": "LAST_7_DAYS"
  }
}

Hash the canonical JSON (e.g., after sorting keys) to obtain a deterministic cache key.
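A sketch of that hashing step, where `intent_cache_key` is a hypothetical helper; sorting keys and fixing separators means two equivalent intents with different key order map to the same cache key:

```python
import hashlib
import json

def intent_cache_key(intent: dict) -> str:
    # sort_keys + fixed separators yield one canonical byte string per intent
    canonical = json.dumps(intent, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

a = {"source": "athena", "table": "deployments", "metrics": ["count"],
     "filters": {"status": "FAILED", "time_range": "LAST_7_DAYS"}}
b = {"filters": {"time_range": "LAST_7_DAYS", "status": "FAILED"},
     "metrics": ["count"], "table": "deployments", "source": "athena"}
print(intent_cache_key(a) == intent_cache_key(b))  # True: key order doesn't matter
```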

Final Setup

  • Redis for fast cache storage
  • Conservative text normalization for free‑form queries
  • Intent‑level normalization for structured queries
  • No semantic caching for critical paths
  • TTL aligned with data freshness
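The pieces above combine into a get-or-compute pattern. redis-py's `get`/`set(ex=...)` have this shape; a minimal in-memory stand-in keeps the sketch runnable without a server (the `FakeRedis` class is illustrative only):

```python
import time
from typing import Callable

class FakeRedis:
    """In-memory stand-in for redis.Redis (get / set with ex=TTL seconds)."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        value, expires_at = self._data.get(key, (None, 0))
        return value if time.monotonic() < expires_at else None
    def set(self, key, value, ex):
        self._data[key] = (value, time.monotonic() + ex)

def get_or_compute(client, key: str, ttl: int, compute: Callable[[], str]) -> str:
    cached = client.get(key)
    if cached is not None:
        return cached               # hit: skip the whole RAG pipeline
    answer = compute()              # miss: run retrieval + LLM inference
    client.set(key, answer, ex=ttl) # TTL aligned with data freshness
    return answer
```

With a real deployment, `client = redis.Redis(...)` drops in unchanged, since only `get` and `set(ex=...)` are used.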

Results

  • ~40 % cost reduction
  • Lower latency
  • Zero correctness regressions
  • Predictable behavior

Most importantly, the system regained trust.

Takeaways

  • Normalizing form, not meaning, is key
  • Over‑normalization silently breaks RAG
  • Semantic caching should be optional, never default
  • Structured queries need intent‑level normalization
  • Determinism beats cleverness

Caching in RAG isn’t just about saving tokens; it’s about preserving correctness. When normalization is done right, Redis becomes a superpower.


P.S. This is a deceptively hard problem with no one‑size‑fits‑all solution. Different RAG setups demand different normalization strategies based on how context is retrieved, structured, and validated. The approach described here is a conceptual guide, not a drop‑in implementation.
