From Theory to Practice: Demystifying the Key-Value Cache in Modern LLMs

Published: December 5, 2025 at 07:51 AM EST
3 min read
Source: Dev.to

Introduction: What Is the Key‑Value Cache and Why Do We Need It?

KV Cache illustration

My Journey into the LLM Landscape

While I don’t come from a traditional data‑science or deep‑learning background, the past few years of working with AI and generative models have taught me to learn concepts like the KV cache pragmatically: by reading blogs and technical books, and by experimenting with sample code. This hands‑on approach helps translate mathematical ideas into concrete, functional components.

Source: NVIDIA blog

The KV Cache (Key‑Value Cache) is a dedicated memory space that stores the intermediate Key (K) and Value (V) vectors produced by the self‑attention mechanism for tokens that have already been processed. By re‑using these vectors during subsequent inference steps, the cache dramatically speeds up generation.

Why KV Cache Is Necessary

Transformer models (e.g., the GPT family) generate text autoregressively: each new token is predicted from all previously generated tokens. Without a KV cache, the model must recompute K and V for the entire sequence at every step, so the total computational cost grows quadratically, O(n²), with sequence length n. This makes long‑range generation prohibitively slow and expensive.
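
To make that growth concrete, here is a small counting sketch (my own illustration, not from the original post); the function names and the idea of counting one unit of work per K/V computation are assumptions made purely for the example.

# Hypothetical cost comparison: count K/V computations needed to generate n tokens.
def kv_computations_without_cache(num_tokens):
    # At step t, K and V must be recomputed for all t tokens seen so far.
    return sum(range(1, num_tokens + 1))  # grows as O(n^2)

def kv_computations_with_cache(num_tokens):
    # At step t, only the newest token's K and V are computed.
    return num_tokens  # grows as O(n)

for n in (10, 100, 1000):
    print(f"n={n:5d}  without cache: {kv_computations_without_cache(n):7d}  "
          f"with cache: {kv_computations_with_cache(n):5d}")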

How KV Cache Works

  1. Prefill Phase (First Token / Prompt)
    The model processes the input prompt, computing Q, K, and V for every token. The K and V vectors are stored in the KV cache.

  2. Decode Phase (Subsequent Tokens)

    • Only the Query (Q) for the newly generated token is computed.
    • Previously computed K and V vectors are retrieved directly from the cache.
    • The new token’s own K and V vectors are calculated and appended to the cache.
  3. Result
    The attention computation for each new token scales linearly, O(n), rather than quadratically, yielding much faster inference (a rough tensor‑shape sketch follows below).
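
As a rough illustration of the shapes involved (the dimensions and variable names below are my own assumptions, not from the post), prefill fills the cache with K and V for the whole prompt, while each decode step computes a single query and appends one new K/V row:

import torch

d_model, prompt_len = 8, 5

# Prefill: K and V for every prompt token go into the cache.
cache_k = torch.randn(prompt_len, d_model)    # shape (5, 8)
cache_v = torch.randn(prompt_len, d_model)    # shape (5, 8)

# Decode: only the newest token produces a query, key, and value.
q_new = torch.randn(1, d_model)               # shape (1, 8)
k_new = torch.randn(1, d_model)
v_new = torch.randn(1, d_model)

# Append the new K/V to the cache, then attend with the single new query.
cache_k = torch.cat([cache_k, k_new], dim=0)  # shape (6, 8)
cache_v = torch.cat([cache_v, v_new], dim=0)  # shape (6, 8)

scores = (q_new @ cache_k.T) / d_model ** 0.5  # shape (1, 6)
weights = torch.softmax(scores, dim=-1)
output = weights @ cache_v                     # shape (1, 8)
print(output.shape)  # torch.Size([1, 8])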

Trade‑off

The primary trade‑off is increased GPU memory usage: the cached K and V tensors can dominate VRAM consumption for very long sequences or large batch sizes.

KV Cache memory trade‑off
Source: Sebastian Raschka, PhD
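
To get a feel for the scale, the sketch below estimates the cache footprint from model dimensions. The formula (two cached tensors, K and V, times layers, heads, head dimension, sequence length, batch size, and bytes per element) is the standard back‑of‑the‑envelope calculation; the concrete model numbers are illustrative assumptions, not figures from the post.

def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    # Two cached tensors (K and V) per layer, stored per head and per token.
    return 2 * num_layers * num_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Hypothetical 7B-class model: 32 layers, 32 heads, head_dim 128,
# serving a batch of 8 sequences of 4096 tokens in fp16 (2 bytes/element).
size = kv_cache_bytes(num_layers=32, num_heads=32, head_dim=128,
                      seq_len=4096, batch_size=8)
print(f"{size / 1024**3:.1f} GiB")  # 16.0 GiB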

Explanation Through Sample Conceptual Code

Below is a minimal Python example that simulates the attention step while using a KV cache.

# kv_cache_demo.py
KV_CACHE = {
    "keys": [],   # stored K vectors
    "values": []  # stored V vectors
}

def generate_next_token(new_token, sequence_so_far):
    """
    Simulates the attention step for a new token, using/updating the KV cache.
    Note: sequence_so_far is accepted only for illustration; thanks to the
    cache, the function never needs to re-read the earlier tokens themselves.
    """
    print(f"\n--- Processing Token: '{new_token}' ---")

    # 1️⃣ Compute Query for the new token
    Q_new = f"Q_vec({new_token})"
    print(f"1. Computed Query (Q): {Q_new}")

    # 2️⃣ Compute Key and Value for the new token only
    K_new = f"K_vec({new_token})"
    V_new = f"V_vec({new_token})"
    print(f"2. Computed Key (K) and Value (V): {K_new}, {V_new}")

    # 3️⃣ Build full attention matrices using cached + new vectors
    K_full = KV_CACHE["keys"] + [K_new]
    V_full = KV_CACHE["values"] + [V_new]
    print(f"3. Full Attention Keys (cached + new): {K_full}")

    # 4️⃣ Perform (conceptual) attention
    attention_output = f"Attention({Q_new}, {K_full}, {V_full})"
    print(f"4. Attention Calculation: {attention_output}")

    # 5️⃣ Update the cache
    KV_CACHE["keys"].append(K_new)
    KV_CACHE["values"].append(V_new)
    print(f"5. KV Cache updated – size now: {len(KV_CACHE['keys'])} tokens")

    return "Predicted_Token"

# ---- Demo ----
print("=== Initial Prompt Phase: 'Hello, world' ===")
prompt_tokens = ["Hello,", "world"]

# Process prompt tokens
generate_next_token(prompt_tokens[0], [])
generate_next_token(prompt_tokens[1], prompt_tokens[:1])

print("\n=== Generation Phase: Predicting the 3rd token ===")
next_token = "(Model predicts 'how')"
generate_next_token(next_token, prompt_tokens)

Key Takeaways from the Code

  • Cache Utilization: When processing a new token, the model re‑uses KV_CACHE['keys'] and KV_CACHE['values'], which already contain vectors for all previous tokens.
  • Minimal Computation: Only the query, key, and value for the newest token are computed at each step.
  • Efficiency: Without the cache, the model would need to recompute K and V for every token in the entire history at every step, leading to redundant work.

The concepts demonstrated here can be extended to a full PyTorch implementation of multi‑head attention, where the cache is managed as tensors (self.cache_k, self.cache_v) and the query size remains constant.
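
As a hedged sketch of what such an extension might look like (the module below, its names, and its dimensions are my own assumptions rather than code from the post, and causal masking during prefill is omitted for brevity):

import torch
import torch.nn as nn

class CachedMultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention with a KV cache (illustrative sketch)."""

    def __init__(self, d_model=64, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.cache_k = None  # (batch, heads, cached_len, head_dim)
        self.cache_v = None

    def forward(self, x):
        # x: (batch, new_tokens, d_model). The whole prompt during prefill,
        # a single token per step during decode.
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split_heads(z):
            # (batch, tokens, d_model) -> (batch, heads, tokens, head_dim)
            return z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)

        # Append the new K/V to the cache (or initialize it during prefill).
        if self.cache_k is None:
            self.cache_k, self.cache_v = k, v
        else:
            self.cache_k = torch.cat([self.cache_k, k], dim=2)
            self.cache_v = torch.cat([self.cache_v, v], dim=2)

        # Attend the new queries over all cached keys and values.
        scores = q @ self.cache_k.transpose(-2, -1) / self.head_dim ** 0.5
        weights = torch.softmax(scores, dim=-1)
        ctx = weights @ self.cache_v                # (batch, heads, t, head_dim)
        ctx = ctx.transpose(1, 2).reshape(b, t, -1)  # merge heads back
        return self.out(ctx)

# Prefill with a 5-token prompt, then decode one token at a time.
attn = CachedMultiHeadAttention()
prompt = torch.randn(1, 5, 64)
_ = attn(prompt)                      # cache now holds 5 tokens
new_token = torch.randn(1, 1, 64)
out = attn(new_token)                 # query size stays 1; cache grows to 6
print(out.shape, attn.cache_k.shape)  # torch.Size([1, 1, 64]) torch.Size([1, 4, 6, 16])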
