From Theory to Practice: Demystifying the Key-Value Cache in Modern LLMs
Introduction — What Is the Key‑Value Cache and Why Do We Need It?

My Journey into the LLM Landscape
While I don’t come from a traditional data‑science or deep‑learning background, the past few years of working with AI and generative models have taught me to learn concepts like the KV cache pragmatically: by reading blogs and technical books, and by experimenting with sample code. This hands‑on approach helps translate mathematical ideas into concrete, functional components.

The KV Cache (Key‑Value Cache) is a dedicated memory space that stores the intermediate Key (K) and Value (V) vectors produced by the self‑attention mechanism for tokens that have already been processed. By re‑using these vectors during subsequent inference steps, the cache dramatically speeds up generation.
Why KV Cache Is Necessary
Transformer models (e.g., the GPT family) generate text autoregressively: each new token is predicted from all previously generated tokens. Without a KV cache, the model must recompute K and V for the entire sequence at every step, so the total cost of generating a sequence grows quadratically, O(n²), with its length n. This makes long‑range generation prohibitively slow and expensive.
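To make that difference concrete, here is a small back‑of‑the‑envelope sketch (my own illustration, not part of the original explanation) that counts how many K/V vectors must be computed to generate n tokens with and without a cache.

# kv_cost_comparison.py — rough count of K/V computations (illustrative only)

def kv_computations_without_cache(n_tokens: int) -> int:
    # At step t the model recomputes K and V for all t tokens seen so far,
    # so the total is 1 + 2 + ... + n = n(n + 1)/2, i.e. O(n²).
    return sum(range(1, n_tokens + 1))

def kv_computations_with_cache(n_tokens: int) -> int:
    # With a cache, each token's K and V are computed exactly once, i.e. O(n).
    return n_tokens

for n in (10, 100, 1000):
    print(f"n={n}: without cache={kv_computations_without_cache(n)}, "
          f"with cache={kv_computations_with_cache(n)}")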
How KV Cache Works
- Prefill Phase (First Token / Prompt)
  The model processes the input prompt, computing Q, K, and V for every token. The K and V vectors are stored in the KV cache.
- Decode Phase (Subsequent Tokens)
  - Only the Query (Q) for the newly generated token is computed.
  - Previously computed K and V vectors are retrieved directly from the cache.
  - The new token's own K and V vectors are calculated and appended to the cache.
- Result
  The attention computation for each new token scales linearly, O(n), rather than quadratically, yielding much faster inference.
Trade‑off
The primary trade‑off is increased GPU memory usage: the cached K and V tensors can dominate VRAM consumption for very long sequences or large batch sizes.

Source: Sebastian Raschka, PhD
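To put a rough number on that memory trade‑off, the sketch below estimates KV‑cache size with the usual formula: 2 (for K and V) × layers × heads × head_dim × sequence length × bytes per element. The model dimensions used here are hypothetical, chosen only for illustration.

# kv_cache_memory.py — rough KV-cache size estimate (hypothetical model dimensions)

def kv_cache_bytes(batch, n_layers, n_heads, head_dim, seq_len, bytes_per_elem=2):
    # The factor of 2 covers both K and V; bytes_per_elem=2 assumes fp16/bf16 storage.
    return 2 * batch * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

# Example: a hypothetical 7B-class model (32 layers, 32 heads, head_dim 128) at 4k context
size_gb = kv_cache_bytes(batch=1, n_layers=32, n_heads=32, head_dim=128, seq_len=4096) / 1e9
print(f"KV cache: ~{size_gb:.1f} GB")  # roughly 2.1 GB for a single sequence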
Explanation Through Sample Conceptual Code
Below is a minimal Python example that simulates the attention step while using a KV cache.
# kv_cache_demo.py
KV_CACHE = {
    "keys": [],    # stored K vectors
    "values": []   # stored V vectors
}

def generate_next_token(new_token, sequence_so_far):
    """
    Simulates the attention step for a new token, using/updating the KV cache.
    """
    print(f"\n--- Processing Token: '{new_token}' ---")

    # 1️⃣ Compute Query for the new token
    Q_new = f"Q_vec({new_token})"
    print(f"1. Computed Query (Q): {Q_new}")

    # 2️⃣ Compute Key and Value for the new token only
    K_new = f"K_vec({new_token})"
    V_new = f"V_vec({new_token})"
    print(f"2. Computed Key (K) and Value (V): {K_new}, {V_new}")

    # 3️⃣ Build full attention matrices using cached + new vectors
    K_full = KV_CACHE["keys"] + [K_new]
    V_full = KV_CACHE["values"] + [V_new]
    print(f"3. Full Attention Keys (cached + new): {K_full}")

    # 4️⃣ Perform (conceptual) attention
    attention_output = f"Attention({Q_new}, {K_full}, {V_full})"
    print(f"4. Attention Calculation: {attention_output}")

    # 5️⃣ Update the cache
    KV_CACHE["keys"].append(K_new)
    KV_CACHE["values"].append(V_new)
    print(f"5. KV Cache updated – size now: {len(KV_CACHE['keys'])} tokens")

    return "Predicted_Token"

# ---- Demo ----
print("=== Initial Prompt Phase: 'Hello, world' ===")
prompt_tokens = ["Hello,", "world"]

# Process prompt tokens
generate_next_token(prompt_tokens[0], [])
generate_next_token(prompt_tokens[1], prompt_tokens[:1])

print("\n=== Generation Phase: Predicting the 3rd token ===")
next_token = "how"  # the token the model is assumed to predict next
generate_next_token(next_token, prompt_tokens)
Key Takeaways from the Code
- Cache Utilization: When processing a new token, the model re‑uses KV_CACHE['keys'] and KV_CACHE['values'], which already contain vectors for all previous tokens.
- Minimal Computation: Only the query, key, and value for the newest token are computed at each step.
- Efficiency: Without the cache, the model would need to recompute K and V for every token in the entire history at every step, leading to redundant work.
The concepts demonstrated here can be extended to a full PyTorch implementation of multi‑head attention, where the cache is managed as tensors (self.cache_k, self.cache_v) and the query size remains constant.
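As a rough illustration of what that extension might look like (a minimal sketch of my own, not the article's implementation, assuming PyTorch 2.x for F.scaled_dot_product_attention), the module below keeps self.cache_k and self.cache_v as growing tensors and computes Q, K, and V only for the tokens passed in on each call.

# torch_kv_attention.py — minimal causal multi-head attention with a KV cache (illustrative sketch)
import torch
import torch.nn as nn
import torch.nn.functional as F

class CachedMultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        self.cache_k = None  # (batch, heads, cached_len, head_dim)
        self.cache_v = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x contains only the *new* tokens: (batch, new_len, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, new_len, head_dim)
        q, k, v = (z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2) for z in (q, k, v))

        if self.cache_k is not None:
            # reuse K/V computed on previous calls and append the new ones
            k = torch.cat([self.cache_k, k], dim=2)
            v = torch.cat([self.cache_v, v], dim=2)
        self.cache_k, self.cache_v = k, v

        # Prefill (no cache yet) needs a causal mask; a single decoded token
        # simply attends to everything already in the cache.
        is_prefill = k.size(2) == t
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=is_prefill)
        return self.out(attn.transpose(1, 2).contiguous().view(b, t, -1))

# usage: prefill with the prompt, then decode one token at a time
mha = CachedMultiHeadAttention(d_model=64, n_heads=4)
_ = mha(torch.randn(1, 5, 64))   # prefill: cache now holds K/V for 5 prompt tokens
_ = mha(torch.randn(1, 1, 64))   # decode: only the new token's Q, K, V are computed

Note that the query tensor per decode step stays a single token wide, which is exactly why the per-step cost stops growing with the history length.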