From Theory to Practice: Dissecting the Key-Value Cache in Modern LLMs

Published: December 5, 2025, 9:51 PM GMT+9

Source: Dev.to

Introduction: What Is the Key‑Value Cache and Why Do We Need It?

[Figure: KV cache illustration]

My Journey into the LLM Landscape

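I don't have a traditional data science or deep learning background, but over the past few years of working with AI and generative models I have picked up concepts like the KV cache in a practical way: reading blogs and technical books and experimenting with sample code. This hands-on approach helps turn mathematical ideas into concrete, working components.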

Source: NVIDIA blog

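The KV cache (Key‑Value cache) is a dedicated region of memory that stores the intermediate Key (K) and Value (V) vectors produced by the self‑attention mechanism for tokens that have already been processed. Reusing these vectors in later inference steps can speed up generation considerably.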

Why KV Cache Is Necessary

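Transformer models (for example, the GPT family) generate text autoregressively: each new token is predicted from all of the tokens that came before it. Without a KV cache, the model must recompute K and V for the entire sequence at every step, so the attention cost of a single generation step grows as O(n²) in the sequence length n. This makes long generations very slow and expensive.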
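
To get a feel for how much redundant work this is, the sketch below simply counts how many tokens need their K and V projected during generation, with and without a cache. The function name `count_kv_projections` and the prompt and output lengths are made up for illustration; it counts projections only and ignores every other cost in a real model.

```python
def count_kv_projections(prompt_len: int, new_tokens: int, use_cache: bool) -> int:
    """Count how many per-token K/V projections are computed in total.

    Hypothetical helper for illustration: it models only the K/V projection
    work, not the attention scores or any other layer.
    """
    total = 0
    for step in range(new_tokens):
        seq_len = prompt_len + step  # tokens visible at this step
        if use_cache and step > 0:
            total += 1        # only the newest token is projected
        else:
            total += seq_len  # every visible token is projected again
    return total


without_cache = count_kv_projections(prompt_len=4, new_tokens=100, use_cache=False)
with_cache = count_kv_projections(prompt_len=4, new_tokens=100, use_cache=True)
print(f"without cache: {without_cache} projections")
print(f"with cache:    {with_cache} projections")
```

Under these assumed lengths, the uncached count grows roughly with the square of the number of generated tokens, while the cached count grows by one per token.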

How KV Cache Works

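Prefill Phase (First Token / Prompt)
The model processes the input prompt and computes Q, K, and V for every prompt token. The K and V vectors are stored in the KV cache.
Decode Phase (Subsequent Tokens)
- A Query (Q) is computed only for the newly generated token.
- The previously computed K and V vectors are read directly from the cache.
- The new token's K and V vectors are also computed and appended to the cache.
Result
The attention computation for each new token runs in O(n) time, linear in the current sequence length, which makes inference much faster.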
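
The two phases can be sketched with real numbers instead of placeholder strings. The following is a minimal single-head example in pure Python, under several assumptions: the 2‑dimensional projection matrices `W_Q`, `W_K`, `W_V` hold arbitrary toy values, the `KVCache` class and helper names are hypothetical, and the prefill step only fills the cache (a real model would also compute attention outputs for the prompt). It checks that decoding with the cache gives the same attention output as recomputing everything from scratch.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [dot(row, x) for row in W]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = math.sqrt(len(q))
    scores = [dot(q, k) / scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / norm for i in range(dim)]


# Toy projection matrices (arbitrary values, assumed for illustration only).
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.7]]
W_V = [[0.3, 0.9], [0.8, 0.4]]


class KVCache:
    """Hypothetical single-head cache holding one K and one V per token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def prefill(self, prompt):
        # Prefill: project every prompt token once and store the results.
        for x in prompt:
            self.keys.append(matvec(W_K, x))
            self.values.append(matvec(W_V, x))

    def decode(self, x_new):
        # Decode: project only the new token, append it, then attend.
        self.keys.append(matvec(W_K, x_new))
        self.values.append(matvec(W_V, x_new))
        q = matvec(W_Q, x_new)
        return attend(q, self.keys, self.values)


def attend_without_cache(sequence):
    """Recompute K and V for the whole sequence, then attend from the last token."""
    keys = [matvec(W_K, x) for x in sequence]
    values = [matvec(W_V, x) for x in sequence]
    q = matvec(W_Q, sequence[-1])
    return attend(q, keys, values)


prompt = [[1.0, 0.0], [0.0, 1.0]]  # two toy token embeddings
new_token = [0.5, 0.5]

cache = KVCache()
cache.prefill(prompt)
cached_out = cache.decode(new_token)
full_out = attend_without_cache(prompt + [new_token])
print("outputs match:", all(abs(a - b) < 1e-12 for a, b in zip(cached_out, full_out)))
```

The cached path performs one K and one V projection per decode step, while the uncached path redoes them for every token in the sequence; the attention result itself is unchanged.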

Trade‑off

The main trade-off is higher GPU memory usage. With very long sequences or large batch sizes, the cached K and V tensors can take up a large share of VRAM.

[Figure: KV cache memory trade‑off]
Source: (Sebastian Raschka, PhD)
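
A rough way to estimate the memory side of this trade-off is to multiply out the dimensions of the cached tensors. The sketch below assumes one K and one V tensor per layer, each of shape (batch, KV heads, sequence length, head dimension); the function `kv_cache_bytes` and the model configuration passed to it are hypothetical values chosen for illustration, not figures from the article.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size in bytes (bytes_per_value=2 assumes fp16/bf16)."""
    # Factor of 2: one tensor for K and one for V in every layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical configuration: 32 layers, 32 KV heads of dimension 128.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      seq_len=4096, batch_size=1)
print(f"{size / 2**30:.1f} GiB")
```

Because the estimate is linear in both `seq_len` and `batch_size`, doubling either one doubles the cache size under these assumptions.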

Explanation Through Sample Conceptual Code

Below is a minimal Python example that simulates the attention step while using a KV cache.

# kv_cache_demo.py
KV_CACHE = {
    "keys": [],   # stored K vectors
    "values": []  # stored V vectors
}

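def generate_next_token(new_token, sequence_so_far):
    """
    Simulates the attention step for a new token, using/updating the KV cache.

    `sequence_so_far` is shown only for context and is never read: everything
    needed from earlier tokens already sits in KV_CACHE.
    """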
    print(f"\n--- Processing Token: '{new_token}' ---")

    # 1️⃣ Compute Query for the new token
    Q_new = f"Q_vec({new_token})"
    print(f"1. Computed Query (Q): {Q_new}")

    # 2️⃣ Compute Key and Value for the new token only
    K_new = f"K_vec({new_token})"
    V_new = f"V_vec({new_token})"
    print(f"2. Computed Key (K) and Value (V): {K_new}, {V_new}")

    # 3️⃣ Build full attention matrices using cached + new vectors
    K_full = KV_CACHE["keys"] + [K_new]
    V_full = KV_CACHE["values"] + [V_new]
    print(f"3. Full Attention Keys (cached + new): {K_full}")

    # 4️⃣ Perform (conceptual) attention
    attention_output = f"Attention({Q_new}, {K_full}, {V_full})"
    print(f"4. Attention Calculation: {attention_output}")

    # 5️⃣ Update the cache
    KV_CACHE["keys"].append(K_new)
    KV_CACHE["values"].append(V_new)
    print(f"5. KV Cache updated – size now: {len(KV_CACHE['keys'])} tokens")

    return "Predicted_Token"

# ---- Demo ----
print("=== Initial Prompt Phase: 'Hello, world' ===")
prompt_tokens = ["Hello,", "world"]

# Process prompt tokens
generate_next_token(prompt_tokens[0], [])
generate_next_token(prompt_tokens[1], prompt_tokens[:1])

print("\n=== Generation Phase: Predicting the 3rd token ===")
next_token = "(Model predicts 'how')"
generate_next_token(next_token, prompt_tokens)

Key Takeaways from the Code

Cache Utilization: When processing a new token, the model re‑uses KV_CACHE['keys'] and KV_CACHE['values'], which already contain vectors for all previous tokens.
Minimal Computation: Only the query, key, and value for the newest token are computed at each step.
Efficiency: Without the cache, the model would need to recompute K and V for every token in the entire history at every step, leading to redundant work.

The concepts demonstrated here can be extended to a full PyTorch implementation of multi‑head attention, where the cache is managed as tensors (self.cache_k, self.cache_v). At each decode step the query covers only the newest token, so its size stays constant while the cache grows.
