Why Care About Prompt Caching in LLMs?

Published: March 13, 2026 at 01:09 PM EDT

10 min read

Source: Towards Data Science

Scaling Costs and Latency in RAG and AI Agents

We’ve talked a lot about what an incredible tool RAG is for leveraging the power of AI on custom data. Whether we’re dealing with plain LLM API requests, RAG applications, or more complex AI agents, one question remains constant:

How do these systems scale?
Specifically, what happens to cost and latency as the number of requests grows?

For advanced AI agents—which may contain multiple calls to an LLM for processing a single user query—these concerns become especially important.

Why Caching Matters

In practice, many input tokens are repeated across multiple requests:

Users often ask the same specific questions repeatedly.
System prompts and instructions are included in every query.
Even a single prompt triggers autoregressive, token‑by‑token generation, in which the same context is processed again at every step.

Applying the caching concept can dramatically reduce both cost and latency. According to the OpenAI documentation on Prompt Caching:

Latency can be cut by up to 80 %.
Input‑token costs can drop by up to 90 %.

Takeaway

By caching repeated prompt components and leveraging OpenAI’s prompt‑caching features, you can make your RAG pipelines and AI agents far more efficient as they scale. This optimization is essential for maintaining performance and controlling expenses in production‑grade AI applications.

What About Caching?

In general, caching is not a new idea in computing. At its core, a cache temporarily stores data so that future requests for the same data can be served faster. This yields two basic outcomes:

Cache hit – the requested data is found in the cache, allowing for a quick and cheap retrieval.
Cache miss – the data is not in the cache, forcing the application to access the original source, which is more expensive and time‑consuming.

A Typical Example: Web Browsers

First visit – The browser checks its cache for the URL.
- Result: cache miss → the browser must request the page from the remote server.
After the page loads – The browser stores the retrieved resources in its local cache.
Subsequent visit (e.g., 5 minutes later) – The browser looks for the page in its cache again.
- Result: cache hit → the page loads instantly without contacting the server.

This mechanism makes browsing faster and reduces network traffic.
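
The hit/miss logic of the browser example can be sketched in a few lines of Python. This is only an illustration: `fetch_from_server` is a hypothetical stand-in for a real network request, and the dictionary plays the role of the browser's local cache.

```python
cache = {}                         # url -> page content
stats = {"hits": 0, "misses": 0}

def fetch_from_server(url):
    """Hypothetical stand-in for a slow request to the remote server."""
    return f"<html>content of {url}</html>"

def get_page(url):
    if url in cache:               # cache hit: serve the local copy
        stats["hits"] += 1
        return cache[url]
    stats["misses"] += 1           # cache miss: go to the original source
    page = fetch_from_server(url)
    cache[url] = page              # store it for subsequent visits
    return page

get_page("https://example.com")    # first visit: miss
get_page("https://example.com")    # second visit: hit
print(stats)
```

A real browser cache also handles expiry and validation, which this sketch leaves out.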

Why Caching Is So Effective

Most systems do not access data uniformly; instead, they follow a skewed distribution where a small fraction of the data accounts for the majority of requests. Many real‑world applications obey the Pareto principle: roughly 80 % of the requests target 20 % of the data.

If requests were uniformly distributed, cache memory would need to be as large as primary memory, making it prohibitively expensive. By exploiting this skew, a relatively small cache can satisfy a large proportion of accesses, dramatically improving performance and reducing cost.
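
To see the effect of skew, here is a small simulation sketch. It assumes an LRU eviction policy, 1,000 distinct items, a cache that holds 10 % of them, and a Zipf-like popularity curve; all of these are illustrative choices, not measurements from a real system.

```python
import random
from collections import OrderedDict

def lru_hit_rate(requests, capacity):
    """Replay a request stream through an LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(requests)

rng = random.Random(0)                     # fixed seed for repeatability
items = list(range(1000))
n_requests = 20_000

uniform = rng.choices(items, k=n_requests)
zipf_weights = [1 / (rank + 1) for rank in items]   # popular items dominate
skewed = rng.choices(items, weights=zipf_weights, k=n_requests)

uniform_rate = lru_hit_rate(uniform, capacity=100)
skewed_rate = lru_hit_rate(skewed, capacity=100)
print(f"uniform: {uniform_rate:.2f}, skewed: {skewed_rate:.2f}")
```

With the same cache size, the skewed stream should produce a much higher hit rate than the uniform one, which is the whole reason a small cache is worth having.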

Prompt Caching and a Little Bit About LLM Inference

The caching concept – storing frequently used data somewhere and retrieving it from there instead of obtaining it again from its primary source – is used to improve the efficiency of LLM calls, dramatically reducing cost and latency.
Caching can be applied to many parts of an AI application, the most important of which is prompt caching. It can also be useful for other aspects, such as RAG retrieval or query‑response caching, but this post focuses solely on prompt caching.

How LLM Inference Works

LLM inference (using a trained model to generate text) is divided into two distinct stages:

Stage	What Happens	Primary Bottleneck
Pre‑fill	The entire prompt is processed at once to produce the first token.	Compute‑bound – heavy matrix multiplications.
Decoding	The last generated token is appended to the sequence and the next token is generated auto‑regressively.	Memory‑bound – the full context must be loaded from memory for each new token.

Example

Prompt:

What should I cook for dinner?

First token (after pre‑fill):

Here

Decoding iterations:

Here 
Here are 
Here are 5 
Here are 5 easy 
Here are 5 easy dinner 
Here are 5 easy dinner ideas

Without any caching, every decoding step would re‑process all of the previous tokens just to produce a single new one, which is highly inefficient.
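
A rough way to quantify this waste is to count how many token positions pass through the model over the whole generation. The sketch below uses whitespace splitting as a crude stand-in for a real tokenizer, so the counts are illustrative only.

```python
def tokens_processed(prompt_len, generated_len, reuse_previous_work):
    """Count token positions pushed through the model across all steps.

    Step 1 (pre-fill) always processes the whole prompt. Every later step
    either re-processes the full sequence so far, or only the newest token.
    """
    total = prompt_len                       # pre-fill yields the first token
    for step in range(1, generated_len):
        sequence_len = prompt_len + step     # prompt + tokens generated so far
        total += 1 if reuse_previous_work else sequence_len
    return total

prompt = "What should I cook for dinner?".split()   # 6 toy tokens
answer = "Here are 5 easy dinner ideas".split()     # 6 toy tokens

naive = tokens_processed(len(prompt), len(answer), reuse_previous_work=False)
reused = tokens_processed(len(prompt), len(answer), reuse_previous_work=True)
print(naive, reused)  # 51 11
```

The naive count grows quadratically with the length of the output, while the count with reuse grows linearly.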

KV (Key‑Value) Caching

KV caching solves the inefficiency above by storing the intermediate key and value tensors for the prompt and already‑generated tokens. On each decoding step the model:

Retrieves the cached KV tensors for the existing context.
Computes only the new token’s KV tensors.
Appends the new KV pair to the cache.

Thus, the model performs the minimum required computation for every new token.
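
The following toy sketch shows the idea on a single attention head written in plain Python. The weight matrices and embeddings are made-up numbers, and a real transformer has many layers and heads; the point is only that caching keys and values gives the same outputs with far fewer projections.

```python
import math

# Made-up 2x2 projection matrices for a single toy attention head.
W_Q = [[0.5, -0.2], [0.1, 0.3]]
W_K = [[0.4, 0.0], [-0.3, 0.2]]
W_V = [[1.0, 0.5], [0.0, -1.0]]

def project(x, w):
    """Multiply row vector x by matrix w (given as a list of rows)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over all cached positions."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [sum(w / norm * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def decode_without_cache(embeddings):
    outputs, kv_projections = [], 0
    for t in range(1, len(embeddings) + 1):
        context = embeddings[:t]
        keys = [project(x, W_K) for x in context]     # recomputed every step
        values = [project(x, W_V) for x in context]
        kv_projections += len(context)
        outputs.append(attend(project(context[-1], W_Q), keys, values))
    return outputs, kv_projections

def decode_with_cache(embeddings):
    outputs, kv_projections = [], 0
    k_cache, v_cache = [], []
    for x in embeddings:
        k_cache.append(project(x, W_K))               # only the new token
        v_cache.append(project(x, W_V))
        kv_projections += 1
        outputs.append(attend(project(x, W_Q), k_cache, v_cache))
    return outputs, kv_projections

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
full, full_cost = decode_without_cache(tokens)
cached, cached_cost = decode_with_cache(tokens)
print(full_cost, cached_cost)  # 10 4
```

Both functions return the same attention outputs; only the number of key/value projections differs.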

KV caching, however, works only for a single prompt and a single response.

Prompt Caching

Prompt caching extends KV caching across different prompts, users, and sessions. The idea is simple:

Identify the repeated prefix of a prompt (e.g., system prompt, instructions, retrieved context).
Compute its KV tensors once and store them.
Re‑use the stored KV tensors whenever a new request contains the same prefix.
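
These three steps can be sketched as a small lookup table keyed by the prefix. Everything here is hypothetical: `compute_kv` stands in for the expensive pre-fill pass, the "KV tensors" are just strings, and the caller splits the prompt into prefix and suffix by hand, whereas a real serving system detects the shared prefix itself.

```python
import hashlib

kv_store = {}        # prefix hash -> stored "KV tensors"
prefill_calls = []   # number of tokens processed by each pre-fill call

def compute_kv(tokens):
    """Hypothetical stand-in for the expensive pre-fill computation."""
    prefill_calls.append(len(tokens))
    return [f"kv({token})" for token in tokens]

def prefix_key(tokens):
    """Hash the exact token sequence so identical prefixes map to one entry."""
    return hashlib.sha256("\x00".join(tokens).encode()).hexdigest()

def run_request(prefix_tokens, suffix_tokens):
    key = prefix_key(prefix_tokens)
    if key not in kv_store:                  # compute once and store
        kv_store[key] = compute_kv(prefix_tokens)
    prefix_kv = kv_store[key]                # reuse on later requests
    return prefix_kv + compute_kv(suffix_tokens)

system = "You are a helpful cooking assistant.".split()
run_request(system, ["dinner?"])   # first request computes the prefix
run_request(system, ["lunch?"])    # second request reuses it
print(prefill_calls)  # [6, 1, 1]
```

The six-token prefix is processed once; each request afterwards only pays for its own suffix.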

Benefits

Cost reduction – you don’t pay for recomputing identical tokens.
Latency reduction – the model skips work that has already been done.
Particularly valuable for RAG pipelines or any application with large, repeated instructions.

Token‑Level Operation

Caching works at the token level. As long as two prompts share the same token prefix, the shared portion can be served from the cache, even if the suffixes differ. The shared tokens must be at the start of the prompt; otherwise a cache miss occurs.

Example – Cache Hit

Prompt 1
What should I cook for dinner?

Prompt 2
What should I cook for lunch?

The shared prefix “What should I cook for” yields a cache hit, saving computation for Prompt 2.

Example – Cache Miss

Prompt 1
Dinner time! What should I cook?

Prompt 2
Lunch time! What should I cook?

Because the first tokens differ (“Dinner” vs. “Lunch”), the cache cannot be reused, even though the semantics are similar.
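
The two examples can be checked with a short sketch that measures the shared token prefix. It assumes whitespace splitting as a crude stand-in for a real tokenizer, so the exact counts would differ with an actual model vocabulary.

```python
def tokenize(text):
    """Whitespace split: a crude stand-in for a real tokenizer."""
    return text.split()

def shared_prefix_length(tokens_a, tokens_b):
    """Number of leading tokens that are identical in both sequences."""
    length = 0
    for a, b in zip(tokens_a, tokens_b):
        if a != b:
            break
        length += 1
    return length

hit = shared_prefix_length(
    tokenize("What should I cook for dinner?"),
    tokenize("What should I cook for lunch?"),
)
miss = shared_prefix_length(
    tokenize("Dinner time! What should I cook?"),
    tokenize("Lunch time! What should I cook?"),
)
print(hit, miss)  # 5 0
```

In the second pair the prompts share four of their six tokens, yet the reusable prefix is empty because the very first token differs.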

Practical Rule of Thumb

Static information (system prompts, instructions, retrieved context) → place at the beginning of the model input.
Variable information (timestamps, user IDs, user‑specific queries) → place at the end of the prompt.

Following this ordering maximizes the chance of cache hits and lets you reap the full benefits of prompt caching.
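
A minimal sketch of this ordering rule follows. The function names, field labels, and sample values are all made up for illustration; the comparison at the end uses `os.path.commonprefix`, which works character by character on any strings.

```python
import os.path

SYSTEM_PROMPT = "You are a helpful cooking assistant.\nAlways return a numbered list.\n"
CONTEXT = "Retrieved context: weeknight recipes that take under 30 minutes.\n"

def build_prompt(user_id, timestamp, question):
    """Static parts first, variable parts last."""
    static_part = SYSTEM_PROMPT + CONTEXT
    variable_part = f"User: {user_id}\nTime: {timestamp}\nQuestion: {question}"
    return static_part + variable_part

def build_prompt_badly(user_id, timestamp, question):
    """Variable parts first: the prefixes diverge almost immediately."""
    return f"Time: {timestamp}\nUser: {user_id}\n" + SYSTEM_PROMPT + CONTEXT + question

requests = [
    ("u1", "2026-03-13T09:00", "What should I cook for dinner?"),
    ("u2", "2026-03-13T12:30", "What should I cook for lunch?"),
]

good = os.path.commonprefix([build_prompt(*r) for r in requests])
bad = os.path.commonprefix([build_prompt_badly(*r) for r in requests])
print(len(good), len(bad))
```

With the static parts first, the shared prefix covers the whole system prompt and context; with the timestamp first, it ends inside the timestamp.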

Getting Our Hands Dirty with the OpenAI API

Most frontier foundation models, such as GPT (OpenAI docs) and Claude (Claude cookbook), offer Prompt Caching directly in their APIs. In the OpenAI API the cache is scoped to an organization: requests from members of the same organization can hit each other's cached prefixes, and caches are never shared between organizations.

When a request is made, the model stores the prompt prefix in the cache.
Subsequent requests that contain the same prefix hit the cache, allowing the model to reuse pre‑computed calculations.
This lowers the billed cost of the repeated input tokens and speeds up response generation, which is especially valuable for enterprise‑scale AI applications where many users repeatedly send similar prompts.

Cache‑Retention Options

Retention type	Typical duration	Availability
In‑memory prompt cache	~5 – 10 minutes (up to 1 hour)	All models that support caching
Extended prompt cache	Up to 24 hours	Only on specific models (e.g., GPT‑5.2)

Note: On most recent models Prompt Caching is enabled by default, but you can still tweak the retention settings.

A Minimal Python Example

Below is a short script that demonstrates Prompt Caching with the OpenAI API. The example uses a very large shared prefix so the caching effect becomes obvious.

import os

from openai import OpenAI

# Read the key from the environment instead of hard-coding it in the script.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# ----------------------------------------------------------------------
# A huge prompt prefix (repeated 80×) to push the token count above the
# 1 024‑token threshold required for caching.
# ----------------------------------------------------------------------
prefix = """
You are a helpful cooking assistant.

Your task is to suggest simple, practical dinner ideas for busy people.
Follow these guidelines carefully when generating suggestions:

General cooking rules:
- Meals should take less than 30 minutes to prepare.
- Ingredients should be easy to find in a regular supermarket.
- Recipes should avoid overly complex techniques.
- Prefer balanced meals including vegetables, protein, and carbohydrates.

Formatting rules:
- Always return a numbered list.
- Provide 5 suggestions.
- Each suggestion should include a short explanation.

Ingredient guidelines:
- Prefer seasonal vegetables.
- Avoid exotic ingredients.
- Assume the user has basic pantry staples such as olive oil, salt, pepper, garlic, onions, and pasta.

Cooking philosophy:
- Favor simple home cooking.
- Avoid restaurant‑level complexity.
- Focus on meals that people realistically cook on weeknights.

Example meal styles:
- pasta dishes
- rice bowls
- stir fry
- roasted vegetables with protein
- simple soups
- wraps and sandwiches
- sheet‑pan meals

Diet considerations:
- Default to healthy meals.
- Avoid deep frying.
- Prefer balanced macronutrients.

Additional instructions:
- Keep explanations concise.
- Avoid repeating the same ingredients in every suggestion.
- Provide variety across the meal suggestions.
""" * 80   # repeat to exceed the caching threshold

# --------------------------------------------------------------
# Prompt 1 – first request (populates the cache)
# --------------------------------------------------------------
prompt1 = prefix + "What should I cook for dinner?"
response1 = client.responses.create(
    model="gpt-5.2",
    input=prompt1
)

print("\nResponse 1:")
print(response1.output_text)
print("\nUsage stats:")
print(response1.usage)

# --------------------------------------------------------------
# Prompt 2 – second request (reuses the cached prefix)
# --------------------------------------------------------------
prompt2 = prefix + "What should I cook for lunch?"
response2 = client.responses.create(
    model="gpt-5.2",
    input=prompt2
)

print("\nResponse 2:")
print(response2.output_text)
print("\nUsage stats:")
print(response2.usage)
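
To compare the two responses without reading the raw usage dump, a small helper can extract the cached share. This sketch assumes the usage object exposes `input_tokens` and `input_tokens_details.cached_tokens`, which is how the Responses API reported it at the time of writing; treat the field names as an assumption and check them against your SDK version. The stand-in objects reuse the token counts from the run described in this post.

```python
from types import SimpleNamespace

def cached_fraction(usage):
    """Share of input tokens that were served from the prompt cache."""
    details = getattr(usage, "input_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    return cached / usage.input_tokens if usage.input_tokens else 0.0

# Stand-in objects shaped like the usage stats, for illustration only.
first = SimpleNamespace(
    input_tokens=20_014,
    input_tokens_details=SimpleNamespace(cached_tokens=0),
)
second = SimpleNamespace(
    input_tokens=20_014,
    input_tokens_details=SimpleNamespace(cached_tokens=19_840),
)

print(f"{cached_fraction(first):.1%}")   # 0.0%
print(f"{cached_fraction(second):.1%}")  # 99.1%
```

In the script above you would call `cached_fraction(response1.usage)` and `cached_fraction(response2.usage)` instead of building the stand-ins.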

What Happens Under the Hood?

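Prompt 1 is processed in full (≈ 20 014 input tokens), since nothing is cached yet.
Prompt 2 reuses the cached prefix: 19 840 of its input tokens are served from the cache, and only the remaining part is processed from scratch.
- Uncached tokens ≈ 20 014 – 19 840 = 174. The cached tokens are not free: they are billed at a discounted rate (up to 90 % off, per the OpenAI documentation), so the saving on input cost is closer to 90 % than to 99 %.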
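
Cached input tokens are typically billed at a reduced rate rather than waived, so the effective saving depends on that discount. Here is the arithmetic as a small sketch, assuming the "up to 90 %" discount quoted earlier applies in full; the actual rate depends on the model and current pricing.

```python
def effective_input_tokens(total_tokens, cached_tokens, cached_discount):
    """Input cost expressed in full-price token equivalents."""
    uncached = total_tokens - cached_tokens
    return uncached + cached_tokens * (1 - cached_discount)

total, cached = 20_014, 19_840
billed = effective_input_tokens(total, cached, cached_discount=0.90)
savings = 1 - billed / total

print(f"{billed:.0f} full-price token equivalents, {savings:.1%} saved")
# 2158 full-price token equivalents, 89.2% saved
```

The 174 uncached tokens cost full price, and the 19 840 cached tokens cost the equivalent of 1 984 tokens, for a total of about 2 158.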

When Does Prompt Caching Pay Off?

OpenAI activates caching only after a 1 024‑token minimum is reached, and the cache can be retained for up to 24 hours (extended mode). Consequently, the biggest cost and latency benefits appear in large‑scale deployments where:

Many users interact with the same application daily.
Prompt prefixes are long and frequently repeated.

In such scenarios, Prompt Caching can dramatically lower token usage and improve response times for LLM‑powered applications.

On My Mind

Prompt caching is a powerful optimization for LLMs that can significantly improve the efficiency of AI applications—both in terms of cost and time. By reusing previous computations for identical prompt prefixes, the model can skip redundant calculations and avoid repeatedly processing the same input tokens. The result is faster responses and lower costs, especially in applications where large parts of prompts (e.g., system instructions or retrieved context) remain constant across many requests.

As AI systems scale and the number of LLM calls increases, these optimizations become increasingly important.

All images are by the author unless otherwise noted.