[Paper] Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks
Source: arXiv - 2601.06007v1
Overview
The paper “Don’t Break the Cache: An Evaluation of Prompt Caching for Long‑Horizon Agentic Tasks” investigates how prompt caching, an optimization offered by the major LLM providers, behaves when LLMs act as autonomous agents that call external tools (e.g., web search) over many turns. By measuring real‑world cost and latency on a large benchmark, the authors show that cache‑aware prompt structuring can cut API costs by up to roughly 80 % while also reducing time‑to‑first‑token.
Key Contributions
- First systematic quantification of prompt‑caching savings for multi‑turn, tool‑calling LLM agents.
- Cross‑provider comparison (OpenAI, Anthropic, Google) covering three distinct caching strategies:
- Full‑context caching (everything cached).
- System‑prompt‑only caching.
- Dynamic‑content‑excluded caching (static prompt cached, tool results and other volatile content omitted).
- Large‑scale empirical evaluation on DeepResearchBench (≈ 500 agent sessions, > 10 k‑token system prompts).
- Practical guidelines for arranging prompts and cache blocks to avoid “cache‑induced latency spikes.”
- Open‑source release of the benchmarking scripts and a detailed cost/latency analysis per provider.
Methodology
- Benchmark selection – DeepResearchBench consists of realistic research‑question answering tasks where an LLM agent must repeatedly invoke a web‑search tool, parse results, and synthesize an answer.
- Prompt design – Each session uses a ~10 k‑token system prompt that encodes task instructions, tool schemas, and a few static examples.
- Caching strategies –
- Full‑context: the entire prompt (system + user + tool results) is cached after the first call.
- System‑prompt‑only: only the static system prompt is cached; dynamic user turns and tool outputs are sent fresh each turn.
- Dynamic‑exclusion: static prompt cached, but any block that contains tool results is deliberately excluded from the cache.
- Metrics – For every turn the authors record:
- API cost (token‑based pricing).
- Time‑to‑first‑token (TTFT) as a latency proxy.
- Scale – Over 10 000 total API calls across the three providers, giving a large sample for the per‑provider cost and latency comparisons (a per‑turn measurement sketch follows this list).
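The per‑turn measurement is straightforward to reproduce. The sketch below is not the paper’s released harness: it times the first streamed token and reads the usage block from OpenAI’s streaming Chat Completions API, with a placeholder model name and prompts standing in for the benchmark’s ~10 k‑token system prompt.

```python
# Minimal sketch (not the authors' harness): measure time-to-first-token (TTFT)
# and token usage for one agent turn via OpenAI's streaming Chat Completions API.
# Model name and prompts are illustrative placeholders.
import time
from openai import OpenAI

client = OpenAI()

def timed_turn(messages, model="gpt-4o-mini"):
    start = time.perf_counter()
    ttft, usage = None, None
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
        stream_options={"include_usage": True},  # final chunk carries token counts
    )
    for chunk in stream:
        if ttft is None and chunk.choices and chunk.choices[0].delta.content:
            ttft = time.perf_counter() - start  # first visible token
        if chunk.usage is not None:
            usage = chunk.usage
    return ttft, usage

ttft, usage = timed_turn([
    {"role": "system", "content": "You are a research agent."},  # stand-in for the ~10k-token prompt
    {"role": "user", "content": "Summarize recent work on prompt caching."},
])
# Cached-token counts are read defensively; the exact usage field names depend on SDK version.
cached = getattr(getattr(usage, "prompt_tokens_details", None), "cached_tokens", 0) or 0
print(f"TTFT: {ttft:.2f}s  prompt tokens: {usage.prompt_tokens}  cached: {cached}")
```

API cost per turn is then derived by multiplying the reported fresh and cached token counts by the provider’s published prices.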
Results & Findings
| Provider | Strategy | Avg. Cost Reduction | Avg. TTFT Improvement |
|---|---|---|---|
| OpenAI | Dynamic‑exclusion | ≈ 78 % | +31 % |
| Anthropic | System‑prompt‑only | ≈ 65 % | +24 % |
| Google | Dynamic‑exclusion | ≈ 45 % | +13 % |
- Full‑context caching sometimes increased TTFT because the cache stored large blocks of dynamic tool output, forcing the model to re‑process irrelevant data each turn.
- Placing dynamic content at the end of the system prompt (so it can be excluded from the cache) yielded the most stable performance.
- The magnitude of savings varied by provider due to differences in how each service implements cache invalidation and token‑pricing granularity.
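Most of the headline savings follow from the gap between full‑price and cache‑hit input pricing. The toy calculation below is not taken from the paper: all prices and token counts are hypothetical, output tokens, dynamic context, and any cache‑write surcharge are ignored. It only illustrates why resending a ~10 k‑token static prompt on every turn dominates cost without caching.

```python
# Back-of-the-envelope sketch of where the cost reductions come from.
# All prices and token counts are hypothetical; real provider price sheets differ.

PRICE_INPUT = 3.00 / 1_000_000   # $/token for fresh (uncached) input tokens
PRICE_CACHED = 0.30 / 1_000_000  # $/token for cache-hit input tokens (assumed 90% discount)

def turn_cost(fresh_tokens: int, cached_tokens: int) -> float:
    """Input-side cost of one agent turn."""
    return fresh_tokens * PRICE_INPUT + cached_tokens * PRICE_CACHED

# A long-horizon session: a 10k-token static prompt resent on every one of 20 turns.
prompt_tokens, turns = 10_000, 20
no_cache = sum(turn_cost(prompt_tokens, 0) for _ in range(turns))
with_cache = turn_cost(prompt_tokens, 0) + sum(             # first turn populates the cache
    turn_cost(0, prompt_tokens) for _ in range(turns - 1)   # later turns hit it
)
print(f"no cache: ${no_cache:.3f}  cached: ${with_cache:.3f}  "
      f"saving: {100 * (1 - with_cache / no_cache):.0f}%")
```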
Practical Implications
- Cost‑effective agents – Production systems that run thousands of autonomous LLM agents (e.g., research assistants, automated help desks) can cut operational spend dramatically by enabling prompt caching and structuring prompts as recommended.
- Latency‑critical UX – Faster TTFT translates to snappier user experiences, especially important for real‑time assistants or chat‑based IDE plugins.
- Prompt engineering checklist:
- Keep the static system prompt separate from any turn‑specific or tool‑generated text.
- Append dynamic tool results after the cached block or store them in a separate “scratchpad” that is not cached.
- Use provider‑specific cache‑control markers (e.g., `cache_control` in Anthropic’s Messages API) to pin the static prefix in the cache and keep volatile sections out of it; see the sketch after this list.
- Infrastructure simplification – Since caching is handled by the provider, developers don’t need to build custom memoization layers; they only need to format prompts correctly.
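As a concrete instance of the checklist, the sketch below uses Anthropic’s Messages API, where a `cache_control` block marks the cache breakpoint at the end of the static prefix. The model id, prompts, and inlined tool result are illustrative placeholders, not the paper’s setup.

```python
# Sketch of the "static prefix cached, dynamic content excluded" layout using
# Anthropic's Messages API. A cache_control block sets an explicit cache breakpoint.
import anthropic

client = anthropic.Anthropic()

STATIC_SYSTEM_PROMPT = "You are a research agent. <tool schemas and static examples go here>"

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model id; substitute your own
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            # Everything up to and including this block is eligible for caching;
            # anything after it (user turns, tool results) stays out of the cached prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "What does recent work say about prompt caching?"},
        {"role": "assistant", "content": "Let me search."},
        # Volatile tool output, inlined here as plain text for illustration only.
        {"role": "user", "content": "[tool result] ...search snippets (uncached)..."},
    ],
)
# usage reports cache_creation_input_tokens / cache_read_input_tokens, i.e. cache writes vs. hits.
print(response.usage)
```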
Limitations & Future Work
- The study focuses on single‑agent, single‑task workloads; multi‑agent collaboration or branching conversations may exhibit different caching dynamics.
- Only three commercial providers were examined; emerging open‑source LLM serving stacks (e.g., vLLM, Llama‑cpp) could behave differently.
- The benchmark uses a 10 k‑token system prompt, which is larger than typical production prompts; results for smaller prompts may show diminished relative savings.
- Future research could explore adaptive caching policies that automatically toggle cache blocks based on observed latency or cost trends, and extend the evaluation to other tool types (e.g., code execution, database queries).
Authors
- Elias Lumer
- Faheem Nizar
- Akshaya Jangiti
- Kevin Frank
- Anmol Gulati
- Mandar Phadate
- Vamse Kumar Subbiah
Paper Information
- arXiv ID: 2601.06007v1
- Categories: cs.CL
- Published: January 9, 2026