[Paper] Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks
Source: arXiv - 2601.06007v1
Overview
The paper “Don’t Break the Cache: An Evaluation of Prompt Caching for Long‑Horizon Agentic Tasks” investigates how prompt caching, an optimization offered by the major LLM providers, behaves when LLMs act as autonomous agents that call external tools (e.g., web search) over many turns. By measuring real‑world cost and latency on a large benchmark, the authors show that cache‑aware prompt structuring can cut API costs by up to roughly 80 % while also reducing time‑to‑first‑token.
Key Contributions
- First systematic quantification of prompt‑caching savings for multi‑turn, tool‑calling LLM agents.
- Cross‑provider comparison (OpenAI, Anthropic, Google) covering three distinct caching strategies:
- Full‑context caching (everything cached).
- System‑prompt‑only caching.
- Dynamic‑content‑excluded caching (static prompt cached, tool results and other volatile content omitted).
- Large‑scale empirical evaluation on DeepResearchBench (≈ 500 agent sessions, > 10 k‑token system prompts).
- Practical guidelines for arranging prompts and cache blocks to avoid “cache‑induced latency spikes.”
- Open‑source release of the benchmarking scripts and a detailed cost/latency analysis per provider.
Methodology
- Benchmark selection – DeepResearchBench consists of realistic research‑question answering tasks where an LLM agent must repeatedly invoke a web‑search tool, parse results, and synthesize an answer.
- Prompt design – Each session uses a ~10 k‑token system prompt that encodes task instructions, tool schemas, and a few static examples.
- Caching strategies –
- Full‑context: the entire prompt (system + user + tool results) is cached after the first call.
- System‑prompt‑only: only the static system prompt is cached; dynamic user turns and tool outputs are sent fresh each turn.
- Dynamic‑exclusion: static prompt cached, but any block that contains tool results is deliberately excluded from the cache.
- Metrics – For every turn the authors record:
- API cost (token‑based pricing).
- Time‑to‑first‑token (TTFT) as a latency proxy.
- Scale – Over 10 000 total API calls across the three providers, giving a large sample for the per‑provider cost and latency comparisons (a per‑turn measurement sketch follows this list).
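The per‑turn measurement is straightforward to reproduce. The sketch below is not the paper’s released harness: it times the first streamed token and reads the usage block from OpenAI’s streaming Chat Completions API, with a placeholder model name and prompts standing in for the benchmark’s ~10 k‑token system prompt.

```python
# Minimal sketch (not the authors' harness): measure time-to-first-token (TTFT)
# and token usage for one agent turn via OpenAI's streaming Chat Completions API.
# Model name and prompts are illustrative placeholders.
import time
from openai import OpenAI

client = OpenAI()

def timed_turn(messages, model="gpt-4o-mini"):
    start = time.perf_counter()
    ttft, usage = None, None
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
        stream_options={"include_usage": True},  # final chunk carries token counts
    )
    for chunk in stream:
        if ttft is None and chunk.choices and chunk.choices[0].delta.content:
            ttft = time.perf_counter() - start  # first visible token
        if chunk.usage is not None:
            usage = chunk.usage
    return ttft, usage

ttft, usage = timed_turn([
    {"role": "system", "content": "You are a research agent."},  # stand-in for the ~10k-token prompt
    {"role": "user", "content": "Summarize recent work on prompt caching."},
])
# Cached-token counts are read defensively; the exact usage field names depend on SDK version.
cached = getattr(getattr(usage, "prompt_tokens_details", None), "cached_tokens", 0) or 0
print(f"TTFT: {ttft:.2f}s  prompt tokens: {usage.prompt_tokens}  cached: {cached}")
```

API cost per turn is then derived by multiplying the reported fresh and cached token counts by the provider’s published prices.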
Results & Findings
| Provider | Strategy | Avg. Cost Reduction | Avg. TTFT Improvement |
|---|---|---|---|
| OpenAI | Dynamic‑exclusion | ≈ 78 % | +31 % |
| Anthropic | System‑prompt‑only | ≈ 65 % | +24 % |
| Google | Dynamic‑exclusion | ≈ 45 % | +13 % |
- Full‑context caching sometimes increased TTFT because the cache stored large blocks of dynamic tool output, forcing the model to re‑process irrelevant data each turn.
- Placing dynamic content at the end of the system prompt (so it can be excluded from the cache) yielded the most stable performance.
- The magnitude of savings varied by provider due to differences in how each service implements cache invalidation and token‑pricing granularity.
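Most of the headline savings follow from the gap between full‑price and cache‑hit input pricing. The toy calculation below is not taken from the paper: all prices and token counts are hypothetical, output tokens, dynamic context, and any cache‑write surcharge are ignored. It only illustrates why resending a ~10 k‑token static prompt on every turn dominates cost without caching.

```python
# Back-of-the-envelope sketch of where the cost reductions come from.
# All prices and token counts are hypothetical; real provider price sheets differ.

PRICE_INPUT = 3.00 / 1_000_000   # $/token for fresh (uncached) input tokens
PRICE_CACHED = 0.30 / 1_000_000  # $/token for cache-hit input tokens (assumed 90% discount)

def turn_cost(fresh_tokens: int, cached_tokens: int) -> float:
    """Input-side cost of one agent turn."""
    return fresh_tokens * PRICE_INPUT + cached_tokens * PRICE_CACHED

# A long-horizon session: a 10k-token static prompt resent on every one of 20 turns.
prompt_tokens, turns = 10_000, 20
no_cache = sum(turn_cost(prompt_tokens, 0) for _ in range(turns))
with_cache = turn_cost(prompt_tokens, 0) + sum(             # first turn populates the cache
    turn_cost(0, prompt_tokens) for _ in range(turns - 1)   # later turns hit it
)
print(f"no cache: ${no_cache:.3f}  cached: ${with_cache:.3f}  "
      f"saving: {100 * (1 - with_cache / no_cache):.0f}%")
```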
Practical Implications
- Cost‑effective agents – Production systems that run thousands of autonomous LLM agents (e.g., research assistants, automated help desks) can cut operational spend dramatically by enabling prompt caching and structuring prompts as recommended.
- Latency‑critical UX – Faster TTFT translates to snappier user experiences, especially important for real‑time assistants or chat‑based IDE plugins.
- Prompt engineering checklist:
- Keep the static system prompt separate from any turn‑specific or tool‑generated text.
- Append dynamic tool results after the cached block or store them in a separate “scratchpad” that is not cached.
- Use provider‑specific cache‑control markers (e.g., `cache_control` in Anthropic’s Messages API) to pin the static prefix in the cache and keep volatile sections out of it; see the sketch after this list.
- Infrastructure simplification – Since caching is handled by the provider, developers don’t need to build custom memoization layers; they only need to format prompts correctly.
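As a concrete instance of the checklist, the sketch below uses Anthropic’s Messages API, where a `cache_control` block marks the cache breakpoint at the end of the static prefix. The model id, prompts, and inlined tool result are illustrative placeholders, not the paper’s setup.

```python
# Sketch of the "static prefix cached, dynamic content excluded" layout using
# Anthropic's Messages API. A cache_control block sets an explicit cache breakpoint.
import anthropic

client = anthropic.Anthropic()

STATIC_SYSTEM_PROMPT = "You are a research agent. <tool schemas and static examples go here>"

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model id; substitute your own
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            # Everything up to and including this block is eligible for caching;
            # anything after it (user turns, tool results) stays out of the cached prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "What does recent work say about prompt caching?"},
        {"role": "assistant", "content": "Let me search."},
        # Volatile tool output, inlined here as plain text for illustration only.
        {"role": "user", "content": "[tool result] ...search snippets (uncached)..."},
    ],
)
# usage reports cache_creation_input_tokens / cache_read_input_tokens, i.e. cache writes vs. hits.
print(response.usage)
```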
Limitations & Future Work
- The study focuses on single‑agent, single‑task workloads; multi‑agent collaboration or branching conversations may exhibit different caching dynamics.
- Only three commercial providers were examined; emerging open‑source LLM serving stacks (e.g., vLLM, Llama‑cpp) could behave differently.
- The benchmark uses a 10 k‑token system prompt, which is larger than typical production prompts; results for smaller prompts may show diminished relative savings.
- Future research could explore adaptive caching policies that automatically toggle cache blocks based on observed latency or cost trends, and extend the evaluation to other tool types (e.g., code execution, database queries).
Authors
- Elias Lumer
- Faheem Nizar
- Akshaya Jangiti
- Kevin Frank
- Anmol Gulati
- Mandar Phadate
- Vamse Kumar Subbiah
Paper Information
- arXiv ID: 2601.06007v1
- Categories: cs.CL
- Published: January 9, 2026