Can I Buy Your KV Cache?

Published: June 12, 2026 at 04:14 PM EDT

Source: Hacker News

Abstract: Right now, across the world, AI agents are repeating the same absurd act: to read one document, they each recompute it from scratch. Every agent re-runs prefill, the most compute-intensive step a large model takes, over identical text, only to rebuild a key-value (KV) cache identical to the one the agent before it just built. The same answer, computed a million times. We make a proposal that is almost offensively simple: compute it once. Let a publisher precompute a document’s KV cache, and let every other agent buy the right to load it and skip prefill. It works, and it is token-exact: loading a precomputed KV cache and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill’s attention scales with L^2), so a single reuse already pays it back. Then the part that matters: where the KV cache lives. Shipping it fails, because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. Hosting it provider-side, exactly as production prompt caching works, removes egress entirely. The size of the prize is set by our measured compute saving: serving one hot 3774-token document to 80M agents costs ~$1.5M to re-prefill but only ~$0.03M of reuse compute (49.7x less). The 0.1x cache-read tariff that APIs charge passes a 10x discount to users while sitting inside this measured envelope: the 10x discount is a floor that the measured ~50x compute saving clears, and the gap between the two is provider margin, worth millions of dollars per popular document. We frame the resulting agent-native prefill CDN and leave lossless KV compression and a cross-party payment layer as the open problems.
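The cost figures in the abstract can be checked with a few lines of arithmetic. The sketch below is not the paper's code. The agent count, token count, re-prefill cost, 49.7x saving, and 0.1x tariff are taken from the abstract. The KV layout (36 layers, 8 KV heads, head dimension 128, 2 bytes per value) and the egress price of $0.09 per GB are assumptions introduced here for illustration only, and should be replaced with the real model config and a real price list.

```python
"""Back-of-envelope check of the abstract's cost figures (a sketch, not the paper's code)."""

# Figures quoted in the abstract.
AGENTS = 80_000_000            # agents reading one hot document
DOC_TOKENS = 3774              # document length in tokens
REPREFILL_TOTAL_USD = 1.5e6    # approximate cost if every agent re-runs prefill
COMPUTE_SAVING = 49.7          # measured prefill-to-reuse compute ratio
TARIFF_RATIO = 0.1             # cache-read price as a fraction of the prefill price

# ASSUMPTIONS (not from the source): KV layout and egress price.
ASSUMED_LAYERS = 36
ASSUMED_KV_HEADS = 8
ASSUMED_HEAD_DIM = 128
ASSUMED_BYTES_PER_VALUE = 2        # fp16 or bf16 storage
ASSUMED_EGRESS_USD_PER_GB = 0.09   # hypothetical list price


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    """Uncompressed KV cache size; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def egress_usd(num_bytes: int, usd_per_gb: float) -> float:
    """Cost of shipping num_bytes once, using decimal gigabytes."""
    return num_bytes / 1e9 * usd_per_gb


# Compute side: total reuse cost implied by the measured saving.
reuse_total = REPREFILL_TOTAL_USD / COMPUTE_SAVING
prefill_per_agent = REPREFILL_TOTAL_USD / AGENTS

# Shipping side: what one agent would pay in egress to download the cache.
kv_bytes = kv_cache_bytes(DOC_TOKENS, ASSUMED_LAYERS, ASSUMED_KV_HEADS,
                          ASSUMED_HEAD_DIM, ASSUMED_BYTES_PER_VALUE)
egress_per_load = egress_usd(kv_bytes, ASSUMED_EGRESS_USD_PER_GB)

print(f"reuse compute, all agents: ${reuse_total:,.0f}")
print(f"prefill per agent:         ${prefill_per_agent:.5f}")
print(f"KV cache size (assumed):   {kv_bytes / 1e9:.2f} GB")
print(f"egress per load (assumed): ${egress_per_load:.5f}")
print(f"tariff discount {1 / TARIFF_RATIO:.0f}x vs compute saving {COMPUTE_SAVING}x")
```

With the abstract's numbers, the reuse total comes out near $0.03M, matching the quoted figure. Under the assumed layout and price, shipping the roughly 0.56 GB cache costs about $0.05 per load against roughly $0.019 of prefill per agent, which is one way to see why the abstract argues for hosting the cache provider-side rather than shipping it.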

Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Multiagent Systems (cs.MA)

Cite as: [arXiv:2606.13361](https://arxiv.org/abs/2606.13361) [cs.AI] (or [arXiv:2606.13361v1](https://arxiv.org/abs/2606.13361v1) [cs.AI] for this version)

DOI: [https://doi.org/10.48550/arXiv.2606.13361](https://doi.org/10.48550/arXiv.2606.13361) (arXiv-issued DOI via DataCite, pending registration)

Submission history

From: Luoyuan Zhang
[v1] Thu, 11 Jun 2026 13:47:33 UTC (113 KB)
