Running AI models is turning into a memory game

Published: February 17, 2026 at 11:44 AM EST
3 min read
Source: TechCrunch

When we talk about the cost of AI infrastructure, the focus is usually on Nvidia and GPUs — but memory is an increasingly important part of the picture. As hyperscalers prepare to build out billions of dollars worth of new data centers, the price for DRAM chips has jumped roughly 7× in the last year (source).

Memory Orchestration

At the same time, a discipline is emerging around orchestrating that memory so the right data reaches the right agent at the right time. Companies that master this can serve the same queries with fewer tokens, which can be the difference between going under and staying in business.

Semiconductor analyst Dan O’Laughlin discusses the importance of memory chips on his Substack, speaking with Val Bercovici, chief AI officer at Weka. Their conversation focuses on chips rather than broader architecture, but the implications for AI software are significant.

Anthropic Prompt‑Caching

A passage that stood out describes the growing complexity of Anthropic’s prompt‑caching documentation:

“The tell is if we go to Anthropic’s prompt caching pricing page. It started off as a very simple page six or seven months ago, especially as Claude Code was launching — just ‘use caching, it’s cheaper.’ Now it’s an encyclopedia of advice on exactly how many cache writes to pre‑buy. You’ve got 5‑minute tiers, which are very common across the industry, or 1‑hour tiers — and nothing above. That’s a really important tell. Then of course you’ve got all sorts of arbitrage opportunities around the pricing for cache reads based on how many cache writes you’ve pre‑purchased.”
Val Bercovici, interview with Dan O’Laughlin

The key question is how long Claude holds a prompt in cached memory. Users can pay for a 5‑minute window or a longer one‑hour window. Reading data that is still in the cache is much cheaper than processing it fresh, but each new piece of data added to a query may push something else out of the cache window.
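To make the mechanics concrete, here is a minimal sketch of what opting into prompt caching looks like in an Anthropic‑style Messages API request. The `cache_control` field with type `"ephemeral"` follows Anthropic's documented pattern; the model name is just an example, and exact field names and tiers should be checked against the current prompt‑caching docs.

```python
# Sketch of a Messages API request body with a prompt-cache breakpoint.
# Everything up to and including the marked block can be written to the
# cache on the first call and read back cheaply on subsequent calls,
# as long as the prefix matches and the cache window has not expired.

LONG_CONTEXT = "...many thousands of tokens of shared reference material..."

request_body = {
    "model": "claude-sonnet-4-20250514",  # example model name (assumption)
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_CONTEXT,
            # Mark this block as a cache breakpoint (default 5-minute tier).
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        {"role": "user", "content": "Summarize the reference material."}
    ],
}
```

Only the stable, reused prefix should sit before the breakpoint; anything that changes per query belongs after it, or the cache entry will miss.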

Takeaway: Managing memory in AI models will be a huge part of AI’s future. Companies that excel at it will rise to the top.

Progress in Cache Optimization

In October, a startup called TensorMesh was highlighted for working on a layer of the stack known as cache‑optimization (TechCrunch article).

Opportunities Across the Stack

  • Lower‑level hardware: Decisions about when to use DRAM versus HBM are deep hardware considerations that affect overall efficiency.
  • Higher‑level orchestration: End users are experimenting with structuring model swarms to take advantage of shared caches.

As companies improve memory orchestration, they will use fewer tokens, making inference cheaper. Simultaneously, models are becoming more efficient at processing each token (Ramp analysis), further driving down costs. As server expenses decline, applications that are currently marginal may become profitable.
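A back‑of‑the‑envelope calculation shows why cache reads change the economics. The rates below are illustrative assumptions (roughly the shape of published per‑million‑token pricing, with a premium on cache writes and a steep discount on cache reads), not quoted prices.

```python
# Illustrative cost comparison: 10 queries reusing a 100k-token context,
# with and without prompt caching. All rates are assumed for illustration.

BASE_INPUT = 3.00                  # $ per million input tokens (assumed)
CACHE_WRITE = BASE_INPUT * 1.25    # writes carry a premium (assumed 1.25x)
CACHE_READ = BASE_INPUT * 0.10     # reads heavily discounted (assumed 0.1x)

context_tokens = 100_000           # shared prompt reused across queries
queries = 10

# Without caching, the full context is billed at the base rate every time.
no_cache = queries * context_tokens * BASE_INPUT / 1e6

# With caching: one cache write, then cheap cache reads for the rest.
with_cache = (context_tokens * CACHE_WRITE / 1e6
              + (queries - 1) * context_tokens * CACHE_READ / 1e6)

print(f"without caching: ${no_cache:.2f}")    # $3.00
print(f"with caching:    ${with_cache:.3f}")  # $0.645
```

Under these assumed rates the cached workload costs roughly a fifth as much, which is the same lever the article describes: better memory orchestration means fewer full‑price tokens per query.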
