'Observational memory' cuts AI agent costs 10x and outscores RAG on long-context benchmarks

Published: February 10, 2026 at 04:30 PM EST
7 min read

Source: VentureBeat

RAG Limitations in Modern Agentic AI Workflows

RAG isn’t always fast enough or intelligent enough for modern agentic AI workflows. As teams move from short‑lived chatbots to long‑running, tool‑heavy agents embedded in production systems, those limitations are becoming harder to work around.

In response, teams are experimenting with alternative memory architectures — sometimes called contextual memory or agentic memory — that prioritize persistence and stability over dynamic retrieval.


Observational Memory

One of the more recent implementations of this approach is “observational memory,” an open‑source technology developed by Mastra, which was founded by the engineers who previously built and sold the Gatsby framework to Netlify.

  • Unlike RAG systems that retrieve context dynamically, observational memory uses two background agents (Observer and Reflector) to compress conversation history into a dated observation log.
  • The compressed observations stay in context, eliminating retrieval entirely.
    • For text content, the system achieves 3‑6× compression.
    • For tool‑heavy agent workloads generating large outputs, compression ratios hit 5‑40×.

Trade‑off: Observational memory prioritizes what the agent has already seen and decided over searching a broader external corpus, making it less suitable for open‑ended knowledge discovery or compliance‑heavy recall use cases.

Benchmark Results

| Model | LongMemEval score |
| --- | --- |
| GPT‑5‑mini (observational memory) | 94.87% |
| GPT‑4o (observational memory) | 84.23% |
| Mastra’s own RAG implementation | 80.05% |

“It has this great characteristic of being both simpler and it is more powerful, like it scores better on the benchmarks,” — Sam Bhagwat, co‑founder & CEO of Mastra (VentureBeat)


How It Works: Two Agents Compress History into Observations

The architecture is simpler than traditional memory systems but delivers better results.

  1. Context Window Split

    • Observations block – compressed, dated notes extracted from previous conversations.
    • Raw message block – the current session’s uncompressed messages.
  2. Background Agents

    • Observer – when unobserved messages reach 30 000 tokens (configurable), it compresses them into new observations and appends them to the observations block, then drops the original messages.
    • Reflector – when the observations block reaches 40 000 tokens (also configurable), it restructures and condenses the log, combining related items and removing superseded information.

“The way that you’re sort of compressing these messages over time is you’re actually just sort of getting messages, and then you have an agent sort of say, ‘OK, so what are the key things to remember from this set of messages?’” — Sam Bhagwat

The format is text‑based, not structured objects. No vector databases or graph databases are required.
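As an illustration only (this is not Mastra’s actual format), a dated, text-based observation log for a content-management agent might read:

```text
[2026-01-20] User asked for a weekly report grouped by content type.
[2026-01-27] Agent created report template "weekly-by-type"; user approved it.
[2026-02-03] Decision: exclude draft posts from all reports (supersedes 01-20 note).
```

Because the log is plain dated text rather than structured objects, it can be read, diffed, and debugged with ordinary text tooling.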


Stable Context Windows Cut Token Costs Up to 10×

The economics of observational memory come from prompt caching. Providers such as Anthropic and OpenAI reduce token costs by 4‑10× for cached prompts versus uncached ones. Most memory systems can’t take advantage of this because they inject dynamically retrieved context each turn, invalidating the cache.

Why Observational Memory Is Cache‑Friendly

  • The observations block is append‑only until a reflection run.
  • The system prompt + existing observations form a consistent prefix that can be cached across many turns.
  • New messages are appended to the raw history block until the 30 000‑token threshold is hit. Every turn before that is a full cache hit.
  • When an observation run occurs, messages are replaced with new observations, but the observation prefix stays consistent, yielding a partial cache hit.
  • Only during a reflection run (which is infrequent) is the entire cache invalidated.
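A back-of-the-envelope sketch shows why the stable prefix matters. The prices below are hypothetical, and the 10× cache discount is an illustrative value from the 4–10× range quoted above:

```python
# Hypothetical per-million-token input prices (illustrative only).
UNCACHED_PRICE = 1.00                 # $ per 1M uncached input tokens
CACHED_PRICE = UNCACHED_PRICE / 10    # assume a 10x discount on cache hits

def turn_cost(prefix_tokens, new_tokens, prefix_cached):
    """Input cost of one turn: a (possibly cached) prefix plus new tokens."""
    prefix_price = CACHED_PRICE if prefix_cached else UNCACHED_PRICE
    return (prefix_tokens * prefix_price + new_tokens * UNCACHED_PRICE) / 1e6

# A 28,000-token observations prefix plus 2,000 new raw tokens per turn:
stable = turn_cost(28_000, 2_000, prefix_cached=True)    # stable prefix, cache hit
dynamic = turn_cost(28_000, 2_000, prefix_cached=False)  # retrieval rewrites prefix
print(f"cached prefix: ${stable:.4f}/turn  vs  uncached: ${dynamic:.4f}/turn")
```

Under these assumed numbers the cached-prefix turn costs roughly a sixth of the uncached one, which is how a stable context window translates into the cost reductions described above.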

The average context window size for Mastra’s LongMemEval benchmark run was around 30 000 tokens, far smaller than the full conversation history would require.


How This Differs From Traditional Compaction

| | Traditional compaction | Observational memory |
| --- | --- | --- |
| Process | Fill the context window, then compress the entire history into a summary when overflow is imminent. | Observer runs frequently on smaller chunks, producing an event‑based decision log. |
| Result | Documentation‑style summaries that capture the gist but lose specific events, decisions, and tool interactions. | Dated, prioritized observations that retain specific decisions and actions. |
| Cost | Large‑batch compression is computationally expensive and often discards details needed for consistent agent behavior. | Smaller, more frequent compressions are cheaper and preserve actionable details. |
| Structure | Summaries become a single blob. | Event‑based log persists; the Reflector only reorganizes and condenses it, never turning it into a blob. |

The log reads like a chronological record of decisions and actions, not a high‑level documentation summary.


Enterprise Use Cases: Long‑Running Agent Conversations

Mastra’s customers span several categories:

| Category | Example use cases |
| --- | --- |
| In‑app chatbots | Integrated with CMS platforms like Sanity or Contentful. |
| AI SRE systems | Help engineering teams triage alerts and incidents. |
| Document‑processing agents | Automate paperwork for traditional businesses moving toward digital workflows. |

Common requirement: Long‑running, stable context that can persist across many interactions without costly recomputation.


Observational Memory for Long‑Term Context

“One of the big goals for 2025 and 2026 has been building an agent inside their web app,” Bhagwat said about B2B SaaS companies.
“That agent needs to be able to remember that, like, three weeks ago, you asked me about this thing, or you said you wanted a report on this kind of content type, or views segmented by this metric.”

Why Memory Becomes a Product Requirement

  • Cross‑session continuity – Agents embedded in content‑management systems must recall that a user requested a specific report format weeks earlier.
  • Incident tracking – SRE agents need to remember which alerts were investigated and what decisions were made.
  • User experience – Forgetting prior decisions or preferences is immediately noticeable and degrades trust.

Observational memory keeps months of conversation history present and accessible, allowing agents to respond without forcing users to re‑explain preferences or previous decisions.

Recent Release

The system shipped as part of Mastra 1.0 and is now available. This week the team released plug‑ins for:

  • LangChain
  • Vercel’s AI SDK
  • Other popular frameworks

These plug‑ins let developers use observational memory outside the Mastra ecosystem.

What It Means for Production AI Systems

Observational memory offers a different architectural approach from the vector‑database‑and‑RAG pipelines that dominate current implementations.

| Benefit | Description |
| --- | --- |
| Simpler architecture | Text‑based, no specialized databases → easier to debug and maintain |
| Stable context window | Enables aggressive prompt caching → reduces inference costs |
| Strong benchmark performance | Scores up to 94.87% on LongMemEval, outscoring Mastra’s own RAG implementation |

Key Questions for Enterprise Teams

  1. How much context do your agents need to maintain across sessions?
  2. What is your tolerance for lossy compression versus full‑corpus search?
  3. Do you need the dynamic retrieval that RAG provides, or would stable context work better?
  4. Are your agents tool‑heavy, generating large amounts of output that needs compression?

The answers determine whether observational memory fits your use case.

Bhagwat positions memory as one of the top primitives needed for high‑performing agents, alongside:

  • Tool use
  • Workflow orchestration
  • Observability
  • Guardrails

For enterprise agents embedded in products, forgetting context between sessions is unacceptable. Users expect agents to remember their preferences, previous decisions, and ongoing work.

“The hardest thing for teams building agents is the production, which can take time,” Bhagwat said.
“Memory is a really important bit in that, because it’s just jarring if you use any sort of agentic tool and you sort of told it something and then it just kind of forgot it.”

Looking Ahead

As agents move from experiments to embedded systems of record, how teams design memory may matter as much as which model they choose.
