OpenClaw QMD: Local Hybrid Search for 10x Smarter Memory

Published: February 22, 2026 at 06:55 AM EST
6 min read
Source: Dev.to

Why Default Memory Fails at Scale

OpenClaw’s built‑in memory is simple:

  1. Append to MEMORY.md.
  2. Inject the whole file into every prompt.

It works fine at ~500 tokens, but it falls apart at ~5,000.

The problems compound

  • Token explosion: Every message pays the full context tax. A 10‑token query drags along ~4,000 tokens of memory; your $0.01 API call becomes $0.15.
  • Relevance collapse: The model sees everything and prioritises nothing. A question about “database connection pooling” weighs your lunch preferences equally.
  • No semantic understanding: Keyword matching alone misses synonyms. “DB connection” won’t find notes about “PostgreSQL pooling” unless you used those exact words.
  • Cloud dependency: Vector search usually means Pinecone, Weaviate, or another hosted service. Your private notes now live on someone else’s servers.

QMD solves all four

QMD indexes your markdown files locally, runs hybrid retrieval that combines three search strategies, and returns only the relevant snippets.

  • 700 characters max per result (six results by default).
  • Your 10,000‑token memory footprint becomes ≈ 200 tokens of gold.

What interviewers are actually testing: Can you explain the token economics of context injection?
Insight: Context length is O(n) cost, but relevance is what matters. Retrieval‑augmented generation (RAG) exists because “just include everything” doesn’t scale.
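The token economics above can be sketched with simple arithmetic. The per‑token price below is an assumed placeholder for illustration, not a quoted rate from any provider:

```javascript
// Cost of context injection grows linearly with injected memory.
// ASSUMPTION: $3 per 1M input tokens is an illustrative price, not a real rate.
const PRICE_PER_INPUT_TOKEN = 3 / 1_000_000;

function promptCost(queryTokens, memoryTokens) {
  return (queryTokens + memoryTokens) * PRICE_PER_INPUT_TOKEN;
}

// A 10-token query alone vs. the same query dragging 4,000 tokens of memory:
const bare = promptCost(10, 0);
const loaded = promptCost(10, 4000);
console.log(loaded / bare); // memory inflates the cost ~400x
```

The multiplier is what matters, not the absolute price: injecting everything makes every query pay for the whole memory file.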

Hybrid Search – Three Stages

Stage 1 – BM25 (Keyword Matching)

Classic IR: term frequency, inverse document frequency, document‑length normalisation.

Score = Σ IDF(term) × (TF(term,doc) × (k₁ + 1)) / (TF(term,doc) + k₁ × (1 − b + b × |doc| / avgdl))
  • Fast, deterministic, great for exact matches.
  • Example: searching “SwiftUI navigation” finds docs containing those exact terms.

Limitation: Misses semantic relationships (e.g., “iOS routing” won’t match “SwiftUI navigation”).
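The formula above can be sketched in a few lines. This is an illustrative toy scorer over pre‑tokenised documents, not QMD's actual implementation:

```javascript
// Toy BM25 scorer matching the formula above (illustrative, not QMD's code).
const k1 = 1.2, b = 0.75; // standard BM25 parameters

function bm25Score(queryTerms, doc, docs) {
  const avgdl = docs.reduce((sum, d) => sum + d.length, 0) / docs.length;
  const N = docs.length;
  let score = 0;
  for (const term of queryTerms) {
    const tf = doc.filter((w) => w === term).length; // term frequency in this doc
    if (tf === 0) continue;
    const df = docs.filter((d) => d.includes(term)).length; // docs containing term
    const idf = Math.log((N - df + 0.5) / (df + 0.5) + 1);  // inverse document frequency
    score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * doc.length) / avgdl));
  }
  return score;
}

const docs = [
  ["swiftui", "navigation", "stack"],
  ["postgres", "pooling"],
];
console.log(bm25Score(["swiftui", "navigation"], docs[0], docs) > 0); // true
```

Note how the same scorer gives “ios routing” a score of exactly zero against the first document, which is precisely the limitation described above.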

Stage 2 – Vector Search (Semantic Matching)

  • Uses Jina v3 embeddings (local, ~1 GB GGUF model).
  • Text → 1024‑dimensional vector; similar meanings cluster together.
  • “iOS routing” lands near “SwiftUI navigation.”

The embedding model downloads automatically on first run. No API keys, no cloud calls. Your notes never leave your machine.
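Under the hood, “closeness” between embeddings is typically measured with cosine similarity. A minimal sketch with made‑up 3‑dimensional vectors (real Jina v3 embeddings have 1024 dimensions, and the numbers below are invented for illustration):

```javascript
// Cosine similarity: the standard closeness measure for embedding vectors.
function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy embeddings: semantically related texts should land near each other.
const swiftuiNav = [0.9, 0.1, 0.2];
const iosRouting = [0.8, 0.2, 0.3];
const lunchPrefs = [0.1, 0.9, 0.1];
console.log(cosine(swiftuiNav, iosRouting) > cosine(swiftuiNav, lunchPrefs)); // true
```

This is why “iOS routing” can retrieve a “SwiftUI navigation” note despite sharing no keywords.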

Stage 3 – LLM Reranking (Precision Boost)

After BM25 and vector search return candidates, a local LLM reranks them by actual relevance to your query.

Query: "Ray's SwiftUI style"
├── BM25 candidates (10)
├── Vector candidates (10)
└── LLM reranker → Top 6 results (≈700 chars each)
  • Catches cases where both keyword and semantic matches miss the point.
  • Example: a snippet about Ray’s code‑review preferences beats a snippet that merely mentions SwiftUI in passing.
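Before the reranker sees anything, the two candidate lists have to be merged. One common way to do that is reciprocal rank fusion (RRF); the source doesn't specify QMD's fusion logic, so treat this as an illustrative pattern rather than QMD's actual code:

```javascript
// Reciprocal rank fusion: merge two ranked lists whose raw scores aren't
// comparable (BM25 scores vs. cosine similarities). Illustrative pattern only.
function rrfMerge(bm25Ids, vectorIds, k = 60) {
  const scores = new Map();
  for (const [rank, id] of bm25Ids.entries()) {
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
  }
  for (const [rank, id] of vectorIds.entries()) {
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
  }
  // Sort by fused score, descending; duplicates collapse to one entry.
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}

// A doc found by both strategies outranks docs found by only one:
console.log(rrfMerge(["a", "b"], ["b", "c"])[0]); // "b"
```

The fused list then goes to the LLM reranker, which picks the final top 6.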

What interviewers are actually testing: Hybrid search is the 2026 standard for production RAG. Pure vector search has recall problems; pure BM25 has semantic problems. The combination + reranking is how you build retrieval that actually works.

Why Run Locally?

  • Cost: Vector‑search APIs charge per query. QMD is free after the initial model download.
  • Privacy: Agent memory contains sensitive context (project names, credentials, personal preferences). Keeping it local means keeping it private.
  • Latency: Network round‑trips add 50‑200 ms per query. Local inference is faster, especially when you run multiple retrievals per turn.

Trade‑off: Compute. You need a machine with enough RAM to load the models (~4 GB recommended). Cloud instances work, but you pay for compute instead of API calls.

What interviewers are actually testing: The build‑vs‑buy decision for ML infrastructure. Local models trade API costs for compute costs. The break‑even point depends on query volume, latency requirements, and privacy constraints. Know your numbers.
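Knowing your numbers can be as simple as a back‑of‑envelope break‑even calculation. Both prices below are assumptions for illustration, not real vendor rates:

```javascript
// Break-even for local retrieval vs. a hosted vector-search API.
// ASSUMPTIONS: both figures are invented placeholders, not real pricing.
const apiCostPerQuery = 0.0005;  // assumed hosted cost per query
const localMonthlyCompute = 15;  // assumed extra compute/power cost per month

function breakEvenQueriesPerMonth() {
  return localMonthlyCompute / apiCostPerQuery;
}

console.log(breakEvenQueriesPerMonth()); // roughly 30,000 queries/month
```

Above that query volume, local wins on cost alone; privacy and latency can tip the decision much earlier.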

QMD Architecture

  • Rust CLI – fast, single binary, cross‑platform.
  • GGUF models – quantised for local inference (~1 GB total).
  • SQLite indexes – BM25 + metadata stored locally.
  • Jina v3 embeddings – 1024‑dim vectors, multilingual.

On a Mac Mini M2, embedding 1,000 markdown files takes ~30 s. Queries return in ≈ 100 ms.

What interviewers are actually testing: System integration patterns. How do you replace a component (memory backend) without breaking the rest of the system? The answer involves clean interfaces, configuration‑driven switching, and graceful degradation if the new backend fails.

QMD’s Model Context Protocol (MCP) Server

QMD exposes an MCP server so agents can query memory programmatically (e.g., via HTTP/JSON). This enables:

  • Language‑agnostic access from any agent implementation.
  • Fine‑grained control over retrieval parameters (max results, char limits, filters).
  • Graceful fallback to the built‑in memory if the MCP server is unavailable.
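The graceful‑fallback behaviour can be sketched as a small wrapper. Here `qmdQuery` and `readMemoryFile` are hypothetical stand‑ins passed in as functions, not QMD's actual API:

```javascript
// Graceful-degradation pattern for a swappable memory backend: try the MCP
// server first, fall back to injecting the raw memory file if it's down.
// NOTE: qmdQuery and readMemoryFile are hypothetical stand-ins, not QMD's API.
async function retrieveMemory(query, qmdQuery, readMemoryFile) {
  try {
    return await qmdQuery(query); // relevant snippets only
  } catch (err) {
    console.warn("QMD unavailable, falling back to full memory:", err.message);
    return await readMemoryFile(); // degraded but still functional
  }
}

// Usage with a backend that is down:
retrieveMemory(
  "db pooling",
  async () => { throw new Error("connection refused"); },
  async () => "full MEMORY.md contents",
).then((result) => console.log(result)); // "full MEMORY.md contents"
```

The key design point: the caller never needs to know which backend answered.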

TL;DR

  • Default “inject‑everything” memory doesn’t scale.
  • Hybrid search (BM25 + vector + LLM rerank) gives cheap, private, low‑latency retrieval.
  • QMD provides this locally, integrates with OpenClaw via a simple config, and lets you keep your agent’s memory fast, relevant, and secure.

Enabling Self‑Healing Memory Workflows

Example: a compaction skill that prunes outdated entries

// Memory compaction skill
const staleEntries = await qmd.query({
  collection: "agent-logs",
  filter: { olderThan: "30d", accessCount: 0 }
});

for (const entry of staleEntries) {
  if (await confirmDeletion(entry)) {
    await qmd.delete(entry.id);
  }
}

await qmd.reindex();

MCP Interface

  • query: Hybrid search with filters.
  • add: Insert new memory entries.
  • update: Modify existing entries.
  • delete: Remove stale content.
  • reindex: Rebuild embeddings after bulk changes.

This turns memory from a passive store into an active system. Agents can curate their own context, pruning irrelevant entries and promoting useful ones.

Pattern tip: Run a nightly job that analyzes query patterns, identifies entries that never get retrieved, and archives them. Memory stays lean without manual curation.
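The selection step of that nightly job might look like the sketch below. The entry fields (`accessCount`, `createdAt`) are hypothetical metadata for illustration, not a documented QMD schema:

```javascript
// Nightly-archival selection: entries never retrieved and older than a cutoff
// get archived. Field names are hypothetical, not a documented QMD schema.
function selectForArchive(entries, now, maxIdleDays = 30) {
  const cutoff = now - maxIdleDays * 24 * 60 * 60 * 1000;
  return entries.filter((e) => e.accessCount === 0 && e.createdAt < cutoff);
}

const now = Date.parse("2026-02-22");
const entries = [
  { id: "old-unused", accessCount: 0, createdAt: Date.parse("2025-12-01") },
  { id: "old-popular", accessCount: 12, createdAt: Date.parse("2025-12-01") },
  { id: "new-unused", accessCount: 0, createdAt: Date.parse("2026-02-20") },
];
console.log(selectForArchive(entries, now).map((e) => e.id)); // ["old-unused"]
```

Archiving (rather than deleting) keeps the observe → analyze → act → verify loop reversible if the analysis is wrong.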

What Interviewers Are Actually Testing

“Can you design systems that maintain themselves?”
Self‑healing infrastructure is a senior‑engineer concern. The specific technique (memory compaction) matters less than the pattern: observe → analyze → act → verify.

Benchmarking QMD vs. Default Memory

Environment

  • OpenClaw v2026.2.0+
  • Bun or Node 22+
  • 4 GB RAM, ~2 GB disk for models

Install QMD

bun install -g https://github.com/tobi/qmd

Verify Installation

qmd --version
# Expected: qmd 0.4.2 or higher

Index Your Existing Memory

qmd collection add ~/.openclaw/agents/main/memory --name test-memory

Build Embeddings (first run takes 30‑60 s)

qmd embed --collection test-memory

Time a Query

time qmd query "database connection pooling" --collection test-memory

Compare Token Counts

echo "QMD returns ~700 chars × 6 results = 4,200 chars max"
echo "Full MEMORY.md injection = $(wc -c < MEMORY.md) chars"

Cost tip: If your MEMORY.md exceeds 2,000 tokens and you’re paying per‑token context injection, QMD pays for itself within a week.

Sources

  • How to Fix OpenClaw’s Memory Search with QMD – José Casanova
  • OpenClaw Memory Documentation
  • QMD Skill – OpenClaw Skills Playbook
  • Tobi Lütke on QMD Integration