OpenClaw QMD: Local Hybrid Search for 10x Smarter Memory

Published: February 22, 2026 at 06:55 AM EST
6 min read
Source: Dev.to

Why Default Memory Fails at Scale

OpenClaw’s built‑in memory is simple:

  1. Append to MEMORY.md.
  2. Inject the whole file into every prompt.

It works fine at ~500 tokens, but it falls apart at ~5,000.

The problems compound

  • Token explosion: Every message pays the full context tax. A 10‑token query drags along ~4,000 tokens of memory; your $0.01 API call becomes $0.15.
  • Relevance collapse: The model sees everything and prioritises nothing. A question about “database connection pooling” weighs your lunch preferences equally.
  • No semantic understanding: Keyword matching alone misses synonyms. “DB connection” won’t find notes about “PostgreSQL pooling” unless you used those exact words.
  • Cloud dependency: Vector search usually means Pinecone, Weaviate, or another hosted service. Your private notes now live on someone else’s servers.

QMD solves all four

QMD indexes your markdown files locally, runs hybrid retrieval that combines three search strategies, and returns only the relevant snippets.

  • 700 characters max per result (six results by default).
  • Your 10,000‑token memory footprint becomes ≈ 200 tokens of gold.

What interviewers are actually testing: Can you explain the token economics of context injection?
Insight: Context length is O(n) cost, but relevance is what matters. Retrieval‑augmented generation (RAG) exists because “just include everything” doesn’t scale.
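The token economics above can be sketched with simple arithmetic. The per‑token price below is an assumed placeholder for illustration, not a quoted rate from any provider:

```javascript
// Cost of context injection grows linearly with injected memory.
// ASSUMPTION: $3 per 1M input tokens is an illustrative price, not a real rate.
const PRICE_PER_INPUT_TOKEN = 3 / 1_000_000;

function promptCost(queryTokens, memoryTokens) {
  return (queryTokens + memoryTokens) * PRICE_PER_INPUT_TOKEN;
}

// A 10-token query alone vs. the same query dragging 4,000 tokens of memory:
const bare = promptCost(10, 0);
const loaded = promptCost(10, 4000);
console.log(loaded / bare); // memory inflates the cost ~400x
```

The multiplier is what matters, not the absolute price: injecting everything makes every query pay for the whole memory file.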

Hybrid Search – Three Stages

Stage 1 – BM25 (Keyword Matching)

Classic IR: term frequency, inverse document frequency, document‑length normalisation.

Score = Σ IDF(term) × (TF(term,doc) × (k₁ + 1)) / (TF(term,doc) + k₁ × (1 − b + b × |doc| / avgdl))
  • Fast, deterministic, great for exact matches.
  • Example: searching “SwiftUI navigation” finds docs containing those exact terms.

Limitation: Misses semantic relationships (e.g., “iOS routing” won’t match “SwiftUI navigation”).
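The formula above can be sketched in a few lines. This is an illustrative toy scorer over pre‑tokenised documents, not QMD's actual implementation:

```javascript
// Toy BM25 scorer matching the formula above (illustrative, not QMD's code).
const k1 = 1.2, b = 0.75; // standard BM25 parameters

function bm25Score(queryTerms, doc, docs) {
  const avgdl = docs.reduce((sum, d) => sum + d.length, 0) / docs.length;
  const N = docs.length;
  let score = 0;
  for (const term of queryTerms) {
    const tf = doc.filter((w) => w === term).length; // term frequency in this doc
    if (tf === 0) continue;
    const df = docs.filter((d) => d.includes(term)).length; // docs containing term
    const idf = Math.log((N - df + 0.5) / (df + 0.5) + 1);  // inverse document frequency
    score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * doc.length) / avgdl));
  }
  return score;
}

const docs = [
  ["swiftui", "navigation", "stack"],
  ["postgres", "pooling"],
];
console.log(bm25Score(["swiftui", "navigation"], docs[0], docs) > 0); // true
```

Note how the same scorer gives “ios routing” a score of exactly zero against the first document, which is precisely the limitation described above.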

Stage 2 – Vector Search (Semantic Matching)

  • Uses Jina v3 embeddings (local, ~1 GB GGUF model).
  • Text → 1024‑dimensional vector; similar meanings cluster together.
  • “iOS routing” lands near “SwiftUI navigation.”

The embedding model downloads automatically on first run. No API keys, no cloud calls. Your notes never leave your machine.
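Under the hood, “closeness” between embeddings is typically measured with cosine similarity. A minimal sketch with made‑up 3‑dimensional vectors (real Jina v3 embeddings have 1024 dimensions, and the numbers below are invented for illustration):

```javascript
// Cosine similarity: the standard closeness measure for embedding vectors.
function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy embeddings: semantically related texts should land near each other.
const swiftuiNav = [0.9, 0.1, 0.2];
const iosRouting = [0.8, 0.2, 0.3];
const lunchPrefs = [0.1, 0.9, 0.1];
console.log(cosine(swiftuiNav, iosRouting) > cosine(swiftuiNav, lunchPrefs)); // true
```

This is why “iOS routing” can retrieve a “SwiftUI navigation” note despite sharing no keywords.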

Stage 3 – LLM Reranking (Precision Boost)

After BM25 and vector search return candidates, a local LLM reranks them by actual relevance to your query.

Query: "Ray's SwiftUI style"
├── BM25 candidates (10)
├── Vector candidates (10)
└── LLM reranker → Top 6 results (≈700 chars each)
  • Catches cases where both keyword and semantic matches miss the point.
  • Example: a snippet about Ray’s code‑review preferences beats a snippet that merely mentions SwiftUI in passing.
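Before the reranker sees anything, the two candidate lists have to be merged. One common way to do that is reciprocal rank fusion (RRF); the source doesn't specify QMD's fusion logic, so treat this as an illustrative pattern rather than QMD's actual code:

```javascript
// Reciprocal rank fusion: merge two ranked lists whose raw scores aren't
// comparable (BM25 scores vs. cosine similarities). Illustrative pattern only.
function rrfMerge(bm25Ids, vectorIds, k = 60) {
  const scores = new Map();
  for (const [rank, id] of bm25Ids.entries()) {
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
  }
  for (const [rank, id] of vectorIds.entries()) {
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
  }
  // Sort by fused score, descending; duplicates collapse to one entry.
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}

// A doc found by both strategies outranks docs found by only one:
console.log(rrfMerge(["a", "b"], ["b", "c"])[0]); // "b"
```

The fused list then goes to the LLM reranker, which picks the final top 6.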

What interviewers are actually testing: Hybrid search is the 2026 standard for production RAG. Pure vector search has recall problems; pure BM25 has semantic problems. The combination + reranking is how you build retrieval that actually works.

Why Run Locally?

  • Cost: Vector‑search APIs charge per query. QMD is free after the initial model download.
  • Privacy: Agent memory contains sensitive context (project names, credentials, personal preferences). Keeping it local means keeping it private.
  • Latency: Network round‑trips add 50‑200 ms per query. Local inference is faster, especially when you run multiple retrievals per turn.

Trade‑off: Compute. You need a machine with enough RAM to load the models (~4 GB recommended). Cloud instances work, but you pay for compute instead of API calls.

What interviewers are actually testing: The build‑vs‑buy decision for ML infrastructure. Local models trade API costs for compute costs. The break‑even point depends on query volume, latency requirements, and privacy constraints. Know your numbers.
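Knowing your numbers can be as simple as a back‑of‑envelope break‑even calculation. Both prices below are assumptions for illustration, not real vendor rates:

```javascript
// Break-even for local retrieval vs. a hosted vector-search API.
// ASSUMPTIONS: both figures are invented placeholders, not real pricing.
const apiCostPerQuery = 0.0005;  // assumed hosted cost per query
const localMonthlyCompute = 15;  // assumed extra compute/power cost per month

function breakEvenQueriesPerMonth() {
  return localMonthlyCompute / apiCostPerQuery;
}

console.log(breakEvenQueriesPerMonth()); // roughly 30,000 queries/month
```

Above that query volume, local wins on cost alone; privacy and latency can tip the decision much earlier.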

QMD Architecture

  • Rust CLI – fast, single binary, cross‑platform.
  • GGUF models – quantised for local inference (~1 GB total).
  • SQLite indexes – BM25 + metadata stored locally.
  • Jina v3 embeddings – 1024‑dim vectors, multilingual.

On a Mac Mini M2, embedding 1,000 markdown files takes ~30 s. Queries return in ≈ 100 ms.

What interviewers are actually testing: System integration patterns. How do you replace a component (memory backend) without breaking the rest of the system? The answer involves clean interfaces, configuration‑driven switching, and graceful degradation if the new backend fails.

QMD’s Model Context Protocol (MCP) Server

QMD exposes an MCP server so agents can query memory programmatically (e.g., via HTTP/JSON). This enables:

  • Language‑agnostic access from any agent implementation.
  • Fine‑grained control over retrieval parameters (max results, char limits, filters).
  • Graceful fallback to the built‑in memory if the MCP server is unavailable.
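The graceful‑fallback behaviour can be sketched as a small wrapper. Here `qmdQuery` and `readMemoryFile` are hypothetical stand‑ins passed in as functions, not QMD's actual API:

```javascript
// Graceful-degradation pattern for a swappable memory backend: try the MCP
// server first, fall back to injecting the raw memory file if it's down.
// NOTE: qmdQuery and readMemoryFile are hypothetical stand-ins, not QMD's API.
async function retrieveMemory(query, qmdQuery, readMemoryFile) {
  try {
    return await qmdQuery(query); // relevant snippets only
  } catch (err) {
    console.warn("QMD unavailable, falling back to full memory:", err.message);
    return await readMemoryFile(); // degraded but still functional
  }
}

// Usage with a backend that is down:
retrieveMemory(
  "db pooling",
  async () => { throw new Error("connection refused"); },
  async () => "full MEMORY.md contents",
).then((result) => console.log(result)); // "full MEMORY.md contents"
```

The key design point: the caller never needs to know which backend answered.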

TL;DR

  • Default “inject‑everything” memory doesn’t scale.
  • Hybrid search (BM25 + vector + LLM rerank) gives cheap, private, low‑latency retrieval.
  • QMD provides this locally, integrates with OpenClaw via a simple config, and lets you keep your agent’s memory fast, relevant, and secure.

Enabling Self‑Healing Memory Workflows

Example: a compaction skill that prunes outdated entries

// Memory compaction skill
const staleEntries = await qmd.query({
  collection: "agent-logs",
  filter: { olderThan: "30d", accessCount: 0 }
});

for (const entry of staleEntries) {
  if (await confirmDeletion(entry)) {
    await qmd.delete(entry.id);
  }
}

await qmd.reindex();

MCP Interface

  • query: Hybrid search with filters.
  • add: Insert new memory entries.
  • update: Modify existing entries.
  • delete: Remove stale content.
  • reindex: Rebuild embeddings after bulk changes.

This turns memory from a passive store into an active system. Agents can curate their own context, pruning irrelevant entries and promoting useful ones.

Pattern tip: Run a nightly job that analyzes query patterns, identifies entries that never get retrieved, and archives them. Memory stays lean without manual curation.
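The selection step of that nightly job might look like the sketch below. The entry fields (`accessCount`, `createdAt`) are hypothetical metadata for illustration, not a documented QMD schema:

```javascript
// Nightly-archival selection: entries never retrieved and older than a cutoff
// get archived. Field names are hypothetical, not a documented QMD schema.
function selectForArchive(entries, now, maxIdleDays = 30) {
  const cutoff = now - maxIdleDays * 24 * 60 * 60 * 1000;
  return entries.filter((e) => e.accessCount === 0 && e.createdAt < cutoff);
}

const now = Date.parse("2026-02-22");
const entries = [
  { id: "old-unused", accessCount: 0, createdAt: Date.parse("2025-12-01") },
  { id: "old-popular", accessCount: 12, createdAt: Date.parse("2025-12-01") },
  { id: "new-unused", accessCount: 0, createdAt: Date.parse("2026-02-20") },
];
console.log(selectForArchive(entries, now).map((e) => e.id)); // ["old-unused"]
```

Archiving (rather than deleting) keeps the observe → analyze → act → verify loop reversible if the analysis is wrong.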

What Interviewers Are Actually Testing

“Can you design systems that maintain themselves?”
Self‑healing infrastructure is a senior‑engineer concern. The specific technique (memory compaction) matters less than the pattern: observe → analyze → act → verify.

Benchmarking QMD vs. Default Memory

Environment

  • OpenClaw v2026.2.0+
  • Bun or Node 22+
  • 4 GB RAM, ~2 GB disk for models

Install QMD

bun install -g https://github.com/tobi/qmd

Verify Installation

qmd --version
# Expected: qmd 0.4.2 or higher

Index Your Existing Memory

qmd collection add ~/.openclaw/agents/main/memory --name test-memory

Build Embeddings (first run takes 30‑60 s)

qmd embed --collection test-memory

Time a Query

time qmd query "database connection pooling" --collection test-memory

Compare Token Counts

echo "QMD returns ~700 chars × 6 results = 4,200 chars max"
echo "Full MEMORY.md injection = $(wc -c < MEMORY.md) chars"

Cost tip: If your MEMORY.md exceeds 2,000 tokens and you’re paying per‑token context injection, QMD pays for itself within a week.

Sources

  • How to Fix OpenClaw’s Memory Search with QMD – José Casanova
  • OpenClaw Memory Documentation
  • QMD Skill – OpenClaw Skills Playbook
  • Tobi Lütke on QMD Integration