Inside MemCortex: A Lightweight Semantic Memory Layer for LLMs
Why Context Matters
An LLM cannot truly store past conversations. Its only memory is the context window, a fixed‑length input buffer (e.g., 128 k tokens in GPT‑4o, 200 k in Claude 3.5 Sonnet, and up to 2 million tokens in Gemini 1.5 Pro). When the conversation exceeds that limit, the orchestrator must perform three critical steps for the next query:
- Decide what information is most important.
- Compress or summarise the history.
- Re‑inject relevant history into the prompt.
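To make the problem concrete, here is a deliberately naive sketch of that orchestration. Everything in it, including the 4‑characters‑per‑token heuristic, is illustrative rather than how MemCortex works:

```python
def build_prompt(history: list[str], query: str, budget: int = 8000) -> str:
    """Naively keep the most recent messages that fit a rough token budget."""
    kept: list[str] = []
    used = 0
    for message in reversed(history):   # walk newest-first
        tokens = len(message) // 4      # crude ~4-chars-per-token heuristic
        if used + tokens > budget:
            break
        kept.append(message)
        used += tokens
    context = "\n".join(reversed(kept))  # restore chronological order
    return f"{context}\n\nUser: {query}"
```

A recency cutoff like this throws away anything old, no matter how important. Replacing it with semantic retrieval is exactly the gap MemCortex fills.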
For developers building custom agents, this crucial orchestration layer does not come out of the box even when integrating APIs provided by these hyperscale AI assistants. You have to build your own, and that necessity is where the idea for MemCortex originated.
What MemCortex Does Differently
The core difference is that MemCortex is a semantic memory layer, not just a simple list of all previous conversations. Instead of pushing raw text history into each request, MemCortex stores vector embeddings of past messages and retrieves only the relevant ones using vector search. This architecture aligns with the industry pattern known as Retrieval‑Augmented Generation (RAG).
MemCortex uses:
- Ollama to run the open‑source `nomic-embed-text` embedding model locally for fast, privacy‑preserving vector generation.
- Weaviate for vector storage and indexing.
All components are packaged into a single Docker container, making MemCortex a portable, customizable memory layer that runs locally, on servers, or in the cloud. With a single exposed /chat endpoint, MemCortex acts as context‑rich middleware for your applications.
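For example, a client call might look like the snippet below. The request and response shapes here are assumptions for illustration; check the MemCortex README for the actual schema:

```python
import requests

# Hypothetical payload shape; consult the MemCortex docs for the real schema.
resp = requests.post(
    "http://localhost:8000/chat",  # assumed host/port for the container
    json={"message": "What did we decide about the deployment pipeline?"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```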
How It Works (High‑Level)

Ingestion
- Take every new message or event.
- Generate an embedding vector using the `nomic-embed-text` model via Ollama.
- Store the original text, its vector, and associated metadata (e.g., timestamps).
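A minimal sketch of this ingestion path, assuming a local Ollama server, the Weaviate v4 Python client, and a pre‑created `Memory` collection (the collection and property names are illustrative, not MemCortex's actual schema):

```python
from datetime import datetime, timezone

import requests
import weaviate

def embed(text: str) -> list[float]:
    """Generate an embedding via the local Ollama server."""
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

client = weaviate.connect_to_local()
memories = client.collections.get("Memory")  # assumed collection name

def ingest(text: str) -> None:
    """Store the raw text, its vector, and a timestamp."""
    memories.data.insert(
        properties={
            "text": text,
            "created_at": datetime.now(timezone.utc).isoformat(),
        },
        vector=embed(text),
    )
```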
Retrieval
- A new user query arrives.
- Embed the query.
- Perform a vector search in Weaviate.
- Fetch the top‑k similar items as “memories”.
- Inject only these relevant memories back into the LLM context.
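Retrieval is the same pipeline in reverse. Continuing the ingestion sketch above (reusing its `embed` function and `memories` collection, with the same illustrative names):

```python
from weaviate.classes.query import MetadataQuery

def recall(query: str, k: int = 5) -> list[str]:
    """Embed the query and fetch the top-k most similar memories."""
    result = memories.query.near_vector(
        near_vector=embed(query),
        limit=k,
        return_metadata=MetadataQuery(distance=True),
    )
    return [obj.properties["text"] for obj in result.objects]

# Inject only these memories into the LLM prompt:
context = "\n".join(recall("What database did we pick?"))
```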
This process mirrors how enterprise AI systems handle long‑term coherence, but MemCortex provides a lightweight, developer‑friendly version.
Why I Built It: Solving the Memory Problem for Agents
When building a sophisticated AI agent, you need three things:
- Long‑Term Recall – remember important facts across sessions.
- Relevance – retrieve only context relevant to the current task.
- Efficiency – avoid feeding the entire conversation into every prompt.
MemCortex addresses these points through specific features:
- Relevance Scoring – configurable vector‑distance scoring and a relevance threshold.
- Max Memory Distance – a tunable environment variable that ensures only high‑similarity memories are returned (see the sketch after this list).
- Persistence – Weaviate stores memories beyond process restarts, essential for real‑world agents.
- Pluggable Backends – easily swap embedding models, vector stores, or add custom ranking logic.
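To illustrate the distance gating, a threshold check might look like this. The `MAX_MEMORY_DISTANCE` variable name is hypothetical (check the MemCortex docs for the real one), and `objects` is a Weaviate result set queried with `MetadataQuery(distance=True)` as in the retrieval sketch above:

```python
import os

# Hypothetical variable name; MemCortex's actual env var may differ.
MAX_DISTANCE = float(os.environ.get("MAX_MEMORY_DISTANCE", "0.35"))

def filter_by_distance(objects) -> list[str]:
    """Keep only memories whose vector distance is within the threshold."""
    return [
        obj.properties["text"]
        for obj in objects
        if obj.metadata.distance is not None
        and obj.metadata.distance <= MAX_DISTANCE
    ]
```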
Where MemCortex Fits Today
MemCortex is a proof‑of‑concept (POC) that can serve as a scaffold for production systems. It is a powerful foundation for:
- AI agents
- Customer‑support bots
- Workflow assistants
- Knowledge‑augmented chat systems
- Memory‑RAG prototypes
It is designed to be simple, flexible, and intentionally un‑opinionated about the surrounding application logic.
Limitations
While a powerful scaffold, MemCortex has constraints as a standalone component:
- Scalability and speed depend entirely on your chosen storage/indexing solution.
- Accuracy and relevance depend on the quality of the embeddings and retrieval logic.
- Persistence, backups, and security are the responsibility of the developer integrating the container.
- Cost scales with storage, embeddings, and retrieval frequency.
- It does not inherently reason, summarise, or prioritise beyond the retrieval logic you implement.
Future Enhancements
Potential next steps for an evolving system include:
- Temporal scoring (recency decay; see the sketch after this list)
- Memory summarisation
- Topic clustering (for more efficient retrieval)
- Multi‑vector per memory
- Event‑driven memory (“only save meaningful messages”)
- Emotional/contextual tagging
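As a flavour of the first item, temporal scoring can be as simple as multiplying the similarity score by an exponential recency decay. This is a hypothetical formulation, not something MemCortex ships today:

```python
import math
import time

def decayed_score(similarity: float, created_at: float,
                  half_life_s: float = 7 * 24 * 3600) -> float:
    """Down-weight older memories: the score halves every `half_life_s` seconds."""
    age = time.time() - created_at  # created_at as a Unix timestamp
    return similarity * math.exp(-math.log(2) * age / half_life_s)
```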
Existing open‑source projects like LangMem provide tooling to extract important information from conversations, optimise agent behaviour through prompt refinement, and maintain long‑term memory.
Conclusion
MemCortex is a small but critical step toward giving your AI‑powered applications the persistent, semantic memory they need to move from short‑term chat partners to capable long‑term agents. As AI agents grow more capable, systems like this will bridge the gap between short‑term context and true long‑term reasoning. For those interested in extending, optimising, or integrating with the system, the source code is available on GitHub.