Inside MemCortex: A Lightweight Semantic Memory Layer for LLMs
Why Context Matters
An LLM cannot truly store past conversations. Its only memory is the context window, a fixed‑length input buffer (e.g., 128 k tokens in GPT‑4o, 200 k in Claude 3.5 Sonnet, and up to 2 million tokens in Gemini 1.5 Pro). When the conversation exceeds that limit, the orchestrator must perform three critical steps for the next query:
- Decide what information is most important.
- Compress or summarise the history.
- Re‑inject relevant history into the prompt.
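To make the problem concrete, here is a deliberately naive sketch of that orchestration. Everything in it, including the 4‑characters‑per‑token heuristic, is illustrative rather than how MemCortex works:

```python
def build_prompt(history: list[str], query: str, budget: int = 8000) -> str:
    """Naively keep the most recent messages that fit a rough token budget."""
    kept: list[str] = []
    used = 0
    for message in reversed(history):   # walk newest-first
        tokens = len(message) // 4      # crude ~4-chars-per-token heuristic
        if used + tokens > budget:
            break
        kept.append(message)
        used += tokens
    context = "\n".join(reversed(kept))  # restore chronological order
    return f"{context}\n\nUser: {query}"
```

A recency cutoff like this throws away anything old, no matter how important. Replacing it with semantic retrieval is exactly the gap MemCortex fills.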
For developers building custom agents, this crucial orchestration layer does not come out of the box even when integrating APIs provided by these hyperscale AI assistants. You have to build your own, and that necessity is where the idea for MemCortex originated.
What MemCortex Does Differently
The core difference is that MemCortex is a semantic memory layer, not just a simple list of all previous conversations. Instead of pushing raw text history into each request, MemCortex stores vector embeddings of past messages and retrieves only the relevant ones using vector search. This architecture aligns with the industry pattern known as Retrieval‑Augmented Generation (RAG).
MemCortex uses:
- Ollama to run the open‑source `nomic-embed-text` embedding model locally for fast, privacy‑preserving vector generation.
- Weaviate for vector storage and indexing.
All components are packaged into a single Docker container, making MemCortex a portable, customizable memory layer that runs locally, on servers, or in the cloud. With a single exposed /chat endpoint, MemCortex acts as context‑rich middleware for your applications.
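For example, a client call might look like the snippet below. The request and response shapes here are assumptions for illustration; check the MemCortex README for the actual schema:

```python
import requests

# Hypothetical payload shape; consult the MemCortex docs for the real schema.
resp = requests.post(
    "http://localhost:8000/chat",  # assumed host/port for the container
    json={"message": "What did we decide about the deployment pipeline?"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```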
How It Works (High‑Level)

Ingestion
- Take every new message or event.
- Generate an embedding vector using the `nomic-embed-text` model via Ollama.
- Store the original text, its vector, and associated metadata (e.g., timestamps).
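A minimal sketch of this ingestion path, assuming a local Ollama server, the Weaviate v4 Python client, and a pre‑created `Memory` collection (the collection and property names are illustrative, not MemCortex's actual schema):

```python
from datetime import datetime, timezone

import requests
import weaviate

def embed(text: str) -> list[float]:
    """Generate an embedding via the local Ollama server."""
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

client = weaviate.connect_to_local()
memories = client.collections.get("Memory")  # assumed collection name

def ingest(text: str) -> None:
    """Store the raw text, its vector, and a timestamp."""
    memories.data.insert(
        properties={
            "text": text,
            "created_at": datetime.now(timezone.utc).isoformat(),
        },
        vector=embed(text),
    )
```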
Retrieval
- A new user query arrives.
- Embed the query.
- Perform a vector search in Weaviate.
- Fetch the top‑k similar items as “memories”.
- Inject only these relevant memories back into the LLM context.
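Retrieval is the same pipeline in reverse. Continuing the ingestion sketch above (reusing its `embed` function and `memories` collection, with the same illustrative names):

```python
from weaviate.classes.query import MetadataQuery

def recall(query: str, k: int = 5) -> list[str]:
    """Embed the query and fetch the top-k most similar memories."""
    result = memories.query.near_vector(
        near_vector=embed(query),
        limit=k,
        return_metadata=MetadataQuery(distance=True),
    )
    return [obj.properties["text"] for obj in result.objects]

# Inject only these memories into the LLM prompt:
context = "\n".join(recall("What database did we pick?"))
```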
This process mirrors how enterprise AI systems handle long‑term coherence, but MemCortex provides a lightweight, developer‑friendly version.
Why I Built It: Solving the Memory Problem for Agents
When building a sophisticated AI agent, you need three things:
- Long‑Term Recall – remember important facts across sessions.
- Relevance – retrieve only context relevant to the current task.
- Efficiency – avoid feeding the entire conversation into every prompt.
MemCortex addresses these points through specific features:
- Relevance Scoring – configurable vector‑distance scoring and a relevance threshold.
- Max Memory Distance – a tunable environment variable that ensures only high‑similarity memories are returned (see the sketch after this list).
- Persistence – Weaviate stores memories beyond process restarts, essential for real‑world agents.
- Pluggable Backends – easily swap embedding models, vector stores, or add custom ranking logic.
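To illustrate the distance gating, a threshold check might look like this. The `MAX_MEMORY_DISTANCE` variable name is hypothetical (check the MemCortex docs for the real one), and `objects` is a Weaviate result set queried with `MetadataQuery(distance=True)` as in the retrieval sketch above:

```python
import os

# Hypothetical variable name; MemCortex's actual env var may differ.
MAX_DISTANCE = float(os.environ.get("MAX_MEMORY_DISTANCE", "0.35"))

def filter_by_distance(objects) -> list[str]:
    """Keep only memories whose vector distance is within the threshold."""
    return [
        obj.properties["text"]
        for obj in objects
        if obj.metadata.distance is not None
        and obj.metadata.distance <= MAX_DISTANCE
    ]
```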
Where MemCortex Fits Today
MemCortex is a proof‑of‑concept (POC) that can serve as a scaffold for production systems. It is a powerful foundation for:
- AI agents
- Customer‑support bots
- Workflow assistants
- Knowledge‑augmented chat systems
- Memory‑RAG prototypes
It is designed to be simple, flexible, and intentionally un‑opinionated about the surrounding application logic.
Limitations
While a powerful scaffold, MemCortex has constraints as a standalone component:
- Scalability and speed depend entirely on your chosen storage/indexing solution.
- Accuracy and relevance depend on the quality of the embeddings and retrieval logic.
- Persistence, backups, and security are the responsibility of the developer integrating the container.
- Cost scales with storage, embeddings, and retrieval frequency.
- It does not inherently reason, summarise, or prioritise beyond the retrieval logic you implement.
Future Enhancements
Potential next steps for an evolving system include:
- Temporal scoring (recency decay; see the sketch after this list)
- Memory summarisation
- Topic clustering (for more efficient retrieval)
- Multi‑vector per memory
- Event‑driven memory (“only save meaningful messages”)
- Emotional/contextual tagging
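As a flavour of the first item, temporal scoring can be as simple as multiplying the similarity score by an exponential recency decay. This is a hypothetical formulation, not something MemCortex ships today:

```python
import math
import time

def decayed_score(similarity: float, created_at: float,
                  half_life_s: float = 7 * 24 * 3600) -> float:
    """Down-weight older memories: the score halves every `half_life_s` seconds."""
    age = time.time() - created_at  # created_at as a Unix timestamp
    return similarity * math.exp(-math.log(2) * age / half_life_s)
```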
Existing open‑source projects like LangMem provide tooling to extract important information from conversations, optimise agent behaviour through prompt refinement, and maintain long‑term memory.
Conclusion
MemCortex is a small but critical step toward giving your AI‑powered applications the persistent, semantic memory they need to move from short‑term chat partners to capable long‑term agents. As AI agents grow more capable, systems like this will bridge the gap between short‑term context and true long‑term reasoning. For those interested in extending, optimising, or integrating with the system, the source code is available on GitHub.