Beyond RAG: Building an AI Companion with 'Deep Memory' using Knowledge Graphs
Source: Dev.to
Why Standard RAG Wasn’t Enough
Most AI‑memory systems today rely on vector RAG:
- Chunk the text.
- Convert each chunk to a vector.
- Retrieve the most similar chunks later.
This works great for finding a specific policy in a PDF, but it falls short for modeling human relationships and history.
Vectors capture similarity, not structure.
If my wife says, “I’m feeling overwhelmed today,” a vector search might surface a journal entry from three months ago that also contains the word *overwhelm*.
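You can see this failure mode in a toy sketch, where a bag-of-words counter stands in for a real embedding model (the chunks and helper names here are illustrative, not from the project): the match is purely lexical, with no sense of *why* she felt overwhelmed.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    # Rank chunks by similarity to the query and return the top k.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "Journal entry: I felt overwhelmed by deadlines.",
    "Recipe: how to roast mushrooms.",
]
# Surfaces the months-old journal entry purely on word overlap.
print(retrieve("I'm feeling overwhelmed today", chunks))
```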
A Knowledge Graph, on the other hand, can represent the story:
"Project A" → CAUSED → "Stress" → RESULTED_IN → "Overwhelm"
I needed the AI to understand causality, not just keyword overlap.
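The difference is easy to demonstrate with the story above stored as explicit triples that can be *traversed*, not just matched (this is my own minimal sketch, not Graphiti's API):

```python
# The story as (subject, relation, object) triples.
triples = [
    ("Project A", "CAUSED", "Stress"),
    ("Stress", "RESULTED_IN", "Overwhelm"),
]

def trace_causes(effect: str) -> list[str]:
    """Walk the graph backwards from an effect to its root cause."""
    chain = [effect]
    current = effect
    while True:
        parents = [s for s, _, o in triples if o == current]
        if not parents:
            return chain
        current = parents[0]
        chain.append(current)

print(trace_causes("Overwhelm"))  # ['Overwhelm', 'Stress', 'Project A']
```

A vector index has no equivalent of `trace_causes`: similarity alone can't answer "why."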
Architecture Decision: Full‑Context Injection
I’m using Google’s Gemini models, which have massive context windows. Instead of retrieving a handful of small chunks, I can inject the entire compiled profile into the prompt.
Process
- Convert raw chat logs into a structured graph.
- Flatten the graph into a concise “User Manual” (plain‑text description of entities and relations).
- Feed that manual to the model before each interaction.
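The three steps above can be sketched end to end. Everything here is a stand-in: in the real system the graph is built by Gemini via Graphiti, not hardcoded, and the names (`Alice`, `build_graph`, `flatten_to_manual`) are mine.

```python
def build_graph(chat_logs: list[str]) -> dict:
    # Stand-in for LLM extraction: the real system has Gemini turn
    # dialogue into entities and relations. Here we fake a tiny graph.
    return {
        "entities": {"Alice": "the user", "Project A": "a stressful work project"},
        "relations": [("Alice", "WORKS_ON", "Project A")],
    }

def flatten_to_manual(graph: dict) -> str:
    # Flatten nodes and edges into a plain-text "User Manual".
    lines = ["# User Manual"]
    for name, desc in graph["entities"].items():
        lines.append(f"- {name}: {desc}")
    for s, rel, o in graph["relations"]:
        lines.append(f"- {s} {rel} {o}")
    return "\n".join(lines)

def compile_profile(chat_logs: list[str]) -> str:
    """Logs -> graph -> plain-text manual fed to the model."""
    return flatten_to_manual(build_graph(chat_logs))

print(compile_profile(["..."]))
```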
With Graphiti (an open-source graph-indexing framework) handling the indexing, the compiled output shrank from roughly 35 k tokens to roughly 14 k tokens, far smaller than the original master prompt.
Introducing Synapse: The Architecture
The project is split into a few logical layers:
| Layer | Tech Stack | Role |
|---|---|---|
| Body (Frontend) | React 19 + Convex | Real‑time UI and chat handling |
| Brain (Backend) | Python + FastAPI | Heavy data processing, graph management |
| Memory Engine | Graphiti + Neo4j | Knowledge‑graph storage & retrieval |
| Models | Gemini 3 Flash (graph building), Gemini 2.5 Flash (chat) | Cost‑effective, high‑throughput inference |
High‑Level View

How It Works: The “Deep Memory” Pipeline
The system runs in three distinct phases.
Phase A – Conversation (The Chat)
- The user talks to Gemini 2.5 Flash – fast, fluid responses.
- Before the first user message, the system prompt is hydrated with a text summary of the entire Knowledge Graph.
- The model instantly knows who the user is, what they’re worried about, and who their friends are.
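Hydration itself is just prompt assembly; a minimal sketch (function and variable names are mine, not the project's):

```python
def hydrate_system_prompt(base_prompt: str, graph_summary: str) -> list[dict]:
    """Prepend the compiled Knowledge Graph summary to the system prompt
    so the model has full context before the first user message."""
    system = f"{base_prompt}\n\n## What you know about the user\n{graph_summary}"
    return [{"role": "system", "content": system}]

messages = hydrate_system_prompt(
    "You are a caring companion.",
    "- Project A CAUSED Stress\n- Stress RESULTED_IN Overwhelm",
)
```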
Phase B – Ingestion (The “Sleep” Cycle)
When the conversation ends (3 h of inactivity or a manual “Consolidate” click), the transcript is sent to the Python Cortex where Gemini 3 Flash processes it.
Why Gemini 3?
Extracting entities from messy human dialogue is hard. Gemini 3 can understand nuanced statements and update the graph correctly.
Example:
“I stopped taking medication X and started Y.”
Gemini 3 produces the following logical updates:
- Find node `Medication X`.
- Add relationship `STOPPED`.
- Create node `Medication Y`.
- Add relationship `STARTED`.
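Applied to a simple in-memory graph, those updates look roughly like this. This is only a sketch of the idea: Graphiti manages the real temporal edges in Neo4j, and the field names here are my own.

```python
from datetime import datetime, timezone

nodes: set[str] = {"Medication X"}
edges: list[dict] = [
    {"s": "User", "rel": "TAKES", "o": "Medication X", "invalid_at": None},
]

def apply_medication_switch(old: str, new: str) -> None:
    """Invalidate the old TAKES edge and record the new one."""
    now = datetime.now(timezone.utc)
    for e in edges:
        if e["rel"] == "TAKES" and e["o"] == old and e["invalid_at"] is None:
            e["invalid_at"] = now  # STOPPED: close out the old fact, keep history
    nodes.add(new)  # create node Medication Y
    edges.append({"s": "User", "rel": "TAKES", "o": new, "invalid_at": None})  # STARTED

apply_medication_switch("Medication X", "Medication Y")
```

Note that the old edge is invalidated rather than deleted, so the graph keeps the history of what the user *used to* take.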

Phase C – Hydration (The Awakening)
When the user returns, the next session starts with a new compiled graph summary. The system doesn’t just dump raw triples; it turns nodes and edges into a natural‑language narrative that the model can read instantly.
```python
def _format_compilation(definitions: list[str], relationships: list[str]) -> str:
    """
    Turn a list of node definitions and relationship statements into a
    readable, sectioned prompt for the LLM.
    """
    sections = []
    if definitions:
        sections.append(
            "#### 1. CONCEPTUAL ENTITIES\n" +
            "\n".join(f"- {d}" for d in definitions)
        )
    if relationships:
        sections.append(
            "#### 2. RELATIONSHIPS\n" +
            "\n".join(f"- {r}" for r in relationships)
        )
    # Add any additional formatting or ordering logic here.
    return "\n\n".join(sections)
```
The compiled prompt (≈ 14 k tokens) is then prepended to the chat, giving the model a deep, structured memory of the user’s life.
Takeaways
- Knowledge Graphs capture structure and causality that vectors miss.
- Large‑context models (Gemini) let you inject a whole “user manual” instead of a handful of retrieved chunks.
- A three‑phase pipeline—Chat → Sleep → Hydration—mirrors how humans consolidate memories.
Synapse AI Chat turned a 35 k‑token manual into a 14 k‑token, graph‑driven “continuous brain” that feels personal, context‑aware, and cheap to run.
If you’re interested in the code or want to try it yourself, feel free to open an issue or drop a comment below!
The “Killer Feature”: Memory Explorer
AI memory is usually a “Black Box.” Users don’t trust what they can’t see.
I wanted my wife to be able to audit her own brain, so I built a visualizer using react‑force‑graph. She can see bubbles representing her life: Work, Health, Family.
If she sees a connection that is wrong (e.g., the AI thinks she likes a food she actually hates), she can edit the input and re‑process the graph with new information like “I actually hate mushrooms now.”
The system then processes that new input, updates the graph, creates new nodes/relations, or invalidates the existing ones. This human‑in‑the‑loop approach builds massive trust.
Engineering Challenges
Building this wasn’t just about prompt engineering. There were real system challenges.
1. Handling Latency (The Job Queue)
Graph ingestion is slow – it takes 60 – 200 seconds for Graphiti and Gemini to process a long conversation and update Neo4j. I couldn’t let the UI hang for three minutes.
Solution: Use Convex as a job queue. When the session ends, the UI returns immediately. Convex processes the job in the background, updating the UI state to “Processing…” and then “Memory Updated” when it’s done.
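The queue's lifecycle can be sketched as a small state machine (status names and helpers are my own; the real system stores jobs in Convex tables):

```python
from enum import Enum

class JobStatus(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    DONE = "memory_updated"
    FAILED = "failed"

def end_session(jobs: list[dict], transcript: str) -> dict:
    """Enqueue the ingestion job and return immediately; the UI
    shows 'Processing…' while a worker handles it in the background."""
    job = {"transcript": transcript, "status": JobStatus.PENDING}
    jobs.append(job)
    return job

def worker_step(job: dict, ingest) -> None:
    # Runs out of band; the UI never blocks on this.
    job["status"] = JobStatus.PROCESSING
    try:
        ingest(job["transcript"])  # the 60-200 s Graphiti + Gemini work
        job["status"] = JobStatus.DONE
    except Exception:
        job["status"] = JobStatus.FAILED

jobs: list[dict] = []
job = end_session(jobs, "…chat transcript…")
worker_step(job, ingest=lambda transcript: None)
```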
2. Handling Flakiness (The Retry Logic)
The Gemini API is powerful, but it occasionally throws 503 Service Unavailable errors, especially during heavy graph‑processing tasks.
Solution: Implement an event‑driven retry system with exponential back‑off.
```typescript
// retry delays (ms)
export const RETRY_DELAYS_MS = [
  0,            // Attempt 1: Immediate
  2 * 60_000,   // Attempt 2: +2 min (let the API cool down)
  10 * 60_000,  // Attempt 3: +10 min
  30 * 60_000,  // Attempt 4: +30 min
];

export const processJob = internalAction({
  args: { jobId: v.id("cortex_jobs") },
  handler: async (ctx, args) => {
    const job = await ctx.runQuery(internal.cortexJobs.get, { id: args.jobId });
    try {
      // 1️⃣ Heavy lifting (call Gemini 3 Flash)
      // This is where 503 errors usually happen
      await ingestGraphData(ctx, job.payload);

      // 2️⃣ Mark complete if successful
      await ctx.runMutation(internal.cortexJobs.complete, { jobId: args.jobId });
    } catch (error) {
      const nextAttempt = job.attempts + 1;
      if (nextAttempt >= job.maxAttempts) {
        // Stop after too many tries
        await ctx.runMutation(internal.cortexJobs.fail, {
          jobId: args.jobId,
          error: String(error),
        });
      } else {
        // 3️⃣ Schedule the retry using Convex's scheduler
        const delay = RETRY_DELAYS_MS[nextAttempt] ?? 30 * 60_000;
        await ctx.scheduler.runAfter(
          delay,
          internal.processor.processJob,
          { jobId: args.jobId }
        );
      }
    }
  },
});
```
3. Snappy UX
Convex’s real‑time sync was a lifesaver. I didn’t have to write complex WebSocket code. When the Python backend updates the status of a memory job in the database, the React UI updates instantly.
Token streaming also benefits from having Convex in the middle, because the backend streams into Convex rather than directly to the browser. If the user’s browser closes or the connection drops, generation keeps running, the answer lands in Convex, and the UI picks the stream back up as soon as it reconnects.
Caveat: Each update counts toward function usage, so streaming updates are throttled to 100 ms intervals to balance responsiveness with database‑write efficiency.
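The throttling pattern itself is simple; here is a sketch in Python (the real implementation lives in the TypeScript backend, and the class name is mine):

```python
import time

class ThrottledWriter:
    """Buffer streamed tokens and flush to the database at most once
    per interval, so each write batches many tokens."""

    def __init__(self, flush, interval_s: float = 0.1):
        self.flush = flush          # e.g. a mutation writing the partial answer
        self.interval_s = interval_s
        self.buffer = ""
        self.last_flush = float("-inf")  # guarantee the first write flushes

    def write(self, token: str) -> None:
        self.buffer += token
        now = time.monotonic()
        if now - self.last_flush >= self.interval_s:
            self.flush(self.buffer)
            self.last_flush = now

    def close(self) -> None:
        # Always flush the final state so no tokens are lost.
        self.flush(self.buffer)

writes: list[str] = []
w = ThrottledWriter(writes.append, interval_s=0.1)
for tok in ["Hel", "lo", ", ", "world"]:
    w.write(tok)
w.close()
```

The key detail is the final flush on `close`: without it, whatever arrived inside the last interval would never reach the database.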
The Result
The difference is night‑and‑day.
| Before | After |
|---|---|
| My wife dreaded starting a new thread because of the “context set‑up” tax. She felt she was constantly repeating herself and had to manually update the master prompt with new data. | She just talks. The system maintains a Deep Memory of about 10 000 tokens (compressed from months of chats) that is injected automatically. |
| Separate threads were isolated; context didn’t carry over. | All threads share the same Cortex. If she mentions a health issue in the “Work” thread (e.g., “My back hurts from sitting”), the “Health” thread knows about it the next time she logs in. |
Conclusion
This project taught me that we are moving from horizontal AI platforms (like ChatGPT, which knows a little about everything) to vertical AI stacks that know everything about you.
ChatGPT and Gemini are already adding user profiles and thread summaries to build this kind of memory. They’re chasing the same goal: a truly personalized experience.
Key takeaway:
- Vectors are great for search.
- Knowledge graphs are essential for understanding.
I keep enjoying building solutions to real problems. With today’s tools, we can build genuinely useful software fast, and in a way users can trust.
The project is live at https://synapse-chat.juandastic.dev/ if you want to see it in action.
The code is open source if you want to dig into the implementation:
- Frontend (Body): https://github.com/juandastic/synapse-chat-ai
- Backend (Cortex): https://github.com/juandastic/synapse-cortex
I’d love to hear your impressions and thoughts. Let’s continue the conversation on X or connect on LinkedIn.
