Beyond RAG: Building an AI Companion with 'Deep Memory' using Knowledge Graphs
Source: Dev.to
Why Standard RAG Wasn’t Enough
Most AI‑memory systems today rely on vector RAG:
- Chunk the text.
- Convert each chunk to a vector.
- Retrieve the most similar chunks later.
This works great for finding a specific policy in a PDF, but it falls short for modeling human relationships and history.
Vectors capture similarity, not structure.
If my wife says, “I’m feeling overwhelmed today,” a vector search might surface a journal entry from three months ago that also contains the word *overwhelm*.
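You can see this failure mode in a toy sketch, where a bag-of-words counter stands in for a real embedding model (the chunks and helper names here are illustrative, not from the project): the match is purely lexical, with no sense of *why* she felt overwhelmed.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    # Rank chunks by similarity to the query and return the top k.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "Journal entry: I felt overwhelmed by deadlines.",
    "Recipe: how to roast mushrooms.",
]
# Surfaces the months-old journal entry purely on word overlap.
print(retrieve("I'm feeling overwhelmed today", chunks))
```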
A Knowledge Graph, on the other hand, can represent the story:
"Project A" → CAUSED → "Stress" → RESULTED_IN → "Overwhelm"
I needed the AI to understand causality, not just keyword overlap.
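The difference is easy to demonstrate with the story above stored as explicit triples that can be *traversed*, not just matched (this is my own minimal sketch, not Graphiti's API):

```python
# The story as (subject, relation, object) triples.
triples = [
    ("Project A", "CAUSED", "Stress"),
    ("Stress", "RESULTED_IN", "Overwhelm"),
]

def trace_causes(effect: str) -> list[str]:
    """Walk the graph backwards from an effect to its root cause."""
    chain = [effect]
    current = effect
    while True:
        parents = [s for s, _, o in triples if o == current]
        if not parents:
            return chain
        current = parents[0]
        chain.append(current)

print(trace_causes("Overwhelm"))  # ['Overwhelm', 'Stress', 'Project A']
```

A vector index has no equivalent of `trace_causes`: similarity alone can't answer "why."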
Architecture Decision: Full‑Context Injection
I’m using Google’s Gemini models, which have massive context windows. Instead of retrieving a handful of small chunks, I can inject the entire compiled profile into the prompt.
Process
- Convert raw chat logs into a structured graph.
- Flatten the graph into a concise “User Manual” (plain‑text description of entities and relations).
- Feed that manual to the model before each interaction.
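The three steps above can be sketched end to end. Everything here is a stand-in: in the real system the graph is built by Gemini via Graphiti, not hardcoded, and the names (`Alice`, `build_graph`, `flatten_to_manual`) are mine.

```python
def build_graph(chat_logs: list[str]) -> dict:
    # Stand-in for LLM extraction: the real system has Gemini turn
    # dialogue into entities and relations. Here we fake a tiny graph.
    return {
        "entities": {"Alice": "the user", "Project A": "a stressful work project"},
        "relations": [("Alice", "WORKS_ON", "Project A")],
    }

def flatten_to_manual(graph: dict) -> str:
    # Flatten nodes and edges into a plain-text "User Manual".
    lines = ["# User Manual"]
    for name, desc in graph["entities"].items():
        lines.append(f"- {name}: {desc}")
    for s, rel, o in graph["relations"]:
        lines.append(f"- {s} {rel} {o}")
    return "\n".join(lines)

def compile_profile(chat_logs: list[str]) -> str:
    """Logs -> graph -> plain-text manual fed to the model."""
    return flatten_to_manual(build_graph(chat_logs))

print(compile_profile(["..."]))
```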
With Graphiti (an open-source graph-indexing framework) handling the indexing, the compiled output shrank from roughly 35 k tokens to roughly 14 k tokens, far smaller than the original master prompt.
Introducing Synapse: The Architecture
The project is split into a few logical layers:
| Layer | Tech Stack | Role |
|---|---|---|
| Body (Frontend) | React 19 + Convex | Real‑time UI and chat handling |
| Brain (Backend) | Python + FastAPI | Heavy data processing, graph management |
| Memory Engine | Graphiti + Neo4j | Knowledge‑graph storage & retrieval |
| Models | Gemini 3 Flash (graph building), Gemini 2.5 Flash (chat) | Cost‑effective, high‑throughput inference |
High‑Level View

How It Works: The “Deep Memory” Pipeline
The system runs in three distinct phases.
Phase A – Conversation (The Chat)
- The user talks to Gemini 2.5 Flash – fast, fluid responses.
- Before the first user message, the system prompt is hydrated with a text summary of the entire Knowledge Graph.
- The model instantly knows who the user is, what they’re worried about, and who their friends are.
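Hydration itself is just prompt assembly; a minimal sketch (function and variable names are mine, not the project's):

```python
def hydrate_system_prompt(base_prompt: str, graph_summary: str) -> list[dict]:
    """Prepend the compiled Knowledge Graph summary to the system prompt
    so the model has full context before the first user message."""
    system = f"{base_prompt}\n\n## What you know about the user\n{graph_summary}"
    return [{"role": "system", "content": system}]

messages = hydrate_system_prompt(
    "You are a caring companion.",
    "- Project A CAUSED Stress\n- Stress RESULTED_IN Overwhelm",
)
```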
Phase B – Ingestion (The “Sleep” Cycle)
When the conversation ends (3 h of inactivity or a manual “Consolidate” click), the transcript is sent to the Python Cortex where Gemini 3 Flash processes it.
Why Gemini 3?
Extracting entities from messy human dialogue is hard. Gemini 3 can understand nuanced statements and update the graph correctly.
Example:
“I stopped taking medication X and started Y.”
Gemini 3 produces the following logical updates:
- Find node `Medication X`.
- Add relationship `STOPPED`.
- Create node `Medication Y`.
- Add relationship `STARTED`.
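Applied to a simple in-memory graph, those updates look roughly like this. This is only a sketch of the idea: Graphiti manages the real temporal edges in Neo4j, and the field names here are my own.

```python
from datetime import datetime, timezone

nodes: set[str] = {"Medication X"}
edges: list[dict] = [
    {"s": "User", "rel": "TAKES", "o": "Medication X", "invalid_at": None},
]

def apply_medication_switch(old: str, new: str) -> None:
    """Invalidate the old TAKES edge and record the new one."""
    now = datetime.now(timezone.utc)
    for e in edges:
        if e["rel"] == "TAKES" and e["o"] == old and e["invalid_at"] is None:
            e["invalid_at"] = now  # STOPPED: close out the old fact, keep history
    nodes.add(new)  # create node Medication Y
    edges.append({"s": "User", "rel": "TAKES", "o": new, "invalid_at": None})  # STARTED

apply_medication_switch("Medication X", "Medication Y")
```

Note that the old edge is invalidated rather than deleted, so the graph keeps the history of what the user *used to* take.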

Phase C – Hydration (The Awakening)
When the user returns, the next session starts with a new compiled graph summary. The system doesn’t just dump raw triples; it turns nodes and edges into a natural‑language narrative that the model can read instantly.
```python
def _format_compilation(definitions: list[str], relationships: list[str]) -> str:
    """
    Turn a list of node definitions and relationship statements into a
    readable, sectioned prompt for the LLM.
    """
    sections = []
    if definitions:
        sections.append(
            "#### 1. CONCEPTUAL ENTITIES\n" +
            "\n".join(f"- {d}" for d in definitions)
        )
    if relationships:
        sections.append(
            "#### 2. RELATIONSHIPS\n" +
            "\n".join(f"- {r}" for r in relationships)
        )
    # Add any additional formatting or ordering logic here.
    return "\n\n".join(sections)
```
The compiled prompt (≈ 14 k tokens) is then prepended to the chat, giving the model a deep, structured memory of the user’s life.
Takeaways
- Knowledge Graphs capture structure and causality that vectors miss.
- Large‑context models (Gemini) let you inject a whole “user manual” instead of a handful of retrieved chunks.
- A three‑phase pipeline—Chat → Sleep → Hydration—mirrors how humans consolidate memories.
Synapse AI Chat turned a 35 k‑token manual into a 14 k‑token, graph‑driven “continuous brain” that feels personal, context‑aware, and cheap to run.
If you’re interested in the code or want to try it yourself, feel free to open an issue or drop a comment below!
The “Killer Feature”: Memory Explorer
AI memory is usually a “Black Box.” Users don’t trust what they can’t see.
I wanted my wife to be able to audit her own brain, so I built a visualizer using react‑force‑graph. She can see bubbles representing her life: Work, Health, Family.
If she sees a connection that is wrong (e.g., the AI thinks she likes a food she actually hates), she can edit the input and re‑process the graph with new information like “I actually hate mushrooms now.”
The system then processes that new input, updates the graph, creates new nodes/relations, or invalidates the existing ones. This human‑in‑the‑loop approach builds massive trust.
Engineering Challenges
Building this wasn’t just about prompt engineering. There were real system challenges.
1. Handling Latency (The Job Queue)
Graph ingestion is slow – it takes 60 – 200 seconds for Graphiti and Gemini to process a long conversation and update Neo4j. I couldn’t let the UI hang for three minutes.
Solution: Use Convex as a job queue. When the session ends, the UI returns immediately. Convex processes the job in the background, updating the UI state to “Processing…” and then “Memory Updated” when it’s done.
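The queue's lifecycle can be sketched as a small state machine (status names and helpers are my own; the real system stores jobs in Convex tables):

```python
from enum import Enum

class JobStatus(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    DONE = "memory_updated"
    FAILED = "failed"

def end_session(jobs: list[dict], transcript: str) -> dict:
    """Enqueue the ingestion job and return immediately; the UI
    shows 'Processing…' while a worker handles it in the background."""
    job = {"transcript": transcript, "status": JobStatus.PENDING}
    jobs.append(job)
    return job

def worker_step(job: dict, ingest) -> None:
    # Runs out of band; the UI never blocks on this.
    job["status"] = JobStatus.PROCESSING
    try:
        ingest(job["transcript"])  # the 60-200 s Graphiti + Gemini work
        job["status"] = JobStatus.DONE
    except Exception:
        job["status"] = JobStatus.FAILED

jobs: list[dict] = []
job = end_session(jobs, "…chat transcript…")
worker_step(job, ingest=lambda transcript: None)
```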
2. Handling Flakiness (The Retry Logic)
The Gemini API is powerful, but it occasionally throws 503 Service Unavailable errors, especially during heavy graph‑processing tasks.
Solution: Implement an event‑driven retry system with exponential back‑off.
```typescript
// retry delays (ms)
export const RETRY_DELAYS_MS = [
  0,            // Attempt 1: Immediate
  2 * 60_000,   // Attempt 2: +2 min (let the API cool down)
  10 * 60_000,  // Attempt 3: +10 min
  30 * 60_000,  // Attempt 4: +30 min
];

export const processJob = internalAction({
  args: { jobId: v.id("cortex_jobs") },
  handler: async (ctx, args) => {
    const job = await ctx.runQuery(internal.cortexJobs.get, { id: args.jobId });
    try {
      // 1️⃣ Heavy lifting (call Gemini 3 Flash)
      // This is where 503 errors usually happen
      await ingestGraphData(ctx, job.payload);

      // 2️⃣ Mark complete if successful
      await ctx.runMutation(internal.cortexJobs.complete, { jobId: args.jobId });
    } catch (error) {
      const nextAttempt = job.attempts + 1;
      if (nextAttempt >= job.maxAttempts) {
        // Stop after too many tries
        await ctx.runMutation(internal.cortexJobs.fail, {
          jobId: args.jobId,
          error: String(error),
        });
      } else {
        // 3️⃣ Schedule the retry using Convex's scheduler
        const delay = RETRY_DELAYS_MS[nextAttempt] ?? 30 * 60_000;
        await ctx.scheduler.runAfter(
          delay,
          internal.processor.processJob,
          { jobId: args.jobId }
        );
      }
    }
  },
});
```
3. Snappy UX
Convex’s real‑time sync was a lifesaver. I didn’t have to write complex WebSocket code. When the Python backend updates the status of a memory job in the database, the React UI updates instantly.
Token streaming also benefits from having Convex in the middle, because the backend streams into Convex rather than directly to the browser. If the user’s browser closes or the connection drops, generation keeps running, the answer lands in Convex, and the UI picks the stream back up as soon as it reconnects.
Caveat: Each update counts toward function usage, so streaming updates are throttled to 100 ms intervals to balance responsiveness with database‑write efficiency.
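The throttling pattern itself is simple; here is a sketch in Python (the real implementation lives in the TypeScript backend, and the class name is mine):

```python
import time

class ThrottledWriter:
    """Buffer streamed tokens and flush to the database at most once
    per interval, so each write batches many tokens."""

    def __init__(self, flush, interval_s: float = 0.1):
        self.flush = flush          # e.g. a mutation writing the partial answer
        self.interval_s = interval_s
        self.buffer = ""
        self.last_flush = float("-inf")  # guarantee the first write flushes

    def write(self, token: str) -> None:
        self.buffer += token
        now = time.monotonic()
        if now - self.last_flush >= self.interval_s:
            self.flush(self.buffer)
            self.last_flush = now

    def close(self) -> None:
        # Always flush the final state so no tokens are lost.
        self.flush(self.buffer)

writes: list[str] = []
w = ThrottledWriter(writes.append, interval_s=0.1)
for tok in ["Hel", "lo", ", ", "world"]:
    w.write(tok)
w.close()
```

The key detail is the final flush on `close`: without it, whatever arrived inside the last interval would never reach the database.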
The Result
The difference is night‑and‑day.
| Before | After |
|---|---|
| My wife dreaded starting a new thread because of the “context set‑up” tax. She felt she was constantly repeating herself and had to manually update the master prompt with new data. | She just talks. The system maintains a Deep Memory of about 10 000 tokens (compressed from months of chats) that is injected automatically. |
| Separate threads were isolated; context didn’t carry over. | All threads share the same Cortex. If she mentions a health issue in the “Work” thread (e.g., “My back hurts from sitting”), the “Health” thread knows about it the next time she logs in. |
Conclusion
This project taught me that we are moving from horizontal AI platforms (like ChatGPT, which knows a little about everything) to vertical AI stacks that know everything about you.
ChatGPT and Gemini are already adding user profiles and thread summaries to build this kind of memory. They’re chasing the same goal: a truly personalized experience.
Key takeaway:
- Vectors are great for search.
- Knowledge graphs are essential for understanding.
I keep enjoying building solutions to real problems. With today’s tools, we can build genuinely useful software fast, and in a way users can trust.
The project is live at https://synapse-chat.juandastic.dev/ if you want to see it in action.
The code is open source if you want to dig into the implementation:
- Frontend (Body): https://github.com/juandastic/synapse-chat-ai
- Backend (Cortex): https://github.com/juandastic/synapse-cortex
I’d love to hear your impressions and thoughts. Let’s continue the conversation on X or connect on LinkedIn.
