How I Built a Self-Updating Neo4j Knowledge Graph from Meeting Notes (That Saves 99% on LLM Costs)
Source: Dev.to
The Problem: Your Meeting Notes Are Wasted
Organizations in the US alone hold an estimated 62-80 million meetings every day. Those meetings generate decisions, action items, and task assignments—but most of that intelligence dies in Google Docs.
Want to know “Who was in all the budget meetings?” or “What tasks did Alex get assigned this month?” Good luck searching through thousands of Markdown files.
The real killer? Meeting notes are living documents. People fix names, reassign tasks, update decisions. Without incremental processing, you’re stuck choosing between:
- 💸 Massive LLM bills from reprocessing everything
- 📉 A stale, outdated knowledge graph
I solved this by building a self‑updating Neo4j knowledge graph that only processes changed documents—cutting LLM costs by 99%.
What We’re Building
A pipeline that turns messy meeting notes into a queryable graph database:
Google Drive → Detect Changes → Split Meetings → LLM Extract → Neo4j
Result: Three node types (Meeting, Person, Task) and three relationships (ATTENDED, DECIDED, ASSIGNED_TO) that let you query:
- “Which meetings did Sarah attend?”
- “Where was this task decided?”
- “Who owns all Q4 tasks?”
The Secret Sauce: Incremental Processing
1. Only Process What Changed
The Google Drive source tracks last-modified timestamps. When you have 100,000 meeting notes and only 1% change daily, you process 1,000 files—not 100,000.
@cocoindex.flow_def(name="MeetingNotesGraph")
def meeting_notes_graph_flow(
    flow_builder: cocoindex.FlowBuilder,
    data_scope: cocoindex.DataScope,
) -> None:
    credential_path = os.environ["GOOGLE_SERVICE_ACCOUNT_CREDENTIAL"]
    root_folder_ids = os.environ["GOOGLE_DRIVE_ROOT_FOLDER_IDS"].split(",")

    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.GoogleDrive(
            service_account_credential_path=credential_path,
            root_folder_ids=root_folder_ids,
            recent_changes_poll_interval=datetime.timedelta(seconds=10),
        ),
        refresh_interval=datetime.timedelta(minutes=1),
    )
Impact: a 99% reduction in LLM API costs at a typical 1% daily churn.
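Outside of CocoIndex, the core idea is plain timestamp bookkeeping: remember each file's last-modified time and only hand the files whose timestamp advanced to the expensive stages. A minimal sketch (a hypothetical helper, not the library's actual mechanism):

```python
def changed_files(current: dict[str, float],
                  seen: dict[str, float]) -> list[str]:
    """Return paths whose last-modified time advanced since the
    previous run, then record the new timestamps for next time."""
    changed = [path for path, mtime in sorted(current.items())
               if mtime > seen.get(path, 0.0)]
    seen.update(current)
    return changed
```

On the first run everything is "changed"; afterwards, only files whose mtime moved forward get reprocessed.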
2. Smart Document Splitting
Meeting files often contain multiple sessions. Split them intelligently while keeping the header (e.g., ## Meeting Title) with each section to preserve context for the LLM.
with data_scope["documents"].row() as document:
    document["meetings"] = document["content"].transform(
        cocoindex.functions.SplitBySeparators(
            separators_regex=[r"\n\n##?\ "],
            keep_separator="RIGHT",
        )
    )
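In plain Python, the "keep the header with the right side" behavior corresponds to splitting on a lookahead, so the heading marker is not consumed by the split (an illustrative stand-in for `SplitBySeparators`, not its implementation):

```python
import re

def split_meetings(text: str) -> list[str]:
    """Split on blank-line boundaries followed by a '#' or '##' heading,
    keeping each heading attached to the section that follows it."""
    return [part for part in re.split(r"\n\n(?=##? )", text) if part.strip()]
```

Because the heading survives inside each chunk, the LLM later sees "## Retro" alongside that meeting's notes rather than an anonymous block of text.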
3. Structured LLM Extraction
Define a concrete schema instead of asking the model for “some JSON.”
@dataclass
class Person:
    name: str

@dataclass
class Task:
    description: str
    assigned_to: list[Person]

@dataclass
class Meeting:
    time: datetime.date
    note: str
    organizer: Person
    participants: list[Person]
    tasks: list[Task]
Extract with caching; identical inputs reuse cached outputs, eliminating redundant LLM calls.
with document["meetings"].row() as meeting:
    parsed = meeting["parsed"] = meeting["text"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(
                api_type=cocoindex.LlmApiType.OPENAI,
                model="gpt-4",
            ),
            output_type=Meeting,
        )
    )
Building the Graph
Collect Nodes and Relationships
meeting_nodes = data_scope.add_collector()
attended_rels = data_scope.add_collector()
decided_tasks_rels = data_scope.add_collector()
assigned_rels = data_scope.add_collector()

meeting_key = {"note_file": document["filename"], "time": parsed["time"]}
meeting_nodes.collect(**meeting_key, note=parsed["note"])

attended_rels.collect(
    id=cocoindex.GeneratedField.UUID,
    **meeting_key,
    person=parsed["organizer"]["name"],
    is_organizer=True,
)

with parsed["participants"].row() as participant:
    attended_rels.collect(
        id=cocoindex.GeneratedField.UUID,
        **meeting_key,
        person=participant["name"],
    )
Export to Neo4j with Upsert Logic
meeting_nodes.export(
    "meeting_nodes",
    cocoindex.targets.Neo4j(
        connection=conn_spec,
        mapping=cocoindex.targets.Nodes(label="Meeting"),
    ),
    primary_key_fields=["note_file", "time"],
)
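"Upsert" here means each Meeting node is identified by the composite key (`note_file`, `time`): exporting the same key again overwrites the existing node instead of creating a duplicate. A dictionary analogy of that semantics (a sketch, not CocoIndex or Neo4j internals):

```python
# Nodes keyed by the composite primary key (note_file, time).
meetings: dict[tuple[str, str], dict] = {}

def upsert_meeting(note_file: str, time: str, note: str) -> None:
    """Insert the Meeting node, or overwrite it if the key already exists."""
    meetings[(note_file, time)] = {"note": note}
```

This is why edited meeting notes update the graph in place rather than piling up stale copies.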
Declare Person and Task nodes:
flow_builder.declare(
    cocoindex.targets.Neo4jDeclaration(
        connection=conn_spec,
        nodes_label="Person",
        primary_key_fields=["name"],
    )
)

flow_builder.declare(
    cocoindex.targets.Neo4jDeclaration(
        connection=conn_spec,
        nodes_label="Task",
        primary_key_fields=["description"],
    )
)
Export relationships:
attended_rels.export(
    "attended_rels",
    cocoindex.targets.Neo4j(
        connection=conn_spec,
        mapping=cocoindex.targets.Relationships(
            rel_type="ATTENDED",
            source=cocoindex.targets.NodeFromFields(
                label="Person",
                fields=[
                    cocoindex.targets.TargetFieldMapping(
                        source="person", target="name"
                    ),
                ],
            ),
            target=cocoindex.targets.NodeFromFields(
                label="Meeting",
                fields=[
                    cocoindex.targets.TargetFieldMapping("note_file"),
                    cocoindex.targets.TargetFieldMapping("time"),
                ],
            ),
        ),
    ),
    primary_key_fields=["id"],
)
Running the Pipeline
Setup
export OPENAI_API_KEY=sk-...
export GOOGLE_SERVICE_ACCOUNT_CREDENTIAL=/path/to/service_account.json
export GOOGLE_DRIVE_ROOT_FOLDER_IDS=folderId1,folderId2
pip install cocoindex
Build the Graph
cocoindex update main
Query in Neo4j Browser (http://localhost:7474)
// Who attended which meetings?
MATCH (p:Person)-[:ATTENDED]->(m:Meeting)
RETURN p, m
// Tasks decided in meetings
MATCH (m:Meeting)-[:DECIDED]->(t:Task)
RETURN m, t
// Task assignments by person
MATCH (p:Person)-[:ASSIGNED_TO]->(t:Task)
RETURN p, t
Why This Matters
1. Cost Savings at Scale
- Traditional approach: Reprocess 100,000 docs → 100,000 LLM calls
- Incremental approach: Process 1,000 changed docs → 1,000 LLM calls
Result: 99% cost reduction.
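The arithmetic behind the headline number, assuming 100,000 documents and 1% daily churn:

```python
total_docs = 100_000
daily_churn = 0.01

full_reprocess_calls = total_docs                  # one LLM call per document
incremental_calls = int(total_docs * daily_churn)  # only the changed documents
savings = 1 - incremental_calls / full_reprocess_calls
```

At higher churn the savings shrink proportionally; the 99% figure holds only while most documents stay untouched between runs.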
2. Real‑Time Updates
Switch to live mode and the graph updates automatically when meeting notes change:
refresh_interval=datetime.timedelta(minutes=1)
3. Data Lineage
CocoIndex tracks every transformation, allowing you to trace any Neo4j node back through LLM extraction to the source document.
Beyond Meeting Notes
This pattern works for any text‑heavy domain where documents evolve over time, delivering cost‑effective, up‑to‑date knowledge graphs.