From Catalog Chaos to Real-Time Recommendations: Building a Product Graph with LLMs and Neo4j
Most product recommendation systems I’ve seen are basically fancy keyword matchers. They work okay when you have millions of clicks to analyze, but they completely fall apart when:
- You launch a new product with zero interaction data 📉
- Your catalog is a mess of inconsistent tags and descriptions 🤦
- You want to explain WHY you’re recommending something (not just show a black‑box score)
I just built a real‑time recommendation engine that actually understands products using LLMs and graph databases. The core logic is only ~100 lines of Python.
The Secret Sauce: Product Taxonomy + Knowledge Graphs
Instead of relying on user behavior alone, we teach an LLM to understand:
- What a product actually is (fine‑grained taxonomy like “gel pen” not “office supplies”)
- What people buy together (complementary products like “gel pen” → “notebook”, “pen holder”)
All of this is stored in a Neo4j graph database where relationships become first‑class citizens. You can now query things like “show me all products that share a complementary taxonomy with this gel pen.”
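Here's what that kind of query can look like in practice, as a minimal sketch using the official neo4j Python driver. The Product label and relationship types match what the pipeline exports later in this post; the Taxonomy label, its value property, the connection details, and the sample product id are assumptions you'd adjust to your own setup.

```python
# Sketch: find products whose own taxonomy matches a complementary taxonomy
# of the product being viewed. Labels/relationship types follow this post;
# the "value" property and the product id are assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

COMPLEMENTARY_QUERY = """
MATCH (p:Product {id: $product_id})-[:PRODUCT_COMPLEMENTARY_TAXONOMY]->(t:Taxonomy)
      <-[:PRODUCT_TAXONOMY]-(rec:Product)
WHERE rec.id <> $product_id
RETURN rec.title AS recommendation, collect(t.value) AS shared_taxonomies
ORDER BY size(shared_taxonomies) DESC
"""

with driver.session() as session:
    # "gel-pen-001" is a made-up id; use one from your own catalog
    for record in session.run(COMPLEMENTARY_QUERY, product_id="gel-pen-001"):
        print(record["recommendation"], record["shared_taxonomies"])
```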
Real‑World Example: The Gel Pen Problem
When someone browses a gel pen, a traditional recommender might show:
- Other gel pens (same category)
- Popular items (based on sales)
- Random “customers also bought” (if you have enough data)
With our approach, the LLM analyzes the product description and extracts:
- Primary taxonomy: gel pen, writing instrument
- Complementary taxonomy: notebook, pencil case, desk organizer
The graph now knows these relationships, so viewing the gel pen can surface notebooks, pencil cases, and desk organizers—with explainable connections.
The Architecture (Simplified)
Product JSONs → CocoIndex Pipeline → LLM Extraction → Neo4j Graph
1. Ingest Products as a Stream
We watch a folder of product JSON files with auto‑refresh:
```python
data_scope["products"] = flow_builder.add_source(
    cocoindex.sources.LocalFile(
        path="products",
        included_patterns=["*.json"]
    ),
    refresh_interval=datetime.timedelta(seconds=5)
)
```
Every time a product file changes, it triggers a pipeline update—no manual rebuilds.
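To see that in action, you can drop a new product file into the watched folder. Here's a hypothetical sample whose field names mirror what the extraction function in the next step reads; the values themselves are made up.

```python
# Hypothetical sample: write a product JSON into the watched "products/" folder.
# Field names (source, title, price) mirror what extract_product_info reads below.
import json
from pathlib import Path

sample_product = {
    "source": "https://example.com/products/gel-pen",
    "title": "0.5mm Black Gel Pen (12-pack)",
    "price": "$7.99",
    "description": "Smooth-writing gel ink pen with a comfort grip.",
}

Path("products").mkdir(exist_ok=True)
Path("products/gel-pen-001.json").write_text(json.dumps(sample_product, indent=2))
# Within ~5 seconds the refresh_interval above picks it up and the pipeline re-runs.
```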
2. Clean and Normalize Data
We map raw JSON into a clean structure:
```python
@cocoindex.op.function(behavior_version=2)
def extract_product_info(product: cocoindex.typing.Json, filename: str) -> ProductInfo:
    return ProductInfo(
        id=f"{filename.removesuffix('.json')}",
        url=product["source"],
        title=product["title"],
        price=float(product["price"].lstrip("$").replace(",", "")),
        detail=Template(PRODUCT_TEMPLATE).render(**product),
    )
```
The detail field becomes a markdown “product sheet” that we feed to the LLM.
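The post doesn't show PRODUCT_TEMPLATE itself, but since it's rendered with Jinja2's Template, a minimal hypothetical version might look like this; the real one lives in the example repo.

```python
# A minimal, hypothetical PRODUCT_TEMPLATE -- it renders the raw JSON fields
# into a small markdown "product sheet" that gets handed to the LLM.
from jinja2 import Template

PRODUCT_TEMPLATE = """\
# {{ title }}

- Price: {{ price }}
- Source: {{ source }}

{{ description }}
"""

product = {
    "title": "0.5mm Black Gel Pen (12-pack)",
    "price": "$7.99",
    "source": "https://example.com/products/gel-pen",
    "description": "Smooth-writing gel ink pen with a comfort grip.",
}
print(Template(PRODUCT_TEMPLATE).render(**product))
```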
3. Let the LLM Do the Heavy Lifting
We define the taxonomy contract as dataclasses:
```python
@dataclasses.dataclass
class ProductTaxonomy:
    """
    A concise noun or short phrase based on core functionality.
    Use lowercase, avoid brands/styles.
    Be specific: "pen" not "office supplies".
    """
    name: str

@dataclasses.dataclass
class ProductTaxonomyInfo:
    taxonomies: list[ProductTaxonomy]
    complementary_taxonomies: list[ProductTaxonomy]
```
Then we call the LLM:
```python
taxonomy = data["detail"].transform(
    cocoindex.functions.ExtractByLlm(
        llm_spec=cocoindex.LlmSpec(
            api_type=cocoindex.LlmApiType.OPENAI,
            model="gpt-4.1"
        ),
        output_type=ProductTaxonomyInfo
    )
)
```
The LLM reads the markdown description and returns structured JSON matching our schema—no parsing nightmares.
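For the gel pen example above, the extracted result is roughly equivalent to constructing the dataclasses by hand. These are illustrative values, not actual model output.

```python
# Illustrative only: roughly what the LLM extraction yields for the gel pen,
# expressed with the dataclasses defined above.
gel_pen_taxonomy = ProductTaxonomyInfo(
    taxonomies=[
        ProductTaxonomy(name="gel pen"),
        ProductTaxonomy(name="writing instrument"),
    ],
    complementary_taxonomies=[
        ProductTaxonomy(name="notebook"),
        ProductTaxonomy(name="pencil case"),
        ProductTaxonomy(name="desk organizer"),
    ],
)
```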
4. Build the Knowledge Graph in Neo4j
We export three things:
- Product nodes: id, title, price, url
- Taxonomy nodes: unique labels like "gel pen", "notebook"
- Relationships: PRODUCT_TAXONOMY and PRODUCT_COMPLEMENTARY_TAXONOMY
```python
product_node.export(
    "product_node",
    cocoindex.storages.Neo4j(
        connection=conn_spec,
        mapping=cocoindex.storages.Nodes(label="Product")
    ),
    primary_key_fields=["id"],
)
```
Neo4j automatically deduplicates nodes by primary key. If five products all mention “notebook” as a complementary taxonomy, they all link to the same Taxonomy node.
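You can sanity-check that deduplication once the graph is built. The query below assumes a Taxonomy label with a value property, so adjust it to your actual mapping; it should report a single notebook node with several incoming product links.

```python
# Sanity-check sketch: count how many products point at the single "notebook"
# taxonomy node. The Taxonomy label and "value" property are assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

DEDUP_CHECK = """
MATCH (t:Taxonomy {value: "notebook"})<--(p:Product)
RETURN count(DISTINCT t) AS taxonomy_nodes, count(DISTINCT p) AS linked_products
"""

with driver.session() as session:
    print(session.run(DEDUP_CHECK).single().data())
```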
Running It Live
After setting up Postgres (for CocoIndex’s incremental processing) and Neo4j, run:
```bash
pip install -e .
cocoindex update --setup main
```
You’ll see output such as:
```
documents: 9 added, 0 removed, 0 updated
```
Then open Neo4j Browser at http://localhost:7474 and execute:
```cypher
MATCH p=()-->() RETURN p
```
Boom—your entire product graph visualized.
Why This Actually Works
- LLMs are excellent at text understanding – offload messy natural‑language interpretation to a model you control with schema and docstrings.
- Graphs are made for relationships – you get explainable connections and can run graph algorithms (PageRank, community detection, shortest path, etc.; see the PageRank sketch after this list).
- Incremental updates are free – CocoIndex handles all the plumbing; add a product file, get an updated graph.
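As a taste of the graph-algorithm angle mentioned above, here's a sketch of running PageRank over the product graph. It assumes the Neo4j Graph Data Science plugin is installed on the server and reuses the labels, relationship types, and the assumed value property from earlier; none of this is part of the example repo.

```python
# Sketch: rank product/taxonomy nodes with PageRank via the Neo4j GDS plugin.
# Requires the Graph Data Science library on the Neo4j server.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Project the graph once, then stream PageRank scores over it.
    session.run("""
        CALL gds.graph.project(
          'product_graph',
          ['Product', 'Taxonomy'],
          ['PRODUCT_TAXONOMY', 'PRODUCT_COMPLEMENTARY_TAXONOMY']
        )
    """)
    result = session.run("""
        CALL gds.pageRank.stream('product_graph')
        YIELD nodeId, score
        RETURN coalesce(gds.util.asNode(nodeId).title,
                        gds.util.asNode(nodeId).value) AS name, score
        ORDER BY score DESC LIMIT 10
    """)
    for record in result:
        print(record["name"], round(record["score"], 3))
```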
What You Can Build Next
- Add brand, material, or use‑case taxonomies as separate node types.
- Plug in clickstream data to weight edges or create FREQUENTLY_BOUGHT_WITH relationships (see the sketch after this list).
- Swap OpenAI for Ollama (on-prem LLMs) when you need full control.
- Layer on graph algorithms to find product clusters or detect trending categories.
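As a starting point for the clickstream idea, weighted FREQUENTLY_BOUGHT_WITH edges can be merged in with a few lines of Cypher. The co-purchase pairs and counts below are made up; in practice they'd come from your order or clickstream data.

```python
# Sketch: merge weighted FREQUENTLY_BOUGHT_WITH edges between products.
# The co-purchase pairs/counts are made up for illustration.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

co_purchases = [
    {"a": "gel-pen-001", "b": "notebook-003", "count": 42},
    {"a": "gel-pen-001", "b": "pencil-case-002", "count": 17},
]

MERGE_EDGES = """
UNWIND $pairs AS pair
MATCH (a:Product {id: pair.a}), (b:Product {id: pair.b})
MERGE (a)-[r:FREQUENTLY_BOUGHT_WITH]->(b)
SET r.weight = pair.count
"""

with driver.session() as session:
    session.run(MERGE_EDGES, pairs=co_purchases)
```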
Try It Yourself
Full working code is open‑source:
👉 CocoIndex Product Recommendation Example
The repository includes:
- Complete flow definition
- LLM extraction ops
- Neo4j mappings
- Sample product JSONs
If you’re experimenting with LLM‑native data pipelines or graph‑based recommendations, I’d love to hear what you’re building. Drop a comment or tag me!
P.S. If you found this useful, give the CocoIndex repo a star ⭐.
P.P.S. You can also explore the pipeline visually with CocoInsight (free beta) — it’s like DevTools for your data pipeline, with zero data retention.