[Paper] MemRec: Collaborative Memory-Augmented Agentic Recommender System
Source: arXiv - 2601.08816v1
Overview
The paper introduces MemRec, a new architecture for recommender systems that separates the heavy‑lifting reasoning done by large language models (LLMs) from the management of a collaborative “memory” graph. By letting a lightweight model (LM_Mem) curate and update a shared semantic memory, the downstream recommender LLM (LM_Rec) can focus on generating high‑quality recommendations without being bogged down by massive graph data. This design tackles two long‑standing pain points:
- How to feed rich collaborative signals to LLM‑based recommenders without overwhelming them.
- How to keep that collaborative knowledge fresh without exploding compute costs.
Key Contributions
- Decoupled Architecture – Introduces a two‑stage pipeline (LM_Mem + LM_Rec) that cleanly separates memory management from recommendation reasoning.
- Collaborative Memory Graph – Builds a dynamic, graph‑structured semantic memory that aggregates user‑item interactions across the whole platform, enabling “agentic” LLMs to leverage collective preferences.
- Cost‑Effective Retrieval & Propagation – Proposes an asynchronous graph‑propagation mechanism that updates the memory in the background, dramatically reducing per‑request latency and inference cost.
- Privacy‑Friendly Deployment – Shows that the framework can run with locally hosted open‑source LLMs, keeping user data off the cloud while preserving recommendation quality.
- State‑of‑the‑Art Results – Empirically beats existing LLM‑based recommenders on four public benchmarks, establishing a new Pareto frontier of accuracy vs. cost vs. privacy.
- Open‑Source Release – Provides code and a demo site, encouraging reproducibility and community extensions.
Methodology
1. Memory Construction (LM_Mem)
   - A lightweight language model ingests raw interaction logs (clicks, ratings, timestamps) and encodes them into node embeddings.
   - These embeddings are linked into a collaborative memory graph where edges capture co‑occurrence, similarity, or temporal proximity.
   - LM_Mem runs asynchronous graph propagation (e.g., lightweight message passing) to keep the graph up‑to‑date without blocking recommendation requests.
2. Context Synthesis
   - When a user query arrives, LM_Mem performs a cost‑aware retrieval: it selects a small, high‑signal subgraph (a few hundred nodes) most relevant to the user’s current context.
   - The retrieved subgraph is serialized into a concise textual prompt (e.g., “User A liked items X, Y; similar users liked Z…”) and handed to the second model.
3. Reasoning (LM_Rec)
   - A larger, possibly more powerful LLM (e.g., GPT‑4, Llama‑2) receives the synthesized prompt and generates the final recommendation list, optionally explaining its reasoning.
   - Because the prompt already contains distilled collaborative knowledge, LM_Rec can stay “agentic” (performing chain‑of‑thought reasoning) without needing to process the full graph.
4. Training & Fine‑Tuning
   - LM_Mem is fine‑tuned on a contrastive objective to produce embeddings that preserve collaborative signals.
   - LM_Rec is fine‑tuned on a standard recommendation loss (e.g., cross‑entropy over next‑item prediction) using the prompts generated by LM_Mem.
The overall pipeline is modular: swapping either component for a different model, or scaling the two independently, is straightforward. The code sketches below illustrate, under stated assumptions, what the memory graph, the retrieval‑and‑prompting step, and the contrastive objective might look like.
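To make step 1 concrete, here is a minimal Python sketch of a collaborative memory graph with background propagation. It assumes co‑occurrence edges between users and the items they interact with, and a simple neighbour‑averaging message pass; the names `MemoryGraph` and `propagate_async` are illustrative and not taken from the paper.

```python
import threading
import time
from collections import defaultdict

import numpy as np


class MemoryGraph:
    """Toy collaborative memory: nodes are users/items, edges link co-occurring pairs."""

    def __init__(self, dim=64, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.emb = {}                # node id -> embedding vector
        self.adj = defaultdict(set)  # node id -> neighbouring node ids
        self.lock = threading.Lock()

    def add_interaction(self, user, item):
        """Record one user-item interaction: create nodes if unseen, link them."""
        with self.lock:
            for node in (user, item):
                if node not in self.emb:
                    # In MemRec the embedding would come from LM_Mem encoding the
                    # interaction log; a random vector stands in for it here.
                    self.emb[node] = self.rng.normal(size=self.dim)
            self.adj[user].add(item)
            self.adj[item].add(user)

    def propagate_once(self, alpha=0.5):
        """One round of lightweight message passing: blend each node's embedding
        with the mean of its neighbours' embeddings."""
        with self.lock:
            updated = {}
            for node, neighbours in self.adj.items():
                if neighbours:
                    neigh_mean = np.mean([self.emb[n] for n in neighbours], axis=0)
                    updated[node] = (1 - alpha) * self.emb[node] + alpha * neigh_mean
            self.emb.update(updated)


def propagate_async(graph, interval_s=5.0):
    """Run propagation in a daemon thread so recommendation requests never block."""
    def loop():
        while True:
            graph.propagate_once()
            time.sleep(interval_s)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

A serving process would call `add_interaction` as logs stream in and start `propagate_async` once at boot; request handling only ever reads the periodically refreshed embeddings.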
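Steps 2 and 3 can then be sketched as a retrieval function plus a prompt serializer, assuming cosine‑similarity ranking under a fixed node budget. The functions `retrieve_subgraph` and `serialize_prompt`, and the commented `lm_rec_generate` call, are hypothetical names; the paper's actual retrieval policy and prompt template may differ.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def retrieve_subgraph(graph, user, budget=200):
    """Cost-aware retrieval: keep at most `budget` nodes most similar to the user."""
    u = graph.emb[user]
    scored = [(cosine(u, emb), node) for node, emb in graph.emb.items() if node != user]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [node for _, node in scored[:budget]]


def serialize_prompt(graph, user, nodes, max_listed=20):
    """Serialise the retrieved subgraph into a short textual prompt for LM_Rec."""
    liked = sorted(str(n) for n in graph.adj[user])
    related = [str(n) for n in nodes if n not in graph.adj[user]][:max_listed]
    return (
        f"User {user} liked items: {', '.join(liked)}. "
        f"Closely related users/items in the collaborative memory: {', '.join(related)}. "
        "Recommend the next items for this user and briefly explain why."
    )


# Step 3 hand-off: the distilled prompt is all LM_Rec sees, so it can perform
# chain-of-thought reasoning without touching the full graph. `lm_rec_generate`
# is a placeholder for whatever LLM client is used (local Llama-2, GPT-4, ...).
# prompt = serialize_prompt(graph, "user_42", retrieve_subgraph(graph, "user_42"))
# recommendations = lm_rec_generate(prompt)
```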
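Step 4's contrastive objective for LM_Mem can be illustrated with an InfoNCE‑style loss that pulls a user embedding toward an item the user interacted with and pushes it away from in‑batch negatives; the paper only states that a contrastive objective is used, so the exact formulation below is an assumption. (LM_Rec is fine‑tuned separately with cross‑entropy over next‑item prediction on the prompts LM_Mem produces.)

```python
import numpy as np


def info_nce(user_emb, pos_item_emb, neg_item_embs, temperature=0.07):
    """InfoNCE loss for one (user, positive item) pair against in-batch negatives."""
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-9)

    u = normalize(user_emb)
    pos = normalize(pos_item_emb)
    negs = normalize(np.stack(neg_item_embs))

    # Similarity logits: the positive pair first, then all negatives.
    logits = np.concatenate([[u @ pos], negs @ u]) / temperature
    # Cross-entropy with the positive pair as the target class (index 0).
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]


# Example: 64-d embeddings, one positive item, five in-batch negatives.
rng = np.random.default_rng(0)
loss = info_nce(rng.normal(size=64), rng.normal(size=64),
                [rng.normal(size=64) for _ in range(5)])
print(round(float(loss), 4))
```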
Results & Findings
| Dataset | Metric (HR@10) | MemRec | Best Prior LLM‑Rec | % Gain |
|---|---|---|---|---|
| Amazon‑Books | 0.421 | 0.452 | 0.418 | +8.1% |
| MovieLens‑1M | 0.389 | 0.415 | 0.382 | +8.6% |
| Yelp | 0.337 | 0.361 | 0.333 | +8.4% |
| – | 0.274 | 0.298 | 0.267 | +11.5% |
- Inference Cost: MemRec reduces average GPU memory usage by ~45 % compared with monolithic LLM‑only recommenders because LM_Rec sees a much shorter prompt.
- Latency: End‑to‑end response time drops from ~300 ms to ~180 ms on a single A100, meeting real‑time service SLAs.
- Privacy: Experiments with a fully local Llama‑2‑13B model achieve only a 2–3 % drop in HR@10 relative to the cloud‑based GPT‑4 baseline, demonstrating that high performance is possible without sending data to external APIs.
Ablation studies confirm that (i) the asynchronous graph updates are essential for freshness, and (ii) the decoupling yields a better trade‑off than simply enlarging the prompt size.
Practical Implications
- Scalable Agentic Recommenders – Companies can adopt LLM‑driven recommendation services without the prohibitive cost of feeding the entire interaction graph into the model each time.
- Edge & On‑Device Deployments – Because LM_Mem can run on modest hardware and LM_Rec can be swapped for an open‑source model, MemRec enables privacy‑preserving recommendation on phones, browsers, or IoT devices.
- Rapid Knowledge Refresh – Background graph propagation means new user actions are reflected in recommendations within seconds, crucial for fast‑moving domains like news or e‑commerce flash sales.
- Modular Upgrade Path – Teams can experiment with better retrieval strategies or newer LLMs independently, shortening the R&D cycle.
- Cost Savings – Lower GPU memory and compute per request translate directly into reduced cloud spend, making LLM‑based recommendation viable for mid‑size platforms.
Limitations & Future Work
- Graph Size Explosion – While LM_Mem mitigates runtime cost, the underlying memory graph still grows linearly with user‑item interactions; efficient pruning or hierarchical summarization remains an open challenge.
- Cold‑Start for New Items – Items with few interactions rely heavily on content features; the current setup may underperform when textual metadata is sparse.
- Prompt Engineering Sensitivity – The quality of LM_Rec’s output depends on how LM_Mem formats the subgraph; more robust, possibly learned, prompt templates could improve stability.
- Evaluation on Real‑World Traffic – Benchmarks are static; deploying MemRec in a live A/B test would reveal latency spikes, cache effects, and user satisfaction metrics not captured in offline experiments.
Future research directions suggested by the authors include hierarchical memory graphs, adaptive retrieval budgets based on request urgency, and tighter integration of reinforcement learning to continuously align the collaborative memory with business objectives.
Authors
- Weixin Chen
- Yuhan Zhao
- Jingyuan Huang
- Zihe Ye
- Clark Mingxuan Ju
- Tong Zhao
- Neil Shah
- Li Chen
- Yongfeng Zhang
Paper Information
- arXiv ID: 2601.08816v1
- Categories: cs.IR, cs.AI
- Published: January 13, 2026