[Paper] MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

Published: February 2, 2026 at 01:53 PM EST
4 min read
Source: arXiv - 2602.02474v1

Overview

The paper introduces MemSkill, a novel framework that treats an LLM agent’s memory operations as learnable, reusable skills rather than fixed, hand‑crafted functions. By letting the system discover and evolve how it extracts, consolidates, and prunes information from long interaction histories, MemSkill achieves more flexible and efficient memory management, boosting downstream task performance.

Key Contributions

  • Skill‑based memory architecture – Recasts memory extraction, consolidation, and pruning as modular “skills” that can be selected and executed on demand (a sketch of one possible skill representation follows this list).
  • Closed‑loop learning pipeline – Combines a controller (skill selector), an executor (LLM that applies the chosen skill), and a designer (automated reviewer that creates/refines skills from failure cases).
  • Self‑evolving skill set – The designer continuously expands the skill repertoire, enabling the agent to adapt to new interaction patterns without manual redesign.
  • Empirical validation – Demonstrates consistent gains on four benchmarks (LoCoMo, LongMemEval, HotpotQA, ALFWorld) compared with strong static‑memory baselines.
  • Analysis of skill evolution – Provides qualitative and quantitative insights into how the skill library grows and specializes over training iterations.
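
The summary does not include the authors' code, but to ground the skill‑based architecture, a skill can be pictured as a prompt template plus some bookkeeping metadata. The Python sketch below is purely illustrative: the `MemorySkill` class, its fields, and the seed prompts are assumptions made for this post, not the paper's interface.

```python
from dataclasses import dataclass

@dataclass
class MemorySkill:
    """Hypothetical representation of one learnable memory skill."""
    name: str                 # short identifier, e.g. "summarize_recent_turns"
    prompt_template: str      # instruction handed to the executor LLM
    created_by: str = "seed"  # "seed" for hand-written, "designer" for auto-generated
    uses: int = 0             # how often the controller has selected this skill
    successes: int = 0        # how often the resulting memory passed verification

# Seed skills mirroring the example operations named in the paper summary
SEED_SKILLS = [
    MemorySkill("summarize_recent_turns",
                "Summarize the last 5 user turns into concise memory entries."),
    MemorySkill("merge_overlapping_facts",
                "Merge memory entries that state overlapping or duplicate facts."),
    MemorySkill("drop_stale_entries",
                "Remove memory entries that are no longer relevant."),
]
```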

Methodology

  1. Skill Library – Each skill is a short prompt template that tells the LLM what to do (e.g., “summarize the last 5 user turns”, “merge overlapping facts”, “drop stale entries”).
  2. Controller – A lightweight policy network (often a small transformer or MLP) that reads the current interaction context and picks the top‑k most relevant skills from the library.
  3. Executor – A large language model (e.g., GPT‑4‑style) that receives the selected skill prompts together with the raw interaction trace and generates the updated memory representation.
  4. Designer – After each episode, the system checks whether the produced memory satisfies a set of verification criteria (completeness, correctness). When a failure is detected, the designer synthesizes new skill prompts or refines existing ones using the LLM itself, then adds them to the library.
  5. Training Loop – The controller is trained with reinforcement learning (policy gradient) using task reward signals, while the designer operates in a separate, periodic “review” phase. The whole pipeline runs iteratively, allowing both the selection policy and the skill set to improve together (a minimal code sketch of this loop follows below).
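
To see how these components interact, here is a minimal, hypothetical sketch of one closed‑loop episode, building on the `MemorySkill` dataclass above. The function names, the `score` stand‑in for the learned controller policy, and the `llm(prompt) -> str` call are all assumptions for illustration; the policy‑gradient training of the controller and the periodic designer review phase are omitted.

```python
import random
from typing import Callable, List

# Assumes `MemorySkill` and `SEED_SKILLS` from the earlier sketch, plus any
# chat-completion function with the shape `llm(prompt: str) -> str`.

def select_skills(context: str, library: List[MemorySkill],
                  score: Callable[[str, MemorySkill], float],
                  k: int = 3) -> List[MemorySkill]:
    """Controller: rank skills for the current context and keep the top-k.
    In the paper this is a learned policy; `score` stands in for it here."""
    return sorted(library, key=lambda s: score(context, s), reverse=True)[:k]

def apply_skills(llm, memory: str, trace: str,
                 skills: List[MemorySkill]) -> str:
    """Executor: an LLM applies each selected skill prompt to the raw trace."""
    for skill in skills:
        skill.uses += 1
        prompt = (f"{skill.prompt_template}\n\n"
                  f"Current memory:\n{memory}\n\n"
                  f"Interaction trace:\n{trace}\n\n"
                  "Return the updated memory.")
        memory = llm(prompt)
    return memory

def design_new_skill(llm, trace: str, memory: str) -> MemorySkill:
    """Designer: synthesize a new skill prompt from a failure case."""
    proposal = llm(
        "The memory below failed a verification check for the task.\n"
        f"Memory:\n{memory}\n\nTrace:\n{trace}\n\n"
        "Write one short, reusable instruction for a memory operation that "
        "would have prevented this failure."
    )
    return MemorySkill(name=f"designed_{random.randrange(10**6)}",
                       prompt_template=proposal, created_by="designer")

def run_episode(llm, library: List[MemorySkill], score, trace: str,
                verify: Callable[[str], bool], memory: str = "") -> str:
    """One closed-loop pass: select, execute, verify, and (on failure) design."""
    skills = select_skills(trace, library, score)
    memory = apply_skills(llm, memory, trace, skills)
    if verify(memory):
        for skill in skills:
            skill.successes += 1
    else:
        library.append(design_new_skill(llm, trace, memory))
    return memory
```

In this toy version the designer only adds new skills on failure; the paper's designer also refines existing ones, which could be modeled analogously by asking the LLM to rewrite a skill's `prompt_template`.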

Results & Findings

| Benchmark | Baseline (static memory) | MemSkill | Relative ↑ |
|---|---|---|---|
| LoCoMo (long‑context reasoning) | 68.2% | 74.9% | +9.8% |
| LongMemEval (memory recall) | 61.5% | 68.3% | +11.1% |
| HotpotQA (multi‑hop QA) | 73.0% | 78.6% | +7.6% |
| ALFWorld (embodied task) | 55.4% | 62.1% | +12.1% |

  • Skill selection quickly converges: after ~200 k steps the controller reliably picks the most useful 2–3 skills per turn.
  • Skill growth: the designer adds ~0.5 new skills per 10 k steps, with later iterations focusing on niche cases (e.g., “detect contradictory statements”).
  • Memory efficiency: average memory size shrinks by ~30 % compared to a naïve sliding‑window approach, while preserving or improving task accuracy.

Practical Implications

  • Scalable agents – Developers can plug MemSkill into existing LLM‑based assistants to handle arbitrarily long conversation histories without exploding token budgets (see the integration sketch after this list).
  • Domain adaptation – Because skills are learned from data, a team can let the designer discover domain‑specific memory operations (e.g., “track order status” for e‑commerce bots) without hand‑coding them.
  • Reduced engineering overhead – The closed‑loop system automates the tedious process of refining memory heuristics, freeing engineers to focus on higher‑level behavior.
  • Better user experience – More accurate recall and less “forgetting” translate to smoother multi‑turn interactions, especially in support, tutoring, or planning applications.
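
To illustrate the “plug MemSkill into an existing assistant” idea, the following hypothetical wrapper runs one memory update per conversation turn using the `run_episode` sketch from the Methodology section. The `MemoryManager` class and its method names are assumed for this example, not an interface provided with the paper.

```python
class MemoryManager:
    """Hypothetical wrapper: one closed-loop memory update per conversation turn."""

    def __init__(self, llm, library, score, verify):
        self.llm = llm
        self.library = library
        self.score = score
        self.verify = verify
        self.memory = ""

    def update(self, trace: str) -> str:
        # `run_episode` is the closed-loop function sketched under Methodology.
        self.memory = run_episode(self.llm, self.library, self.score,
                                  trace, self.verify, self.memory)
        return self.memory

# Sketch of a multi-turn assistant that keeps its prompt size bounded:
# manager = MemoryManager(llm=my_llm, library=SEED_SKILLS,
#                         score=my_policy, verify=my_checker)
# for user_msg in conversation:
#     memory = manager.update(trace=user_msg)
#     reply = my_llm(f"Memory:\n{memory}\n\nUser: {user_msg}")
```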

Limitations & Future Work

  • Skill explosion risk – Without careful pruning, the skill library can grow large, potentially slowing down the controller’s selection step.
  • Verification dependence – The designer’s ability to generate useful new skills hinges on the quality of the automatic correctness checks; noisy signals may lead to suboptimal skill proposals.
  • Compute cost – Running an LLM executor for each selected skill adds latency; future work could explore lightweight executor variants or caching mechanisms.
  • Generalization to non‑text modalities – Current experiments focus on textual traces; extending MemSkill to multimodal agents (vision, robotics) remains an open challenge.

Overall, MemSkill points toward a new paradigm where an LLM agent’s memory is not a static data structure but a dynamic, self‑optimizing skill set—opening the door to more adaptable and long‑running AI assistants.

Authors

  • Haozhen Zhang
  • Quanyu Long
  • Jianzhu Bao
  • Tao Feng
  • Weizhi Zhang
  • Haodong Yue
  • Wenya Wang

Paper Information

  • arXiv ID: 2602.02474v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: February 2, 2026