[Paper] MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

Published: February 2, 2026 at 01:53 PM EST
4 min read
Source: arXiv - 2602.02474v1

Overview

The paper introduces MemSkill, a novel framework that treats an LLM agent’s memory operations as learnable, reusable skills rather than fixed, hand‑crafted functions. By letting the system discover and evolve how it extracts, consolidates, and prunes information from long interaction histories, MemSkill achieves more flexible and efficient memory management, boosting downstream task performance.

Key Contributions

  • Skill‑based memory architecture – Recasts memory extraction, consolidation, and pruning as modular “skills” that can be selected and executed on demand (a sketch of one possible skill representation follows this list).
  • Closed‑loop learning pipeline – Combines a controller (skill selector), an executor (LLM that applies the chosen skill), and a designer (automated reviewer that creates/refines skills from failure cases).
  • Self‑evolving skill set – The designer continuously expands the skill repertoire, enabling the agent to adapt to new interaction patterns without manual redesign.
  • Empirical validation – Demonstrates consistent gains on four benchmarks (LoCoMo, LongMemEval, HotpotQA, ALFWorld) compared with strong static‑memory baselines.
  • Analysis of skill evolution – Provides qualitative and quantitative insights into how the skill library grows and specializes over training iterations.
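
The summary does not include the authors' code, but to ground the skill‑based architecture, a skill can be pictured as a prompt template plus some bookkeeping metadata. The Python sketch below is purely illustrative: the `MemorySkill` class, its fields, and the seed prompts are assumptions made for this post, not the paper's interface.

```python
from dataclasses import dataclass

@dataclass
class MemorySkill:
    """Hypothetical representation of one learnable memory skill."""
    name: str                 # short identifier, e.g. "summarize_recent_turns"
    prompt_template: str      # instruction handed to the executor LLM
    created_by: str = "seed"  # "seed" for hand-written, "designer" for auto-generated
    uses: int = 0             # how often the controller has selected this skill
    successes: int = 0        # how often the resulting memory passed verification

# Seed skills mirroring the example operations named in the paper summary
SEED_SKILLS = [
    MemorySkill("summarize_recent_turns",
                "Summarize the last 5 user turns into concise memory entries."),
    MemorySkill("merge_overlapping_facts",
                "Merge memory entries that state overlapping or duplicate facts."),
    MemorySkill("drop_stale_entries",
                "Remove memory entries that are no longer relevant."),
]
```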

Methodology

  1. Skill Library – Each skill is a short prompt template that tells the LLM what to do (e.g., “summarize the last 5 user turns”, “merge overlapping facts”, “drop stale entries”).
  2. Controller – A lightweight policy network (often a small transformer or MLP) that reads the current interaction context and picks the top‑k most relevant skills from the library.
  3. Executor – A large language model (e.g., GPT‑4‑style) that receives the selected skill prompts together with the raw interaction trace and generates the updated memory representation.
  4. Designer – After each episode, the system checks whether the produced memory satisfies a set of verification criteria (completeness, correctness). When a failure is detected, the designer synthesizes new skill prompts or refines existing ones using the LLM itself, then adds them to the library.
  5. Training Loop – The controller is trained with reinforcement learning (policy gradient) using task reward signals, while the designer operates in a separate, periodic “review” phase. The whole pipeline runs iteratively, allowing both the selection policy and the skill set to improve together (a minimal code sketch of this loop follows below).
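
To see how these components interact, here is a minimal, hypothetical sketch of one closed‑loop episode, building on the `MemorySkill` dataclass above. The function names, the `score` stand‑in for the learned controller policy, and the `llm(prompt) -> str` call are all assumptions for illustration; the policy‑gradient training of the controller and the periodic designer review phase are omitted.

```python
import random
from typing import Callable, List

# Assumes `MemorySkill` and `SEED_SKILLS` from the earlier sketch, plus any
# chat-completion function with the shape `llm(prompt: str) -> str`.

def select_skills(context: str, library: List[MemorySkill],
                  score: Callable[[str, MemorySkill], float],
                  k: int = 3) -> List[MemorySkill]:
    """Controller: rank skills for the current context and keep the top-k.
    In the paper this is a learned policy; `score` stands in for it here."""
    return sorted(library, key=lambda s: score(context, s), reverse=True)[:k]

def apply_skills(llm, memory: str, trace: str,
                 skills: List[MemorySkill]) -> str:
    """Executor: an LLM applies each selected skill prompt to the raw trace."""
    for skill in skills:
        skill.uses += 1
        prompt = (f"{skill.prompt_template}\n\n"
                  f"Current memory:\n{memory}\n\n"
                  f"Interaction trace:\n{trace}\n\n"
                  "Return the updated memory.")
        memory = llm(prompt)
    return memory

def design_new_skill(llm, trace: str, memory: str) -> MemorySkill:
    """Designer: synthesize a new skill prompt from a failure case."""
    proposal = llm(
        "The memory below failed a verification check for the task.\n"
        f"Memory:\n{memory}\n\nTrace:\n{trace}\n\n"
        "Write one short, reusable instruction for a memory operation that "
        "would have prevented this failure."
    )
    return MemorySkill(name=f"designed_{random.randrange(10**6)}",
                       prompt_template=proposal, created_by="designer")

def run_episode(llm, library: List[MemorySkill], score, trace: str,
                verify: Callable[[str], bool], memory: str = "") -> str:
    """One closed-loop pass: select, execute, verify, and (on failure) design."""
    skills = select_skills(trace, library, score)
    memory = apply_skills(llm, memory, trace, skills)
    if verify(memory):
        for skill in skills:
            skill.successes += 1
    else:
        library.append(design_new_skill(llm, trace, memory))
    return memory
```

In this toy version the designer only adds new skills on failure; the paper's designer also refines existing ones, which could be modeled analogously by asking the LLM to rewrite a skill's `prompt_template`.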

Results & Findings

| Benchmark | Baseline (static memory) | MemSkill | Relative ↑ |
|---|---|---|---|
| LoCoMo (long‑context reasoning) | 68.2% | 74.9% | +9.8% |
| LongMemEval (memory recall) | 61.5% | 68.3% | +11.1% |
| HotpotQA (multi‑hop QA) | 73.0% | 78.6% | +7.6% |
| ALFWorld (embodied task) | 55.4% | 62.1% | +12.1% |

  • Skill selection quickly converges: after ~200 k steps the controller reliably picks the most useful 2–3 skills per turn.
  • Skill growth: the designer adds ~0.5 new skills per 10 k steps, with later iterations focusing on niche cases (e.g., “detect contradictory statements”).
  • Memory efficiency: average memory size shrinks by ~30 % compared to a naïve sliding‑window approach, while preserving or improving task accuracy.

Practical Implications

  • Scalable agents – Developers can plug MemSkill into existing LLM‑based assistants to handle arbitrarily long conversation histories without exploding token budgets (see the integration sketch after this list).
  • Domain adaptation – Because skills are learned from data, a team can let the designer discover domain‑specific memory operations (e.g., “track order status” for e‑commerce bots) without hand‑coding them.
  • Reduced engineering overhead – The closed‑loop system automates the tedious process of refining memory heuristics, freeing engineers to focus on higher‑level behavior.
  • Better user experience – More accurate recall and less “forgetting” translate to smoother multi‑turn interactions, especially in support, tutoring, or planning applications.
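
To illustrate the “plug MemSkill into an existing assistant” idea, the following hypothetical wrapper runs one memory update per conversation turn using the `run_episode` sketch from the Methodology section. The `MemoryManager` class and its method names are assumed for this example, not an interface provided with the paper.

```python
class MemoryManager:
    """Hypothetical wrapper: one closed-loop memory update per conversation turn."""

    def __init__(self, llm, library, score, verify):
        self.llm = llm
        self.library = library
        self.score = score
        self.verify = verify
        self.memory = ""

    def update(self, trace: str) -> str:
        # `run_episode` is the closed-loop function sketched under Methodology.
        self.memory = run_episode(self.llm, self.library, self.score,
                                  trace, self.verify, self.memory)
        return self.memory

# Sketch of a multi-turn assistant that keeps its prompt size bounded:
# manager = MemoryManager(llm=my_llm, library=SEED_SKILLS,
#                         score=my_policy, verify=my_checker)
# for user_msg in conversation:
#     memory = manager.update(trace=user_msg)
#     reply = my_llm(f"Memory:\n{memory}\n\nUser: {user_msg}")
```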

Limitations & Future Work

  • Skill explosion risk – Without careful pruning, the skill library can grow large, potentially slowing down the controller’s selection step.
  • Verification dependence – The designer’s ability to generate useful new skills hinges on the quality of the automatic correctness checks; noisy signals may lead to suboptimal skill proposals.
  • Compute cost – Running an LLM executor for each selected skill adds latency; future work could explore lightweight executor variants or caching mechanisms.
  • Generalization to non‑text modalities – Current experiments focus on textual traces; extending MemSkill to multimodal agents (vision, robotics) remains an open challenge.

Overall, MemSkill points toward a new paradigm where an LLM agent’s memory is not a static data structure but a dynamic, self‑optimizing skill set—opening the door to more adaptable and long‑running AI assistants.

Authors

  • Haozhen Zhang
  • Quanyu Long
  • Jianzhu Bao
  • Tao Feng
  • Weizhi Zhang
  • Haodong Yue
  • Wenya Wang

Paper Information

  • arXiv ID: 2602.02474v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: February 2, 2026