[Paper] MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory
Source: arXiv - 2601.03192v1
Overview
MemRL introduces a new way for large language model (LLM) agents to learn on the fly by treating their episodic memory as a reinforcement‑learning (RL) playground. Instead of repeatedly fine‑tuning the massive model (which is costly and risks erasing previously learned skills), MemRL keeps the LLM frozen and lets a lightweight, non‑parametric memory module evolve its retrieval policy through trial‑and‑error feedback from the environment. The result is an agent that continuously improves its problem‑solving repertoire at runtime.
Key Contributions
- Two‑Phase Retrieval: First filters memory entries by semantic similarity, then ranks the remaining candidates with learned Q‑values that reflect their utility for the current task.
- Non‑Parametric RL on Memory: Applies classic Q‑learning updates directly to the episodic memory store, sidestepping expensive gradient‑based fine‑tuning (the standard update rule is restated after this list).
- Stability‑Plasticity Separation: Keeps the LLM’s reasoning core frozen (stable) while allowing the memory to adapt (plastic), eliminating catastrophic forgetting.
- Broad Benchmark Validation: Shows consistent gains on diverse suites: HLE (Humanity's Last Exam), BigCodeBench (code generation), ALFWorld (interactive simulation), and Lifelong Agent Bench (continual learning).
- Runtime Self‑Evolution: Demonstrates that agents can improve during deployment without any weight updates, purely by refining memory utilities.
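For reference, the “classic Q‑learning update” invoked above is the textbook tabular rule. For a transition ⟨s, a, r, s′⟩ with learning rate α and discount factor γ it reads

$$Q(s,a) \;\leftarrow\; Q(s,a) + \alpha\big[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\big]$$

where the bracketed term is the TD‑error. In MemRL the object being scored is a retrieved memory entry rather than a raw state–action pair, so the paper’s exact formulation over memory entries may differ from this standard restatement.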
Methodology
- Frozen LLM Backbone – The large language model is loaded once and never updated; it provides deterministic, high‑quality reasoning and generation.
- Episodic Memory Store – A database of past interaction tuples ⟨state, action, reward, next‑state⟩ is maintained. Each entry is indexed by a semantic embedding (e.g., using the LLM’s own encoder).
- Two‑Phase Retrieval
- Phase 1 – Semantic Filtering: Given a new query, retrieve the top‑k memory entries whose embeddings are closest to the query embedding.
- Phase 2 – Utility Ranking: For the filtered set, compute a Q‑value for each entry using a lightweight Q‑network (or even a tabular estimator). The entry with the highest Q‑value is selected as the “suggested action” (a minimal sketch of both retrieval phases and the update loop follows this list).
- Runtime RL Loop
- The agent executes the suggested action in the environment, observes the reward, and records the transition back into memory.
- Q‑values are updated with the standard Q‑learning temporal‑difference (TD) rule, using the observed reward and the maximum Q‑value of the next state.
- Over time, high‑reward strategies acquire larger Q‑values, while noisy or low‑value memories are demoted.
- Continuous Deployment – Because only the memory and its Q‑values change, the system can run indefinitely on a production server without re‑training the massive LLM.
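The methodology above maps onto a compact implementation. The following is a minimal sketch, assuming tabular Q‑values per memory entry, cosine similarity for Phase 1 filtering, and externally supplied embeddings; the class names, hyper‑parameters, and helper structure are illustrative rather than taken from the paper.

```python
# Minimal sketch of MemRL-style runtime RL over episodic memory (illustrative, not the paper's code).
# Assumptions: tabular Q-values, cosine similarity for Phase-1 filtering,
# and externally supplied embeddings for queries and stored transitions.
from dataclasses import dataclass
import numpy as np

@dataclass
class MemoryEntry:
    state: str             # textual description of the situation
    action: str            # what the agent did
    reward: float          # outcome observed at the time
    next_state: str
    embedding: np.ndarray  # semantic index used for Phase-1 filtering
    q_value: float = 0.0   # learned utility, refined at runtime

class EpisodicMemory:
    def __init__(self, top_k: int = 16, alpha: float = 0.1, gamma: float = 0.9):
        self.entries: list[MemoryEntry] = []
        self.top_k, self.alpha, self.gamma = top_k, alpha, gamma

    def add(self, entry: MemoryEntry) -> None:
        self.entries.append(entry)

    def retrieve(self, query_emb: np.ndarray) -> MemoryEntry | None:
        """Phase 1: semantic filtering; Phase 2: rank survivors by learned Q-value."""
        if not self.entries:
            return None
        def cos(a: np.ndarray, b: np.ndarray) -> float:
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        scored = [(cos(query_emb, e.embedding), e) for e in self.entries]
        candidates = sorted(scored, key=lambda p: p[0], reverse=True)[: self.top_k]
        return max((e for _, e in candidates), key=lambda e: e.q_value)

    def update(self, entry: MemoryEntry, reward: float, next_query_emb: np.ndarray) -> None:
        """Q-learning / TD update on the utility of the entry that was acted upon."""
        nxt = self.retrieve(next_query_emb)
        target = reward + self.gamma * (nxt.q_value if nxt else 0.0)
        entry.q_value += self.alpha * (target - entry.q_value)
```

At runtime the agent calls retrieve() on the current query embedding, lets the frozen LLM condition on the suggested entry while acting, then records the new transition with add() and refines the selected entry's utility with update() using the observed reward; the LLM weights are never touched.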
Results & Findings
| Benchmark | Baseline (static memory) | MemRL | Relative Gain |
|---|---|---|---|
| HLE (language tasks) | 68.2 % | 77.5 % | +13.6 % |
| BigCodeBench (code generation) | 45.1 % | 58.3 % | +29.4 % |
| ALFWorld (interactive navigation) | 52.8 % | 64.9 % | +22.9 % |
| Lifelong Agent Bench (continual) | 61.4 % | 73.2 % | +19.2 % |
- Stability: The frozen LLM’s performance on earlier tasks never degrades, confirming the absence of catastrophic forgetting.
- Plasticity: Q‑values converge within a few hundred interactions, enabling rapid adaptation to new task distributions.
- Ablation: Removing Phase 2 (utility ranking) drops performance by ~10 %, highlighting the importance of learned Q‑values over pure semantic similarity.
Practical Implications
- Deploy‑time Skill Growth: SaaS products that embed LLM agents (e.g., code assistants, chatbots, autonomous UI agents) can now improve from real user interactions without costly model retraining pipelines.
- Cost‑Effective Continual Learning: Companies can avoid GPU‑intensive fine‑tuning cycles; the memory‑only RL updates run on CPUs or modest GPUs, dramatically lowering operational expenses.
- Safety & Auditing: Since the core LLM never changes, its baseline behavior remains auditable and verifiable, while the mutable memory can be inspected, logged, and rolled back if undesirable strategies emerge.
- Domain‑Specific Adaptation: Teams can seed the episodic memory with proprietary examples (e.g., internal APIs, coding conventions) and let the agent refine its usage over time, achieving a “personalized LLM” without exposing proprietary data to the model weights.
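As a sketch of the seeding workflow just described, reusing the hypothetical EpisodicMemory and MemoryEntry from the methodology sketch above together with an assumed embed() helper, proprietary examples enter memory as neutral entries whose utilities are later refined by runtime rewards:

```python
# Illustrative seeding of episodic memory with domain examples (all names assumed, not from the paper).
memory = EpisodicMemory()
for ex in load_internal_examples():              # hypothetical loader for proprietary examples
    memory.add(MemoryEntry(
        state=ex["situation"], action=ex["solution"],
        reward=0.0, next_state="",                # no outcome yet; utility starts neutral
        embedding=embed(ex["situation"]),         # assumed embedding helper
    ))
# Subsequent interactions raise or lower each entry's q_value via memory.update(...),
# without ever writing proprietary data into the model weights.
```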
Limitations & Future Work
- Memory Scalability: As the number of episodes grows, retrieval latency can increase; efficient indexing (e.g., IVF‑PQ) or memory pruning strategies are needed for long‑running services (a generic indexing sketch follows this list).
- Reward Design: The framework relies on well‑shaped reward signals; sparse or noisy rewards can slow Q‑value convergence, suggesting a need for reward shaping or auxiliary learning signals.
- Generalization Beyond Retrieval: MemRL excels when the solution can be assembled from past examples; tasks requiring fundamentally novel reasoning may still need parameter updates.
- Future Directions: The authors propose integrating meta‑RL to adapt the Q‑learning hyper‑parameters on the fly, exploring hierarchical memory structures for multi‑step planning, and extending the approach to multimodal agents (vision‑language).
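To make the indexing suggestion under Memory Scalability concrete, the snippet below builds a generic FAISS IVF‑PQ index over memory embeddings; the dimensionality, cluster count, and quantizer settings are arbitrary placeholders, and this is standard FAISS usage rather than code from the paper.

```python
# Generic IVF-PQ indexing of memory embeddings with FAISS (illustrative parameters).
import faiss
import numpy as np

d = 768                         # embedding dimension (assumed)
nlist, m, nbits = 128, 32, 8    # IVF clusters, PQ sub-quantizers, bits per code

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

embeddings = np.random.rand(10_000, d).astype("float32")  # stand-in for stored memory embeddings
faiss.normalize_L2(embeddings)       # with unit vectors, L2 ranking matches cosine similarity
index.train(embeddings)              # IVF-PQ needs a training pass before adding vectors
index.add(embeddings)

index.nprobe = 16                    # clusters probed per query: recall/latency trade-off
query = embeddings[:1].copy()
distances, ids = index.search(query, k=16)  # Phase-1 style top-k candidate filtering
```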
Authors
- Shengtao Zhang
- Jiaqian Wang
- Ruiwen Zhou
- Junwei Liao
- Yuchen Feng
- Weinan Zhang
- Ying Wen
- Zhiyu Li
- Feiyu Xiong
- Yutao Qi
- Bo Tang
- Muning Wen
Paper Information
- arXiv ID: 2601.03192v1
- Categories: cs.CL
- Published: January 6, 2026