[Paper] MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory
Source: arXiv - 2601.03192v1
Overview
MemRL introduces a new way for large language model (LLM) agents to learn on the fly by treating their episodic memory as a reinforcement‑learning (RL) playground. Instead of repeatedly fine‑tuning the massive model (which is costly and risks erasing previously learned skills), MemRL keeps the LLM frozen and lets a lightweight, non‑parametric memory module evolve its retrieval policy through trial‑and‑error feedback from the environment. The result is an agent that continuously improves its problem‑solving repertoire at runtime.
Key Contributions
- Two‑Phase Retrieval: First filters memory entries by semantic similarity, then ranks the remaining candidates with learned Q‑values that reflect their utility for the current task.
- Non‑Parametric RL on Memory: Applies classic Q‑learning updates directly to the episodic memory store, sidestepping expensive gradient‑based fine‑tuning (the standard update rule is restated after this list).
- Stability‑Plasticity Separation: Keeps the LLM’s reasoning core frozen (stable) while allowing the memory to adapt (plastic), eliminating catastrophic forgetting.
- Broad Benchmark Validation: Shows consistent gains on diverse suites: HLE (Humanity's Last Exam), BigCodeBench (code generation), ALFWorld (interactive simulation), and Lifelong Agent Bench (continual learning).
- Runtime Self‑Evolution: Demonstrates that agents can improve during deployment without any weight updates, purely by refining memory utilities.
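For reference, the “classic Q‑learning update” invoked above is the textbook tabular rule. For a transition ⟨s, a, r, s′⟩ with learning rate α and discount factor γ it reads

$$Q(s,a) \;\leftarrow\; Q(s,a) + \alpha\big[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\big]$$

where the bracketed term is the TD‑error. In MemRL the object being scored is a retrieved memory entry rather than a raw state–action pair, so the paper’s exact formulation over memory entries may differ from this standard restatement.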
Methodology
- Frozen LLM Backbone – The large language model is loaded once and never updated; it provides deterministic, high‑quality reasoning and generation.
- Episodic Memory Store – A database of past interaction tuples ⟨state, action, reward, next‑state⟩ is maintained. Each entry is indexed by a semantic embedding (e.g., using the LLM’s own encoder).
- Two‑Phase Retrieval
- Phase 1 – Semantic Filtering: Given a new query, retrieve the top‑k memory entries whose embeddings are closest to the query embedding.
- Phase 2 – Utility Ranking: For the filtered set, compute a Q‑value for each entry using a lightweight Q‑network (or even a tabular estimator). The entry with the highest Q‑value is selected as the “suggested action” (a minimal sketch of both retrieval phases and the update loop follows this list).
- Runtime RL Loop
- The agent executes the suggested action in the environment, observes the reward, and records the transition back into memory.
- Q‑values are updated with the standard Q‑learning temporal‑difference (TD) rule, using the observed reward and the maximum Q‑value of the next state.
- Over time, high‑reward strategies acquire larger Q‑values, while noisy or low‑value memories are demoted.
- Continuous Deployment – Because only the memory and its Q‑values change, the system can run indefinitely on a production server without re‑training the massive LLM.
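The methodology above maps onto a compact implementation. The following is a minimal sketch, assuming tabular Q‑values per memory entry, cosine similarity for Phase 1 filtering, and externally supplied embeddings; the class names, hyper‑parameters, and helper structure are illustrative rather than taken from the paper.

```python
# Minimal sketch of MemRL-style runtime RL over episodic memory (illustrative, not the paper's code).
# Assumptions: tabular Q-values, cosine similarity for Phase-1 filtering,
# and externally supplied embeddings for queries and stored transitions.
from dataclasses import dataclass
import numpy as np

@dataclass
class MemoryEntry:
    state: str             # textual description of the situation
    action: str            # what the agent did
    reward: float          # outcome observed at the time
    next_state: str
    embedding: np.ndarray  # semantic index used for Phase-1 filtering
    q_value: float = 0.0   # learned utility, refined at runtime

class EpisodicMemory:
    def __init__(self, top_k: int = 16, alpha: float = 0.1, gamma: float = 0.9):
        self.entries: list[MemoryEntry] = []
        self.top_k, self.alpha, self.gamma = top_k, alpha, gamma

    def add(self, entry: MemoryEntry) -> None:
        self.entries.append(entry)

    def retrieve(self, query_emb: np.ndarray) -> MemoryEntry | None:
        """Phase 1: semantic filtering; Phase 2: rank survivors by learned Q-value."""
        if not self.entries:
            return None
        def cos(a: np.ndarray, b: np.ndarray) -> float:
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        scored = [(cos(query_emb, e.embedding), e) for e in self.entries]
        candidates = sorted(scored, key=lambda p: p[0], reverse=True)[: self.top_k]
        return max((e for _, e in candidates), key=lambda e: e.q_value)

    def update(self, entry: MemoryEntry, reward: float, next_query_emb: np.ndarray) -> None:
        """Q-learning / TD update on the utility of the entry that was acted upon."""
        nxt = self.retrieve(next_query_emb)
        target = reward + self.gamma * (nxt.q_value if nxt else 0.0)
        entry.q_value += self.alpha * (target - entry.q_value)
```

At runtime the agent calls retrieve() on the current query embedding, lets the frozen LLM condition on the suggested entry while acting, then records the new transition with add() and refines the selected entry's utility with update() using the observed reward; the LLM weights are never touched.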
Results & Findings
| Benchmark | Baseline (static memory) | MemRL | Relative Gain |
|---|---|---|---|
| HLE (language tasks) | 68.2 % | 77.5 % | +13.6 % |
| BigCodeBench (code generation) | 45.1 % | 58.3 % | +29.4 % |
| ALFWorld (interactive navigation) | 52.8 % | 64.9 % | +22.9 % |
| Lifelong Agent Bench (continual) | 61.4 % | 73.2 % | +19.2 % |
- Stability: The frozen LLM’s performance on earlier tasks never degrades, confirming the absence of catastrophic forgetting.
- Plasticity: Q‑values converge within a few hundred interactions, enabling rapid adaptation to new task distributions.
- Ablation: Removing Phase 2 (utility ranking) drops performance by ~10 %, highlighting the importance of learned Q‑values over pure semantic similarity.
Practical Implications
- Deploy‑time Skill Growth: SaaS products that embed LLM agents (e.g., code assistants, chatbots, autonomous UI agents) can now improve from real user interactions without costly model retraining pipelines.
- Cost‑Effective Continual Learning: Companies can avoid GPU‑intensive fine‑tuning cycles; the memory‑only RL updates run on CPUs or modest GPUs, dramatically lowering operational expenses.
- Safety & Auditing: Since the core LLM never changes, its baseline behavior remains auditable and verifiable, while the mutable memory can be inspected, logged, and rolled back if undesirable strategies emerge.
- Domain‑Specific Adaptation: Teams can seed the episodic memory with proprietary examples (e.g., internal APIs, coding conventions) and let the agent refine its usage over time, achieving a “personalized LLM” without exposing proprietary data to the model weights.
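As a sketch of the seeding workflow just described, reusing the hypothetical EpisodicMemory and MemoryEntry from the methodology sketch above together with an assumed embed() helper, proprietary examples enter memory as neutral entries whose utilities are later refined by runtime rewards:

```python
# Illustrative seeding of episodic memory with domain examples (all names assumed, not from the paper).
memory = EpisodicMemory()
for ex in load_internal_examples():              # hypothetical loader for proprietary examples
    memory.add(MemoryEntry(
        state=ex["situation"], action=ex["solution"],
        reward=0.0, next_state="",                # no outcome yet; utility starts neutral
        embedding=embed(ex["situation"]),         # assumed embedding helper
    ))
# Subsequent interactions raise or lower each entry's q_value via memory.update(...),
# without ever writing proprietary data into the model weights.
```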
Limitations & Future Work
- Memory Scalability: As the number of episodes grows, retrieval latency can increase; efficient indexing (e.g., IVF‑PQ) or memory pruning strategies are needed for long‑running services (a generic indexing sketch follows this list).
- Reward Design: The framework relies on well‑shaped reward signals; sparse or noisy rewards can slow Q‑value convergence, suggesting a need for reward shaping or auxiliary learning signals.
- Generalization Beyond Retrieval: MemRL excels when the solution can be assembled from past examples; tasks requiring fundamentally novel reasoning may still need parameter updates.
- Future Directions: The authors propose integrating meta‑RL to adapt the Q‑learning hyper‑parameters on the fly, exploring hierarchical memory structures for multi‑step planning, and extending the approach to multimodal agents (vision‑language).
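To make the indexing suggestion under Memory Scalability concrete, the snippet below builds a generic FAISS IVF‑PQ index over memory embeddings; the dimensionality, cluster count, and quantizer settings are arbitrary placeholders, and this is standard FAISS usage rather than code from the paper.

```python
# Generic IVF-PQ indexing of memory embeddings with FAISS (illustrative parameters).
import faiss
import numpy as np

d = 768                         # embedding dimension (assumed)
nlist, m, nbits = 128, 32, 8    # IVF clusters, PQ sub-quantizers, bits per code

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

embeddings = np.random.rand(10_000, d).astype("float32")  # stand-in for stored memory embeddings
faiss.normalize_L2(embeddings)       # with unit vectors, L2 ranking matches cosine similarity
index.train(embeddings)              # IVF-PQ needs a training pass before adding vectors
index.add(embeddings)

index.nprobe = 16                    # clusters probed per query: recall/latency trade-off
query = embeddings[:1].copy()
distances, ids = index.search(query, k=16)  # Phase-1 style top-k candidate filtering
```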
Authors
- Shengtao Zhang
- Jiaqian Wang
- Ruiwen Zhou
- Junwei Liao
- Yuchen Feng
- Weinan Zhang
- Ying Wen
- Zhiyu Li
- Feiyu Xiong
- Yutao Qi
- Bo Tang
- Muning Wen
Paper Information
- arXiv ID: 2601.03192v1
- Categories: cs.CL
- Published: January 6, 2026