[Paper] MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory

Published: January 6, 2026
4 min read
Source: arXiv (2601.03192v1)

Overview

MemRL introduces a new way for large language model (LLM) agents to learn on the fly by treating their episodic memory as a reinforcement‑learning (RL) playground. Instead of repeatedly fine‑tuning the massive model (which is costly and forgets old skills), MemRL keeps the LLM frozen and lets a lightweight, non‑parametric memory module evolve its retrieval policy through trial‑and‑error feedback from the environment. The result is an agent that can continuously improve its problem‑solving repertoire at runtime.

Key Contributions

  • Two‑Phase Retrieval: First filters memory entries by semantic similarity, then ranks the remaining candidates with learned Q‑values that reflect their utility for the current task.
  • Non‑Parametric RL on Memory: Applies classic Q‑learning updates directly to the episodic memory store, sidestepping expensive gradient‑based fine‑tuning.
  • Stability‑Plasticity Separation: Keeps the LLM’s reasoning core frozen (stable) while allowing the memory to adapt (plastic), eliminating catastrophic forgetting.
  • Broad Benchmark Validation: Shows consistent gains on diverse suites: HLE (Humanity's Last Exam), BigCodeBench (code generation), ALFWorld (interactive simulation), and Lifelong Agent Bench (continual learning).
  • Runtime Self‑Evolution: Demonstrates that agents can improve during deployment without any weight updates, purely by refining memory utilities.
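
To make these memory-centric pieces concrete, the sketch below shows what an episodic memory entry and its Phase‑1 similarity score might look like. It is an illustrative reading of the contributions above, not code from the paper; the field names and the per-entry Q‑value are assumptions.

```python
# Illustrative sketch only; field names and the tabular Q-value are assumptions,
# not the paper's API.
from dataclasses import dataclass
import numpy as np


@dataclass
class MemoryEntry:
    """One stored episode: the experience itself plus a learned utility."""
    state: str              # textual description of the situation
    action: str             # what the agent did (e.g., a tool call or code edit)
    reward: float           # environment feedback for that action
    next_state: str         # resulting situation
    embedding: np.ndarray   # semantic index key (e.g., from the LLM's encoder)
    q_value: float = 0.0    # utility estimate, refined by runtime Q-learning


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Phase-1 relevance score between a query embedding and an entry embedding."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
```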

Methodology

  1. Frozen LLM Backbone – The large language model is loaded once and never updated; with its weights fixed, it supplies stable, high‑quality reasoning and generation.
  2. Episodic Memory Store – A database of past interaction tuples ⟨state, action, reward, next‑state⟩ is maintained. Each entry is indexed by a semantic embedding (e.g., using the LLM’s own encoder).
  3. Two‑Phase Retrieval
    • Phase 1 – Semantic Filtering: Given a new query, retrieve the top‑k memory entries whose embeddings are closest to the query embedding.
    • Phase 2 – Utility Ranking: For the filtered set, compute a Q‑value for each entry using a lightweight Q‑network (or even a tabular estimator). The entry with the highest Q‑value is selected as the “suggested action”.
  4. Runtime RL Loop (sketched in code after this list)
    • The agent executes the suggested action in the environment, observes the reward, and records the transition back into memory.
    • Q‑values are updated via standard Q‑learning (e.g., TD‑error) using the observed reward and the max‑Q of the next state.
    • Over time, high‑reward strategies acquire larger Q‑values, while noisy or low‑value memories are demoted.
  5. Continuous Deployment – Because only the memory and its Q‑values change, the system can run indefinitely on a production server without re‑training the massive LLM.
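
Pulling the numbered steps together, here is a hedged sketch of the runtime loop under a few assumptions: a tabular Q‑value stored on each entry (the paper's estimator may instead be a lightweight Q‑network), cosine similarity for Phase 1, and hypothetical embed_fn / act_fn / env callables standing in for the frozen LLM's encoder, its prompted generation, and the environment. It reuses MemoryEntry and cosine_similarity from the earlier snippet.

```python
# A hedged sketch of the MemRL-style runtime loop; not the paper's implementation.
from typing import Callable, List, Optional, Tuple
import numpy as np


class MemoryStore:
    def __init__(self, alpha: float = 0.1, gamma: float = 0.9):
        self.entries: List[MemoryEntry] = []
        self.alpha = alpha   # Q-learning step size
        self.gamma = gamma   # discount factor

    def semantic_filter(self, query_emb: np.ndarray, k: int = 20) -> List[MemoryEntry]:
        """Phase 1: keep the k entries whose embeddings are closest to the query."""
        ranked = sorted(self.entries,
                        key=lambda e: cosine_similarity(query_emb, e.embedding),
                        reverse=True)
        return ranked[:k]

    def utility_rank(self, candidates: List[MemoryEntry]) -> MemoryEntry:
        """Phase 2: among semantically relevant entries, pick the highest learned utility."""
        return max(candidates, key=lambda e: e.q_value)

    def td_update(self, entry: MemoryEntry, reward: float, next_emb: np.ndarray) -> None:
        """Tabular Q-learning: Q <- Q + alpha * (r + gamma * max_next_Q - Q)."""
        next_candidates = self.semantic_filter(next_emb)
        max_next_q = max((e.q_value for e in next_candidates), default=0.0)
        td_error = reward + self.gamma * max_next_q - entry.q_value
        entry.q_value += self.alpha * td_error


def runtime_step(memory: MemoryStore,
                 env,                                    # exposes step(action) -> (next_state, reward)
                 state: str,
                 embed_fn: Callable[[str], np.ndarray],  # e.g., the frozen LLM's encoder
                 act_fn: Callable[[str, Optional[MemoryEntry]], str],  # frozen LLM prompted with an exemplar
                 ) -> Tuple[str, float]:
    """One deploy-time iteration: retrieve, act, observe, write back, update Q."""
    query_emb = embed_fn(state)
    candidates = memory.semantic_filter(query_emb)
    exemplar = memory.utility_rank(candidates) if candidates else None
    action = act_fn(state, exemplar)
    next_state, reward = env.step(action)
    memory.entries.append(MemoryEntry(state, action, reward, next_state,
                                      query_emb, q_value=reward))
    if exemplar is not None:
        memory.td_update(exemplar, reward, embed_fn(next_state))
    return next_state, reward
```

Because only entries and their scalar q_value fields change, a loop like this touches no model weights, which is what makes indefinite, low-cost deployment plausible.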

Results & Findings

Benchmark | Baseline (static memory) | MemRL | Relative Gain
HLE (language tasks) | 68.2 % | 77.5 % | +13.6 %
BigCodeBench (code generation) | 45.1 % | 58.3 % | +29.4 %
ALFWorld (interactive navigation) | 52.8 % | 64.9 % | +22.9 %
Lifelong Agent Bench (continual learning) | 61.4 % | 73.2 % | +19.2 %
  • Stability: The frozen LLM’s performance on earlier tasks never degrades, confirming the absence of catastrophic forgetting.
  • Plasticity: Q‑values converge within a few hundred interactions, enabling rapid adaptation to new task distributions.
  • Ablation: Removing Phase 2 (utility ranking) drops performance by ~10 %, highlighting the importance of learned Q‑values over pure semantic similarity.

Practical Implications

  • Deploy‑time Skill Growth: SaaS products that embed LLM agents (e.g., code assistants, chatbots, autonomous UI agents) can now improve from real user interactions without costly model retraining pipelines.
  • Cost‑Effective Continual Learning: Companies can avoid GPU‑intensive fine‑tuning cycles; the memory‑only RL updates run on CPUs or modest GPUs, dramatically lowering operational expenses.
  • Safety & Auditing: Since the core LLM never changes, its baseline behavior remains auditable and verifiable, while the mutable memory can be inspected, logged, and rolled back if undesirable strategies emerge.
  • Domain‑Specific Adaptation: Teams can seed the episodic memory with proprietary examples (e.g., internal APIs, coding conventions) and let the agent refine its usage over time, achieving a “personalized LLM” without exposing proprietary data to the model weights.
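
As a purely hypothetical illustration of that seeding workflow (reusing the MemoryStore and MemoryEntry sketches above; the JSONL format and the optimistic initial Q‑value are assumptions, not details from the paper):

```python
# Hypothetical seeding script: preload the memory with curated in-house examples
# so the agent starts from proprietary conventions and refines them at runtime.
import json


def seed_memory(memory: MemoryStore, path: str, embed_fn) -> None:
    """Load curated (state, action) examples and give them an optimistic prior utility."""
    with open(path) as f:
        for line in f:
            example = json.loads(line)           # e.g., {"state": ..., "action": ...}
            memory.entries.append(MemoryEntry(
                state=example["state"],
                action=example["action"],
                reward=0.0,                       # no environment feedback yet
                next_state="",
                embedding=embed_fn(example["state"]),
                q_value=0.5,                      # optimistic prior so seeded entries get tried
            ))
```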

Limitations & Future Work

  • Memory Scalability: As the number of episodes grows, retrieval latency can increase; efficient indexing (e.g., IVF‑PQ, sketched after this list) or memory pruning strategies are needed for long‑running services.
  • Reward Design: The framework relies on well‑shaped reward signals; sparse or noisy rewards can slow Q‑value convergence, suggesting a need for reward shaping or auxiliary learning signals.
  • Generalization Beyond Retrieval: MemRL excels when the solution can be assembled from past examples; tasks requiring fundamentally novel reasoning may still need parameter updates.
  • Future Directions: The authors propose integrating meta‑RL to adapt the Q‑learning hyper‑parameters on the fly, exploring hierarchical memory structures for multi‑step planning, and extending the approach to multimodal agents (vision‑language).
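
On the memory-scalability point, a common mitigation is to replace a brute-force Phase‑1 scan with an approximate nearest-neighbour index. Below is a minimal sketch using FAISS's IVF‑PQ index; the embedding dimension, index parameters, and nprobe setting are assumptions chosen for illustration, not values from the paper.

```python
# Minimal IVF-PQ sketch with FAISS (assumed parameters, stand-in data) for
# speeding up Phase-1 semantic filtering over a large episodic memory.
import numpy as np
import faiss

d, nlist, m = 128, 64, 16                  # embedding dim, coarse cells, PQ subquantizers (assumed)
quantizer = faiss.IndexFlatL2(d)           # coarse quantizer for the inverted lists
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)  # 8 bits per sub-code

memory_embeddings = np.random.rand(10_000, d).astype("float32")  # stand-in for stored entries
index.train(memory_embeddings)             # learn the coarse cells and PQ codebooks
index.add(memory_embeddings)

index.nprobe = 8                           # cells visited per query: recall vs. latency trade-off
query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 20)   # Phase-1 candidates, then Phase-2 Q-ranking as before
```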

Authors

  • Shengtao Zhang
  • Jiaqian Wang
  • Ruiwen Zhou
  • Junwei Liao
  • Yuchen Feng
  • Weinan Zhang
  • Ying Wen
  • Zhiyu Li
  • Feiyu Xiong
  • Yutao Qi
  • Bo Tang
  • Muning Wen

Paper Information

  • arXiv ID: 2601.03192v1
  • Categories: cs.CL
  • Published: January 6, 2026