[Paper] ParamMem: Augmenting Language Agents with Parametric Reflective Memory

Published: February 26, 2026 at 01:28 PM EST
5 min read
Source: arXiv - 2602.23320v1

Overview

The paper introduces ParamMem, a new “parametric memory” component that lets language‑based agents remember how they have reflected on past problems and reuse those patterns to generate richer, more diverse self‑reflections. By coupling ParamMem with traditional episodic (short‑term) and cross‑sample (long‑term) memories, the authors build ParamAgent, a framework that consistently boosts performance on code generation, math reasoning, and multi‑hop QA tasks.

Key Contributions

  • Parametric Reflective Memory (ParamMem): a lightweight module that stores reflection patterns directly in model parameters, enabling temperature‑controlled sampling of diverse self‑feedback.
  • ParamAgent framework: integrates ParamMem with episodic and cross‑sample memories, creating a unified architecture for iterative self‑reflection.
  • Empirical link between reflective diversity and success: systematic analysis shows that higher diversity in reflection signals strongly correlates with task accuracy.
  • Strong, sample‑efficient gains: across three benchmark suites (code generation, mathematical reasoning, multi‑hop QA) ParamAgent outperforms the previous state‑of‑the‑art reflective agents by 3–9 % absolute.
  • Cross‑scale transfer: a small ParamMem trained on a modest model can be transplanted to larger models, delivering immediate performance lifts without additional data.
  • Self‑improvement without stronger external models: the agent can bootstrap its own reasoning ability, reducing reliance on expensive “teacher” models.

Methodology

  1. Reflection Generation Loop – The agent solves a problem, then asks itself “What went wrong?” and generates a textual reflection. This loop repeats until a stopping criterion (e.g., confidence threshold) is met.
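The loop above can be sketched in a few lines. The helper names (`solve`, `reflect`, `confidence`) are hypothetical stand-ins for the agent's components, not the paper's API:

```python
def reflective_solve(problem, solve, reflect, confidence,
                     threshold=0.9, max_iters=4):
    """Iteratively solve, self-reflect, and retry until confident."""
    reflections = []
    answer = solve(problem, reflections)
    for _ in range(max_iters):
        if confidence(problem, answer) >= threshold:
            break  # stopping criterion met
        # Ask "What went wrong?" and fold the reflection back in.
        reflections.append(reflect(problem, answer))
        answer = solve(problem, reflections)
    return answer, reflections
```

The stopping criterion here is a confidence threshold, as the paper mentions; a fixed iteration budget serves as a fallback.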
  2. ParamMem Design – Instead of storing reflections as raw text, ParamMem encodes patterns of useful reflections into a small set of trainable vectors (the “parametric memory”). During inference, the agent samples from these vectors using a temperature parameter; higher temperature yields more varied reflections.
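Temperature-controlled sampling over a small bank of memory vectors can be illustrated as follows. The relevance scoring (dot product against a query) is an assumption for the sketch; the paper's exact parameterization may differ:

```python
import numpy as np

def sample_memory(query, memory, temperature=1.0, rng=None):
    """Pick one memory vector; higher temperature -> more uniform choice."""
    rng = rng or np.random.default_rng(0)
    scores = memory @ query                   # relevance logits
    scores = scores / max(temperature, 1e-8)  # temperature scaling
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # softmax over memory slots
    idx = rng.choice(len(memory), p=probs)
    return memory[idx], probs
```

At low temperature the most relevant pattern dominates; at high temperature the distribution flattens, yielding the more varied reflections the paper relies on.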
  3. Memory Fusion
    • Episodic Memory: short‑term cache of the current problem’s intermediate steps.
    • Cross‑Sample Memory: a datastore of reflections from previous examples (retrieved via similarity search).
    • ParamMem: the learned, model‑internal source of diverse reflection signals.
      The three sources are concatenated and prepended to the language model’s context before each reflection step.
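A minimal sketch of the fusion step, assuming a simple text-concatenation format (the retrieval and prompt layout are illustrative, not the paper's exact template):

```python
def build_context(problem, episodic, cross_sample, parammem_reflection):
    """Concatenate all three memory signals ahead of the next reflection step."""
    parts = [
        "## Problem\n" + problem,
        "## Episodic memory (current attempt)\n" + "\n".join(episodic),
        "## Cross-sample memory (similar past cases)\n" + "\n".join(cross_sample),
        "## Reflective signal (ParamMem)\n" + parammem_reflection,
    ]
    return "\n\n".join(parts)
```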
  4. Training – The base language model (e.g., GPT‑Neo, LLaMA) is frozen. Only the ParamMem vectors and a lightweight projection layer are trained on a mixture of reflection‑augmented examples. The loss encourages the generated reflection to improve downstream answer accuracy.
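A back-of-envelope count shows why this is lightweight: only K memory vectors of dimension d plus a d×d projection are trainable, while the base model stays frozen. The sizes below are illustrative assumptions, not the paper's configuration:

```python
def trainable_params(num_vectors, dim):
    """Parameters updated during training (the base LM stays frozen)."""
    memory = num_vectors * dim     # ParamMem vectors
    projection = dim * dim + dim   # linear projection + bias
    return memory + projection

# e.g. 64 vectors of dim 1024 alongside a frozen 1.3B-parameter model
print(trainable_params(64, 1024))  # → 1115136, well under 0.1% of the base model
```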
  5. Evaluation – The authors test on:
    • HumanEval (code generation)
    • MATH (competition mathematics)
    • HotpotQA (multi‑hop question answering)
      Metrics include exact match / pass@k for code, accuracy for math, and F1/EM for QA.
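For reference, pass@k for code is usually computed with the unbiased estimator from the HumanEval evaluation protocol: given n sampled completions of which c pass the tests, it estimates the probability that at least one of k draws passes.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```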

Results & Findings

| Benchmark | Baseline (no reflection) | Prior reflective agent | ParamAgent |
|---|---|---|---|
| HumanEval (pass@1) | 38.2 % | 41.7 % | 45.9 % |
| MATH (accuracy) | 28.4 % | 31.1 % | 35.6 % |
| HotpotQA (EM) | 62.3 % | 66.0 % | 70.8 % |
  • Reflective Diversity Matters: Pearson r ≈ 0.78 between diversity score (entropy of sampled reflections) and task success.
  • Sample Efficiency: With only 5 k annotated reflections, ParamMem reaches >90 % of its final performance; adding more data yields diminishing returns.
  • Weak‑to‑Strong Transfer: A ParamMem trained on a 1.3 B‑parameter model improves a 7 B‑parameter model by +4 % accuracy, demonstrating that the learned reflection patterns are model‑agnostic.
  • Self‑Improvement Loop: After a few self‑reflection cycles, the agent’s answer quality surpasses that of a stronger “teacher” model that provided the initial reflections, confirming the bootstrap capability.
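The diversity analysis amounts to measuring the entropy of the reflection-sampling distribution and correlating it with per-task success. A self-contained sketch of both quantities (the r ≈ 0.78 figure itself comes from the paper's experiments, not this toy code):

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a reflection-sampling distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def pearson_r(xs, ys):
    """Pearson correlation between diversity scores and success rates."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```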

Practical Implications

  • Developer Tooling: IDE plugins that embed ParamAgent could suggest richer debugging hints or alternative implementations during code generation, reducing the need for repeated manual prompting.
  • Low‑Cost Reasoning Services: SaaS platforms can deploy a modest‑size LLM with ParamMem to achieve performance comparable to larger, more expensive models, cutting cloud compute bills.
  • Continuous Learning Systems: Because ParamMem can be updated with a handful of new reflection examples, products can adapt to domain‑specific quirks (e.g., finance‑oriented math) without full model retraining.
  • Safety & Explainability: Diverse self‑reflections expose failure modes early, enabling automated filters to catch hallucinations before they reach end‑users.
  • Cross‑Model Portability: Teams can train a single ParamMem once and ship it to multiple model back‑ends (open‑source or proprietary), simplifying maintenance.

Limitations & Future Work

  • Memory Size vs. Diversity Trade‑off: ParamMem’s capacity is bounded; extremely diverse tasks may exhaust its expressive power, requiring hierarchical or dynamic memory expansion.
  • Reliance on Quality Reflections: The training data still needs high‑quality human‑written reflections; noisy or biased reflections can degrade performance.
  • Evaluation Scope: Experiments focus on well‑structured benchmarks; real‑world conversational agents with open‑ended dialogues remain untested.
  • Future Directions: The authors suggest (1) scaling ParamMem with sparse‑update techniques, (2) integrating reinforcement learning from human feedback to refine reflection policies, and (3) exploring multimodal reflections (e.g., visual debugging hints) for code‑centric agents.

Authors

  • Tianjun Yao
  • Yongqiang Chen
  • Yujia Zheng
  • Pan Li
  • Zhiqiang Shen
  • Kun Zhang

Paper Information

  • arXiv ID: 2602.23320v1
  • Categories: cs.LG, cs.MA
  • Published: February 26, 2026
  • PDF: Download PDF