[Paper] MSSR: Memory-Aware Adaptive Replay for Continual LLM Fine-Tuning

Published: March 10, 2026 at 12:49 PM EDT
5 min read
Source: arXiv


Overview

Continual fine‑tuning of large language models (LLMs) is becoming a daily reality as companies push models into ever‑changing production environments. The new paper “MSSR: Memory‑Aware Adaptive Replay for Continual LLM Fine‑Tuning” tackles the classic problem of catastrophic forgetting—the tendency of a model to lose previously learned abilities when it is trained on new tasks. By introducing a memory‑inspired sampling and scheduling mechanism, the authors show how to keep old knowledge alive without sacrificing the speed needed for rapid adaptation.

Key Contributions

  • Memory‑Inspired Sampler: Estimates a sample‑level memory strength that reflects how well a particular example is retained after each training step.
  • Adaptive Scheduler: Dynamically decides when to replay each stored example, moving away from fixed‑interval or heuristic‑only replay strategies.
  • Lightweight Replay Framework (MSSR): Achieves state‑of‑the‑art forgetting mitigation with far lower computational overhead than loss‑driven or accuracy‑supervised replay baselines.
  • Broad Empirical Validation: Experiments on three backbone LLMs (LLaMA‑7B, Falcon‑7B, and Mistral‑7B) across 11 sequential tasks—including reasoning‑heavy and multiple‑choice benchmarks—demonstrate consistent gains.
  • Open‑source Friendly Design: The replay buffer and scheduling logic are implemented as plug‑and‑play modules that can be dropped into existing fine‑tuning pipelines (e.g., Hugging Face Trainer, DeepSpeed).

Methodology

  1. Retention Modeling: After each gradient update, MSSR measures the change in loss for every example stored in the replay buffer. A small loss increase signals that the example is still “fresh” in memory, while a large increase indicates it is being forgotten. This per‑sample metric becomes the memory strength score.
  2. Memory‑Inspired Sampling: When the buffer reaches capacity, MSSR preferentially retains examples with low memory strength (i.e., those at risk of being forgotten) and discards those that are already well‑remembered. This keeps the buffer focused on the most vulnerable knowledge.
  3. Adaptive Replay Scheduling: Instead of replaying all buffered samples at every training step, MSSR assigns each example an interval based on its current memory strength. Highly forgotten samples are replayed more frequently, while stable ones are revisited sparsely. The schedule is updated on‑the‑fly, so the system reacts to the actual forgetting dynamics rather than a static rule.
  4. Integration with Standard Fine‑Tuning: The replay step is simply interleaved with the regular mini‑batch updates. Because the scheduler only pulls a small, targeted subset of the buffer each step, the extra compute is modest (≈ 10‑15 % overhead in the authors’ experiments).
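The four steps above can be sketched as a small buffer class. This is an illustrative reconstruction from the paper's description, not the authors' code: the exact strength‑to‑interval mapping and the eviction rule are assumptions.

```python
class MemoryAwareReplayBuffer:
    """Illustrative MSSR-style buffer: retains at-risk examples and
    schedules replay by per-sample memory strength (a sketch, not the
    authors' implementation)."""

    def __init__(self, capacity, min_interval=1, max_interval=32):
        self.capacity = capacity
        self.min_interval = min_interval
        self.max_interval = max_interval
        # sample_id -> {"sample", "last_loss", "strength", "next_replay"}
        self.items = {}

    def add(self, sample_id, sample, loss, step):
        """Step 2: when full, evict the best-remembered example."""
        if len(self.items) >= self.capacity:
            evict_id = max(self.items, key=lambda k: self.items[k]["strength"])
            del self.items[evict_id]
        self.items[sample_id] = {
            "sample": sample,
            "last_loss": loss,
            "strength": 0.0,  # retention unknown yet => treat as at-risk
            "next_replay": step + self.min_interval,
        }

    def update_strength(self, sample_id, new_loss, step):
        """Step 1: a loss increase since the last visit signals forgetting."""
        entry = self.items[sample_id]
        delta = max(new_loss - entry["last_loss"], 0.0)
        entry["last_loss"] = new_loss
        # Larger loss increase => lower memory strength (assumed mapping).
        entry["strength"] = 1.0 / (1.0 + delta)
        # Step 3: stable memories get long replay intervals, fragile ones short.
        span = self.max_interval - self.min_interval
        entry["next_replay"] = step + int(self.min_interval + entry["strength"] * span)

    def due_for_replay(self, step, k=4):
        """Step 4: pull only a small, targeted subset each training step."""
        due = sorted(
            (e["strength"], sid)
            for sid, e in self.items.items()
            if e["next_replay"] <= step
        )
        return [sid for _, sid in due[:k]]  # most at-risk first
```

The replay step then only touches the `k` examples whose interval has elapsed, which is where the modest (≈ 10‑15 %) overhead reported by the authors comes from.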

Results & Findings

Accuracy by replay strategy:

| Model / Task Set | Baseline (no replay) | Fixed‑Interval Replay | Loss‑Driven Replay | MSSR (proposed) |
|---|---|---|---|---|
| LLaMA‑7B (reasoning) | 42.3 % | 48.7 % | 51.2 % | 55.8 % |
| Falcon‑7B (MCQA) | 38.9 % | 44.1 % | 46.5 % | 51.3 % |
| Mistral‑7B (mixed) | 45.6 % | 50.2 % | 52.0 % | 56.7 % |
  • Consistent Forgetting Reduction: Across all 11 sequential tasks, MSSR lowered the average performance drop on earlier tasks by ≈ 30 % compared to the no‑replay baseline.
  • Efficiency: The adaptive scheduler cut replay‑related FLOPs by ~40 % relative to loss‑driven replay while delivering higher accuracy.
  • Robustness to Buffer Size: Even with a tiny buffer (0.5 % of the total training data), MSSR outperformed larger‑buffer baselines, highlighting the strength of its memory‑aware selection.

Practical Implications

  • Production‑Ready Continual Learning: Companies can now fine‑tune a single LLM on a stream of customer‑specific tasks (e.g., domain‑adaptation, policy updates) without maintaining separate model copies for each version.
  • Cost‑Effective Model Maintenance: Because MSSR needs only a modest replay buffer and adds minimal compute, it fits well into existing GPU‑budgeted training pipelines, reducing the need for expensive retraining from scratch.
  • Improved Reliability for Critical Applications: For use‑cases such as medical QA or legal assistance, preserving previously learned factual knowledge while adding new guidelines is essential—MSSR offers a systematic way to do that.
  • Plug‑and‑Play Integration: The authors released a lightweight PyTorch‑compatible library that can be wrapped around any Trainer‑style loop, making adoption as simple as adding two lines of code.
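The released library's actual API is not reproduced in this summary, so the following is a purely hypothetical sketch of how such a replay step could be interleaved with a custom fine‑tuning loop. `fine_tune`, `train_step`, `eval_loss`, and the buffer interface are stand‑in names, not the authors' package.

```python
def fine_tune(model, new_task_stream, buffer, train_step, eval_loss, replay_k=4):
    """Interleave MSSR-style replay with regular updates (hypothetical sketch).

    `buffer` is assumed to expose `due_for_replay`, `items`, and
    `update_strength`; `train_step` and `eval_loss` wrap the user's own
    training and evaluation code.
    """
    for step, batch in enumerate(new_task_stream):
        train_step(model, batch)  # regular mini-batch update on the new task
        for sample_id in buffer.due_for_replay(step, k=replay_k):
            sample = buffer.items[sample_id]["sample"]
            train_step(model, [sample])  # replay a single at-risk example
            # Refresh the memory-strength estimate and reschedule.
            buffer.update_strength(sample_id, eval_loss(model, sample), step)
```

Because the replay work is confined to the few examples due at each step, this pattern drops into a Trainer‑style loop without restructuring the surrounding pipeline.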

Limitations & Future Work

  • Memory Strength Approximation: The current metric relies on loss changes, which can be noisy for highly stochastic training regimes; more robust estimators (e.g., gradient‑norm based) could improve stability.
  • Scalability to Multi‑Billion‑Parameter LLMs: Experiments were limited to ≤ 7 B‑parameter models; extending MSSR to 30 B+ models may require distributed buffer management and further overhead reductions.
  • Task Diversity: The benchmark suite focuses on reasoning and multiple‑choice tasks; evaluating MSSR on generative or code‑completion streams would broaden its applicability.
  • Theoretical Guarantees: While empirical results are strong, a formal analysis of the convergence properties of adaptive replay schedules remains an open research direction.

Bottom line: MSSR offers a practical, memory‑aware recipe for keeping large language models sharp as they learn continuously—an advance that could reshape how developers maintain and evolve LLM‑powered services.

Authors

  • Yiyang Lu
  • Yu He
  • Jianlong Chen
  • Hongyuan Zha

Paper Information

  • arXiv ID: 2603.09892v1
  • Categories: cs.LG, cs.AI, cs.CL
  • Published: March 10, 2026
