[Paper] MSSR: Memory-Aware Adaptive Replay for Continual LLM Fine-Tuning

Published: March 10, 2026 at 12:49 PM EDT
5 min read
Source: arXiv


Overview

Continual fine‑tuning of large language models (LLMs) is becoming a daily reality as companies push models into ever‑changing production environments. The new paper “MSSR: Memory‑Aware Adaptive Replay for Continual LLM Fine‑Tuning” tackles the classic problem of catastrophic forgetting—the tendency of a model to lose previously learned abilities when it is trained on new tasks. By introducing a memory‑inspired sampling and scheduling mechanism, the authors show how to keep old knowledge alive without sacrificing the speed needed for rapid adaptation.

Key Contributions

  • Memory‑Inspired Sampler: Estimates a sample‑level memory strength that reflects how well a particular example is retained after each training step.
  • Adaptive Scheduler: Dynamically decides when to replay each stored example, moving away from fixed‑interval or heuristic‑only replay strategies.
  • Lightweight Replay Framework (MSSR): Achieves state‑of‑the‑art forgetting mitigation with far lower computational overhead than loss‑driven or accuracy‑supervised replay baselines.
  • Broad Empirical Validation: Experiments on three backbone LLMs (LLaMA‑7B, Falcon‑7B, and Mistral‑7B) across 11 sequential tasks—including reasoning‑heavy and multiple‑choice benchmarks—demonstrate consistent gains.
  • Open‑source Friendly Design: The replay buffer and scheduling logic are implemented as plug‑and‑play modules that can be dropped into existing fine‑tuning pipelines (e.g., Hugging Face Trainer, DeepSpeed).

Methodology

  1. Retention Modeling: After each gradient update, MSSR measures the change in loss for every example stored in the replay buffer. A small loss increase signals that the example is still “fresh” in memory, while a large increase indicates it is being forgotten. This per‑sample metric becomes the memory strength score.
  2. Memory‑Inspired Sampling: When the buffer reaches capacity, MSSR preferentially retains examples with low memory strength (i.e., those at risk of being forgotten) and discards those that are already well‑remembered. This keeps the buffer focused on the most vulnerable knowledge.
  3. Adaptive Replay Scheduling: Instead of replaying all buffered samples at every training step, MSSR assigns each example an interval based on its current memory strength. Highly forgotten samples are replayed more frequently, while stable ones are revisited sparsely. The schedule is updated on‑the‑fly, so the system reacts to the actual forgetting dynamics rather than a static rule.
  4. Integration with Standard Fine‑Tuning: The replay step is simply interleaved with the regular mini‑batch updates. Because the scheduler only pulls a small, targeted subset of the buffer each step, the extra compute is modest (≈ 10‑15 % overhead in the authors’ experiments).
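The four steps above can be sketched as a small buffer class. This is an illustrative reconstruction from the paper's description, not the authors' code: the exact strength‑to‑interval mapping and the eviction rule are assumptions.

```python
class MemoryAwareReplayBuffer:
    """Illustrative MSSR-style buffer: retains at-risk examples and
    schedules replay by per-sample memory strength (a sketch, not the
    authors' implementation)."""

    def __init__(self, capacity, min_interval=1, max_interval=32):
        self.capacity = capacity
        self.min_interval = min_interval
        self.max_interval = max_interval
        # sample_id -> {"sample", "last_loss", "strength", "next_replay"}
        self.items = {}

    def add(self, sample_id, sample, loss, step):
        """Step 2: when full, evict the best-remembered example."""
        if len(self.items) >= self.capacity:
            evict_id = max(self.items, key=lambda k: self.items[k]["strength"])
            del self.items[evict_id]
        self.items[sample_id] = {
            "sample": sample,
            "last_loss": loss,
            "strength": 0.0,  # retention unknown yet => treat as at-risk
            "next_replay": step + self.min_interval,
        }

    def update_strength(self, sample_id, new_loss, step):
        """Step 1: a loss increase since the last visit signals forgetting."""
        entry = self.items[sample_id]
        delta = max(new_loss - entry["last_loss"], 0.0)
        entry["last_loss"] = new_loss
        # Larger loss increase => lower memory strength (assumed mapping).
        entry["strength"] = 1.0 / (1.0 + delta)
        # Step 3: stable memories get long replay intervals, fragile ones short.
        span = self.max_interval - self.min_interval
        entry["next_replay"] = step + int(self.min_interval + entry["strength"] * span)

    def due_for_replay(self, step, k=4):
        """Step 4: pull only a small, targeted subset each training step."""
        due = sorted(
            (e["strength"], sid)
            for sid, e in self.items.items()
            if e["next_replay"] <= step
        )
        return [sid for _, sid in due[:k]]  # most at-risk first
```

The replay step then only touches the `k` examples whose interval has elapsed, which is where the modest (≈ 10‑15 %) overhead reported by the authors comes from.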

Results & Findings

Accuracy by replay strategy:

| Model / Task Set | Baseline (no replay) | Fixed‑Interval Replay | Loss‑Driven Replay | MSSR (proposed) |
|---|---|---|---|---|
| LLaMA‑7B (reasoning) | 42.3 % | 48.7 % | 51.2 % | 55.8 % |
| Falcon‑7B (MCQA) | 38.9 % | 44.1 % | 46.5 % | 51.3 % |
| Mistral‑7B (mixed) | 45.6 % | 50.2 % | 52.0 % | 56.7 % |
  • Consistent Forgetting Reduction: Across all 11 sequential tasks, MSSR lowered the average performance drop on earlier tasks by ≈ 30 % compared to the no‑replay baseline.
  • Efficiency: The adaptive scheduler cut replay‑related FLOPs by ~40 % relative to loss‑driven replay while delivering higher accuracy.
  • Robustness to Buffer Size: Even with a tiny buffer (0.5 % of the total training data), MSSR outperformed larger‑buffer baselines, highlighting the strength of its memory‑aware selection.

Practical Implications

  • Production‑Ready Continual Learning: Companies can now fine‑tune a single LLM on a stream of customer‑specific tasks (e.g., domain‑adaptation, policy updates) without maintaining separate model copies for each version.
  • Cost‑Effective Model Maintenance: Because MSSR needs only a modest replay buffer and adds minimal compute, it fits well into existing GPU‑budgeted training pipelines, reducing the need for expensive retraining from scratch.
  • Improved Reliability for Critical Applications: For use‑cases such as medical QA or legal assistance, preserving previously learned factual knowledge while adding new guidelines is essential—MSSR offers a systematic way to do that.
  • Plug‑and‑Play Integration: The authors released a lightweight PyTorch‑compatible library that can be wrapped around any Trainer‑style loop, making adoption as simple as adding two lines of code.
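The released library's actual API is not reproduced in this summary, so the following is a purely hypothetical sketch of how such a replay step could be interleaved with a custom fine‑tuning loop. `fine_tune`, `train_step`, `eval_loss`, and the buffer interface are stand‑in names, not the authors' package.

```python
def fine_tune(model, new_task_stream, buffer, train_step, eval_loss, replay_k=4):
    """Interleave MSSR-style replay with regular updates (hypothetical sketch).

    `buffer` is assumed to expose `due_for_replay`, `items`, and
    `update_strength`; `train_step` and `eval_loss` wrap the user's own
    training and evaluation code.
    """
    for step, batch in enumerate(new_task_stream):
        train_step(model, batch)  # regular mini-batch update on the new task
        for sample_id in buffer.due_for_replay(step, k=replay_k):
            sample = buffer.items[sample_id]["sample"]
            train_step(model, [sample])  # replay a single at-risk example
            # Refresh the memory-strength estimate and reschedule.
            buffer.update_strength(sample_id, eval_loss(model, sample), step)
```

Because the replay work is confined to the few examples due at each step, this pattern drops into a Trainer‑style loop without restructuring the surrounding pipeline.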

Limitations & Future Work

  • Memory Strength Approximation: The current metric relies on loss changes, which can be noisy for highly stochastic training regimes; more robust estimators (e.g., gradient‑norm based) could improve stability.
  • Scalability to Multi‑Billion‑Parameter LLMs: Experiments were limited to ≤ 7 B‑parameter models; extending MSSR to 30 B+ models may require distributed buffer management and further overhead reductions.
  • Task Diversity: The benchmark suite focuses on reasoning and multiple‑choice tasks; evaluating MSSR on generative or code‑completion streams would broaden its applicability.
  • Theoretical Guarantees: While empirical results are strong, a formal analysis of the convergence properties of adaptive replay schedules remains an open research direction.

Bottom line: MSSR offers a practical, memory‑aware recipe for keeping large language models sharp as they learn continuously—an advance that could reshape how developers maintain and evolve LLM‑powered services.

Authors

  • Yiyang Lu
  • Yu He
  • Jianlong Chen
  • Hongyuan Zha

Paper Information

  • arXiv ID: 2603.09892v1
  • Categories: cs.LG, cs.AI, cs.CL
  • Published: March 10, 2026
