[Paper] Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

Published: May 7, 2026 at 01:57 PM EDT

Source: arXiv - 2605.06654v1

Overview

The authors investigate a simple idea: when you fine‑tune all parameters of a large language model (LLM), keep using the same optimizer that was used during pre‑training. Their experiments show that this “optimizer‑model consistency” reduces catastrophic forgetting while matching or exceeding the downstream performance of fine‑tuning pipelines that switch optimizers or rely on parameter‑efficient methods such as LoRA.

Key Contributions

Empirical discovery of optimizer‑model consistency: Full‑parameter fine‑tuning with the pre‑training optimizer consistently forgets less than alternative optimizers or LoRA‑based methods.
Regularization perspective: Demonstrates that optimizers act as implicit regularizers on hidden activations, shaping the loss landscape around the pre‑trained checkpoint.
Theoretical insight: Shows that, given the optimizer‑induced regularization, the optimal fine‑tuning weight updates must follow a specific structure that is naturally produced when the same optimizer is reused.
Optimizer comparison (Muon vs. AdamW): Provides a controlled study revealing that Muon, which encourages rote memorization, harms reasoning‑task fine‑tuning compared to AdamW.
Synthetic language‑model experiment: Isolates the memorization effect and confirms that strong memorization impedes pattern learning when only a small fine‑tuning dataset is available.
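
The regularization perspective above can be read schematically as a penalized objective. The notation below is an assumption made for illustration and is not taken from the paper: $\mathcal{L}_{\mathrm{task}}$ is the fine‑tuning loss, $h_{\theta}$ the hidden activations, $R_{\mathcal{O}}$ the implicit penalty associated with optimizer $\mathcal{O}$, and $\lambda \ge 0$ an illustrative weight.

```latex
\theta_{\mathrm{ft}} \in \arg\min_{\theta} \; \mathcal{L}_{\mathrm{task}}(\theta) + \lambda \, R_{\mathcal{O}}\!\left(h_{\theta}\right)
```

Under this reading, reusing the pre‑training optimizer keeps $R_{\mathcal{O}}$ the same in both stages, whereas switching optimizers would swap in a different penalty from the one that shaped the checkpoint.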

Methodology

Controlled fine‑tuning experiments – The authors take several publicly available LLM checkpoints (e.g., GPT‑2‑like models) and fine‑tune them on downstream tasks (classification, reasoning, etc.) under three setups: the original pre‑training optimizer (AdamW or Muon), a mismatched optimizer, and LoRA.
Forgetting measurement – After fine‑tuning, they evaluate the model on a held‑out “pre‑training” test set (e.g., language modeling perplexity) to quantify how much knowledge was lost, alongside the downstream task performance.
Activation regularization analysis – By tracking activation statistics (norms, variance) during pre‑training, they characterize each optimizer’s implicit regularization effect.
Theoretical modeling – They formalize the optimizer’s regularization as a penalty term in the loss and derive conditions under which the fine‑tuning gradient aligns with the pre‑training landscape, minimizing forgetting.
Synthetic memorization benchmark – A toy language modeling dataset is constructed where memorization vs. pattern learning can be measured directly, allowing a clean comparison of Muon and AdamW.
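
The forgetting measurement described above can be sketched as a comparison of held‑out language‑modeling loss before and after fine‑tuning. This is a minimal sketch, not the paper's evaluation code: the function name `forgetting_report`, the returned field names, and the example numbers are all hypothetical, and the losses are assumed to be mean per‑token cross‑entropy in nats.

```python
import math


def forgetting_report(lm_loss_before, lm_loss_after):
    """Summarize forgetting from held-out LM loss (mean per-token, in nats).

    Both arguments are assumed to be measured on the same held-out
    pre-training-style test set, before and after fine-tuning.
    """
    if lm_loss_before < 0 or lm_loss_after < 0:
        raise ValueError("cross-entropy losses must be non-negative")
    delta = lm_loss_after - lm_loss_before
    return {
        "loss_increase": delta,                  # > 0 means knowledge was lost
        "ppl_before": math.exp(lm_loss_before),  # perplexity = exp(loss)
        "ppl_after": math.exp(lm_loss_after),
        "ppl_ratio": math.exp(delta),            # multiplicative perplexity change
    }


# Illustrative numbers only; these are not results from the paper.
report = forgetting_report(lm_loss_before=3.00, lm_loss_after=3.10)
print(report)
```

Reporting the loss increase next to the downstream score makes the trade‑off between task performance and retained knowledge explicit for each optimizer pairing.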

Results & Findings

Experiment	Optimizer used in pre‑training	Optimizer used in fine‑tuning	Downstream task score	Forgetting (pre‑training LM loss)
Standard SFT (AdamW)	AdamW	AdamW (same)	↑↑ (baseline)	Small increase (low forgetting)
Mismatched optimizer	AdamW	SGD (mismatched)	Same / slightly lower	Larger increase (more forgetting)
LoRA (AdamW pre‑train)	AdamW	LoRA (AdamW)	Comparable	Noticeable forgetting
Muon pre‑train, AdamW fine‑tune	Muon	AdamW	↓ (worse)	High forgetting
Muon pre‑train, Muon fine‑tune	Muon	Muon (same)	Slightly better than mismatched	Still higher forgetting than AdamW‑AdamW

Optimizer‑model consistency yields the best trade‑off: using the identical optimizer across stages retains more pre‑training knowledge while achieving equal or better downstream accuracy.
AdamW outperforms Muon for reasoning tasks: Muon’s strong memorization bias harms fine‑tuning when data is scarce, consistent with the synthetic experiment’s conclusion.
Activation regularization patterns: AdamW encourages smoother activation distributions, creating a flatter loss landscape that is easier to navigate during fine‑tuning without destabilizing the pre‑trained weights.
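
The activation statistics mentioned in this summary (norms, variance) could be computed with a helper like the one below. It is a sketch under stated assumptions: `activation_stats` is a hypothetical name, activations are represented as plain lists of floats, and in a real setup they would be captured from the model's hidden layers rather than typed in by hand.

```python
import math


def activation_stats(batch):
    """Return the mean L2 norm and the element-wise variance of a batch.

    `batch` is a list of activation vectors (lists of floats), one per example.
    """
    if not batch or not batch[0]:
        raise ValueError("batch must contain at least one non-empty vector")
    # L2 norm of each activation vector
    norms = [math.sqrt(sum(x * x for x in vec)) for vec in batch]
    # Population variance over every element in the batch
    flat = [x for vec in batch for x in vec]
    mean = sum(flat) / len(flat)
    variance = sum((x - mean) ** 2 for x in flat) / len(flat)
    return {"mean_l2_norm": sum(norms) / len(norms), "variance": variance}


# Toy input: two 2-dimensional activation vectors.
stats = activation_stats([[3.0, 4.0], [0.0, 0.0]])
print(stats)
```

Logging these values at regular intervals for each optimizer would give the kind of per‑optimizer comparison the analysis relies on.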

Practical Implications

Simplify fine‑tuning pipelines – Teams can drop LoRA adapters or custom optimizer schedules and simply reuse the pre‑training optimizer, reducing engineering overhead.
Lower risk of catastrophic forgetting – Critical for applications that must retain general language abilities (e.g., chatbots that continue to answer open‑ended queries after task‑specific fine‑tuning).
Optimizer selection matters – When pre‑training with AdamW, stick with AdamW for downstream tasks; avoid optimizers that bias toward memorization (e.g., Muon) if you expect to fine‑tune on limited data.
Resource considerations – Full‑parameter fine‑tuning with the same optimizer can reuse the training stack and hardware configuration from pre‑training and needs no adapter‑specific code, although it keeps optimizer state for every parameter and therefore typically uses more memory than LoRA.
Guidance for open‑source model releases – Model providers can publish the optimizer hyper‑parameters alongside the checkpoint, enabling downstream users to replicate the consistency benefit out‑of‑the‑box.
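
One way to act on the last point is to ship a small optimizer description next to the checkpoint and check it before fine‑tuning. The sketch below is hypothetical throughout: the JSON layout, the function names `dump_optimizer_config` and `check_consistency`, and the hyper‑parameter values are illustrative and do not correspond to any standard or to the paper.

```python
import json


def dump_optimizer_config(name, hyperparams):
    """Serialize the pre-training optimizer description (hypothetical format)."""
    return json.dumps({"optimizer": name, "hyperparams": hyperparams},
                      sort_keys=True, indent=2)


def check_consistency(pretrain_json, finetune_optimizer):
    """Return True if the fine-tuning optimizer matches the recorded one.

    Only the optimizer name is compared (case-insensitively); hyper-parameters
    such as learning rate may legitimately differ between stages.
    """
    recorded = json.loads(pretrain_json)["optimizer"]
    return recorded.lower() == finetune_optimizer.lower()


# Illustrative values only; not taken from any released model.
sidecar = dump_optimizer_config(
    "AdamW", {"betas": [0.9, 0.95], "eps": 1e-8, "weight_decay": 0.1}
)
print(check_consistency(sidecar, "adamw"))  # True
print(check_consistency(sidecar, "SGD"))    # False
```

The check compares only the optimizer family, which leaves room to retune learning rate and weight decay for the downstream task.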

Limitations & Future Work

Scope of models – Experiments focus on medium‑scale LLMs; the effect remains to be validated on the newest multi‑billion‑parameter models, where optimizer dynamics may differ.
Task diversity – The study covers classification and reasoning tasks; other domains (e.g., code generation, multimodal fine‑tuning) need separate verification.
Hyper‑parameter sensitivity – While the same optimizer is used, the optimal learning‑rate and weight‑decay for fine‑tuning may still differ; the paper does not exhaustively explore this space.
Theoretical assumptions – The regularization analysis assumes smooth activation statistics; highly sparse or quantized models could break these assumptions.
Future directions – Promising avenues include extending the analysis to other optimizer families (e.g., RMSProp, Adafactor), exploring adaptive learning‑rate schedules that preserve consistency, and integrating the insight with parameter‑efficient methods (e.g., combining LoRA with same‑optimizer fine‑tuning).

Authors

Yuxing Liu
Jianyu Wang
Tong Zhang

Paper Information

arXiv ID: 2605.06654v1
Categories: cs.LG, cs.AI, math.OC
Published: May 7, 2026
