[Paper] Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
Source: arXiv - 2605.06654v1
Overview
The authors investigate a simple idea: when fine‑tuning a large language model (LLM), keep using the same optimizer that was used during pre‑training. Their experiments show that this “optimizer‑model consistency” reduces catastrophic forgetting while matching or surpassing the downstream performance of conventional fine‑tuning pipelines that switch optimizers or rely on parameter‑efficient methods such as LoRA.
Key Contributions
- Empirical discovery of optimizer‑model consistency: Full‑parameter fine‑tuning with the pre‑training optimizer consistently forgets less than alternative optimizers or LoRA‑based methods.
- Regularization perspective: Demonstrates that optimizers act as implicit regularizers on hidden activations, shaping the loss landscape around the pre‑trained checkpoint.
- Theoretical insight: Shows that, given the optimizer‑induced regularization, the optimal fine‑tuning weight updates must follow a specific structure that is naturally produced when the same optimizer is reused.
- Optimizer comparison (Muon vs. AdamW): Provides a controlled study revealing that Muon, which encourages rote memorization, harms reasoning‑task fine‑tuning compared to AdamW.
- Synthetic language‑model experiment: Isolates the memorization effect and confirms that strong memorization impedes pattern learning when only a small fine‑tuning dataset is available.
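One generic way to write down the regularization perspective is sketched below. The notation is assumed here for illustration and is not taken from the paper: $W$ denotes the model weights, $h(W)$ the hidden activations, $R_{\mathcal{O}}$ a penalty associated with optimizer $\mathcal{O}$, and $\lambda \ge 0$ its strength.

```latex
\[
\mathcal{L}_{\mathrm{eff}}(W) \;=\; \mathcal{L}_{\mathrm{data}}(W) \;+\; \lambda \, R_{\mathcal{O}}\bigl(h(W)\bigr)
\]
```

Read this way, a pre‑trained checkpoint sits near a minimum of the objective shaped by $R_{\mathcal{O}}$, and fine‑tuning with a different optimizer $\mathcal{O}'$ would implicitly pull toward a different penalty $R_{\mathcal{O}'}$. This is only an informal reading of the summary above; the paper's actual formulation may differ.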
Methodology
- Controlled fine‑tuning experiments – The authors take several publicly available LLM checkpoints (e.g., GPT‑2‑like models) and fine‑tune them on downstream tasks (classification, reasoning, etc.) under three setups: full fine‑tuning with the original pre‑training optimizer (AdamW or Muon), full fine‑tuning with a mismatched optimizer, and LoRA.
- Forgetting measurement – After fine‑tuning, they evaluate the model on a held‑out test set drawn from the pre‑training data (e.g., measuring language‑modeling perplexity) to quantify how much knowledge was lost, alongside the downstream task performance.
- Activation regularization analysis – By tracking activation statistics (norms, variance) during pre‑training, they characterize each optimizer’s implicit regularization effect.
- Theoretical modeling – They formalize the optimizer’s regularization as a penalty term in the loss and derive conditions under which the fine‑tuning gradient aligns with the pre‑training landscape, minimizing forgetting.
- Synthetic memorization benchmark – A toy language modeling dataset is constructed where memorization vs. pattern learning can be measured directly, allowing a clean comparison of Muon and AdamW.
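The forgetting measurement described above can be sketched in a few lines. This is a minimal illustration, assuming forgetting is defined as the increase in held‑out language‑modeling loss after fine‑tuning; the function names and the numbers are hypothetical and are not taken from the paper.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity from the mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll)


def forgetting_report(lm_loss_before: float, lm_loss_after: float,
                      task_score_before: float, task_score_after: float) -> dict:
    """Summarize downstream gain against forgetting.

    Forgetting is taken here as the increase in held-out LM loss after
    fine-tuning. This definition is an assumption made for illustration.
    """
    return {
        "lm_loss_increase": lm_loss_after - lm_loss_before,
        "perplexity_ratio": perplexity(lm_loss_after) / perplexity(lm_loss_before),
        "task_gain": task_score_after - task_score_before,
    }


# Hypothetical numbers, not results from the paper.
report = forgetting_report(lm_loss_before=3.00, lm_loss_after=3.10,
                           task_score_before=0.40, task_score_after=0.65)
```

Reporting both the loss increase and the task gain for every optimizer pairing makes the trade‑off explicit, rather than judging a fine‑tuning run by downstream score alone.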
Results & Findings
| Experiment | Pre‑training optimizer | Fine‑tuning optimizer | Downstream task score | Forgetting (increase in pre‑training LM loss) |
|---|---|---|---|---|
| Standard SFT (AdamW) | AdamW | AdamW (same) | ↑↑ (baseline) | Small increase (low forgetting) |
| Mismatched optimizer | AdamW | SGD (mismatched) | Same / slightly lower | Larger increase (more forgetting) |
| LoRA (AdamW pre‑train) | AdamW | LoRA (AdamW) | Comparable | Noticeable forgetting |
| Muon pre‑train, AdamW fine‑tune | Muon | AdamW | ↓ (worse) | High forgetting |
| Muon pre‑train, Muon fine‑tune | Muon | Muon (same) | Slightly better than mismatched | Still higher forgetting than AdamW‑AdamW |
- Optimizer‑model consistency yields the best trade‑off: identical optimizer across stages retains more pre‑training knowledge while achieving equal or better downstream accuracy.
- AdamW outperforms Muon for reasoning tasks: Muon’s strong memorization bias harms fine‑tuning when data is scarce, confirming the synthetic experiment’s conclusion.
- Activation regularization patterns: AdamW encourages smoother activation distributions, creating a flatter loss landscape that is easier to navigate during fine‑tuning without destabilizing the pre‑trained weights.
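To make the activation analysis concrete, the sketch below computes the two statistics named in the methodology (norms and variance) for one layer's activations. The function name and the toy batch are assumptions for illustration; the paper's exact statistics and tooling are not specified in this summary.

```python
import math
from statistics import fmean, pvariance


def activation_stats(activations: list[list[float]]) -> dict:
    """Summary statistics for a batch of hidden activation vectors.

    `activations` holds one vector per example for a single layer.
    """
    # L2 norm of each example's activation vector.
    norms = [math.sqrt(sum(x * x for x in row)) for row in activations]
    # Population variance over all activation entries in the batch.
    flat = [x for row in activations for x in row]
    return {
        "mean_l2_norm": fmean(norms),
        "variance": pvariance(flat),
    }


# Toy batch of two 2-dimensional activation vectors (hypothetical values).
batch = [[3.0, 4.0], [0.0, 0.0]]
stats = activation_stats(batch)
```

Logging these values per layer at regular intervals during training would give curves that can be compared across optimizers.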
Practical Implications
- Simplify fine‑tuning pipelines – Where full‑parameter fine‑tuning is affordable, teams can reuse the pre‑training optimizer instead of maintaining LoRA adapters or a separate optimizer setup, reducing engineering overhead.
- Lower risk of catastrophic forgetting – Critical for applications that must retain general language abilities (e.g., chatbots that continue to answer open‑ended queries after task‑specific fine‑tuning).
- Optimizer selection matters – If a model was pre‑trained with AdamW, fine‑tune it with AdamW. If you control pre‑training and expect to fine‑tune on limited data, prefer AdamW over optimizers that bias toward memorization (e.g., Muon).
- Resource planning – Full‑parameter fine‑tuning with the same optimizer needs no adapter layers and can run on the hardware configuration used for pre‑training. Unlike LoRA, however, it keeps gradients and optimizer state for every parameter, so its memory footprint is larger than that of adapter‑based fine‑tuning.
- Guidance for open‑source model releases – Model providers can publish the optimizer hyper‑parameters alongside the checkpoint, enabling downstream users to replicate the consistency benefit out‑of‑the‑box.
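As an illustration of the last point, the sketch below shows how a downstream user might reuse published optimizer settings. The JSON layout, field names, values, and helper functions are all hypothetical; no standard format for this exists in the source.

```python
import json

# Hypothetical sidecar file published next to a checkpoint. The field names
# and values are illustrative only, not a standard format.
PRETRAIN_OPTIMIZER_JSON = """
{
  "optimizer": "AdamW",
  "lr": 0.0003,
  "betas": [0.9, 0.95],
  "weight_decay": 0.1
}
"""


def finetune_optimizer_config(pretrain_cfg: dict, lr: float,
                              weight_decay=None) -> dict:
    """Keep the optimizer family and moment settings; override only lr and weight decay."""
    cfg = dict(pretrain_cfg)  # shallow copy so the published config is untouched
    cfg["lr"] = lr
    if weight_decay is not None:
        cfg["weight_decay"] = weight_decay
    return cfg


def is_consistent(pretrain_cfg: dict, finetune_cfg: dict) -> bool:
    """True when both stages use the same optimizer family."""
    return pretrain_cfg["optimizer"] == finetune_cfg["optimizer"]


pretrain_cfg = json.loads(PRETRAIN_OPTIMIZER_JSON)
finetune_cfg = finetune_optimizer_config(pretrain_cfg, lr=1e-5)
```

The learning rate and weight decay are left overridable because the best fine‑tuning values may differ from the pre‑training ones even when the optimizer family is kept.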
Limitations & Future Work
- Scope of models – Experiments focus on medium‑scale LLMs; the findings remain to be validated on the newest multi‑billion‑parameter models, where optimizer dynamics may differ.
- Task diversity – The study covers classification and reasoning tasks; other domains (e.g., code generation, multimodal fine‑tuning) need separate verification.
- Hyper‑parameter sensitivity – Even when the same optimizer is reused, the optimal learning rate and weight decay for fine‑tuning may differ from the pre‑training values; the paper does not exhaustively explore this space.
- Theoretical assumptions – The regularization analysis assumes smooth activation statistics; highly sparse or quantized models could break these assumptions.
- Future directions – Extending the analysis to optimizer families (e.g., RMSProp, Adafactor), exploring adaptive learning‑rate schedules that preserve consistency, and integrating the insight with parameter‑efficient methods (e.g., combining LoRA with same‑optimizer fine‑tuning) are promising avenues.
Authors
- Yuxing Liu
- Jianyu Wang
- Tong Zhang
Paper Information
- arXiv ID: 2605.06654v1
- Categories: cs.LG, cs.AI, math.OC
- Published: May 7, 2026