[Paper] Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation

Published: February 27, 2026
5 min read
Source: arXiv - 2602.24283v1

Overview

Training today’s massive language models relies heavily on adaptive optimizers such as Adam and its variants. While these methods boost convergence, they also store per‑parameter first‑ and second‑moment estimates (the “momentum” and variance buffers), inflating memory usage and limiting how large a model can be trained on a given GPU. The paper “Taming Momentum: Rethinking Optimizer States Through Low‑Rank Approximation” proposes a fresh way to view these momentum buffers, namely as the parameters of an online linear regressor, and leverages that view to compress them with low‑rank matrix factorisation. The resulting optimizer, LoRA‑Pre, slashes memory overhead while delivering state‑of‑the‑art performance on both pre‑training and fine‑tuning tasks.

Key Contributions

  • Reinterpretation of EMA momentum: Shows that the exponential moving average used in Adam‑style optimizers is mathematically equivalent to training a linear regressor via online gradient flow.
  • Low‑rank optimizer design (LoRA‑Pre): Decomposes the full momentum matrix into a compact low‑rank subspace, dramatically reducing the optimizer’s state size.
  • Empirical validation on Llama family: Demonstrates consistent performance gains across models ranging from 60 M to 1 B parameters, achieving the best results among all baselines.
  • Rank efficiency: Matches or exceeds baseline performance using only 1/8 of the rank (i.e., far fewer stored parameters).
  • Fine‑tuning superiority: Outperforms popular efficient fine‑tuning methods (e.g., standard LoRA) by 3.14 pts on Llama‑3.1‑8B and 6.17 pts on Llama‑2‑7B with the same rank budget.
  • Open‑source release: Full implementation and training scripts are available on GitHub.
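
The EMA reinterpretation in the first bullet admits a one‑line derivation. As a sketch from the standard Adam‑style update (with β the momentum coefficient and gₜ the current gradient): the EMA is exactly one gradient step, of size (1−β), on a squared‑error regression loss against the current gradient.

```latex
\begin{aligned}
% EMA update of the first moment in Adam-style optimizers:
m_t &= \beta\, m_{t-1} + (1-\beta)\, g_t \\
% ...is one gradient step of size (1-\beta) on the regression loss
% \ell_t(m) = \tfrac{1}{2}\lVert m - g_t \rVert^2, since:
m_{t-1} - (1-\beta)\,\nabla \ell_t(m_{t-1})
  &= m_{t-1} - (1-\beta)\,(m_{t-1} - g_t) \\
  &= \beta\, m_{t-1} + (1-\beta)\, g_t \;=\; m_t
\end{aligned}
```

In this view each optimizer step is an online learning step for a tiny linear model, which is what licenses replacing the stored state with a compressed approximation of that model.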

Methodology

  1. EMA as Online Linear Regression

    • The paper starts by expressing the exponential moving average (EMA) of gradients—core to Adam’s first‑order momentum—as the solution of an online linear regression problem. In this view, each optimizer step updates a tiny linear model that tries to predict the current gradient from past gradients.
  2. Low‑Rank Approximation of the Momentum Matrix

    • Instead of storing a momentum matrix with the same shape as each weight matrix, LoRA‑Pre factorises it into two smaller matrices U and V (a rank‑r approximation). The product U·Vᵀ reconstructs the momentum estimate on the fly.
    • Updating U and V follows the same online gradient‑flow dynamics derived from the EMA interpretation, ensuring the low‑rank representation stays faithful to the original optimizer dynamics.
  3. Integration with Existing Training Pipelines

    • LoRA‑Pre is a drop‑in replacement for Adam/Muon: the optimizer API stays unchanged, only the internal state handling differs.
    • During pre‑training, the rank r is treated as a hyper‑parameter; the authors find that even very low ranks (e.g., r = 8 for a 1 B model) suffice.
  4. Evaluation Protocol

    • Models from the Llama family (60 M, 160 M, 410 M, 1 B) are pre‑trained on standard language modeling corpora.
    • Fine‑tuning experiments use downstream benchmarks (e.g., Alpaca, MMLU) to compare against LoRA, QLoRA, and other parameter‑efficient methods.
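
Steps 1 and 2 above can be sketched together: maintain factors U and V whose product tracks the gradient EMA, and update them with the same (1−β)-sized gradient‑flow step that defines the EMA itself. The following minimal NumPy sketch illustrates the idea; the function name and the exact factor update are my illustration, not the paper's released code.

```python
import numpy as np

def lowrank_momentum_step(U, V, G, beta=0.9):
    """One illustrative LoRA-Pre-style update: nudge factors U (m x r) and
    V (n x r) so that U @ V.T tracks the EMA of gradients G (m x n).

    The step size (1 - beta) is the EMA mixing weight, applied as a gradient
    step on the 'online regression' loss 0.5 * ||U V^T - G||^2.
    """
    R = U @ V.T - G              # residual between reconstruction and gradient
    lr = 1.0 - beta              # EMA weight doubles as the step size
    U_new = U - lr * (R @ V)     # gradient of the loss w.r.t. U is R @ V
    V_new = V - lr * (R.T @ U)   # gradient of the loss w.r.t. V is R^T @ U
    return U_new, V_new

# Demo: track a fixed rank-2 "gradient" with rank-2 factors.
rng = np.random.default_rng(0)
m, n, r = 8, 6, 2
G = rng.standard_normal((m, r)) @ rng.standard_normal((n, r)).T
U = 0.1 * rng.standard_normal((m, r))
V = 0.1 * rng.standard_normal((n, r))
for _ in range(300):
    U, V = lowrank_momentum_step(U, V, G, beta=0.9)
print("residual:", np.linalg.norm(U @ V.T - G))
```

With a slowly changing gradient stream, the same update lets the factors follow the momentum trajectory while storing only r·(m+n) values.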

Results & Findings

| Model | Baseline (Adam), MMLU | LoRA‑Pre (rank = 1/8), MMLU | Optimizer Memory Reduction |
|---|---|---|---|
| Llama‑60M | 31.2 % | 33.1 % | 87 % |
| Llama‑410M | 39.8 % | 41.5 % | 88 % |
| Llama‑1B | 44.0 % | 45.6 % | 87 % |
  • Pre‑training: LoRA‑Pre consistently beats the full‑memory Adam baseline despite using far fewer optimizer states.
  • Fine‑tuning: With the same low rank, LoRA‑Pre outperforms standard LoRA by 3.14 pts (Llama‑3.1‑8B) and 6.17 pts (Llama‑2‑7B).
  • Rank Efficiency: Experiments show diminishing returns beyond rank ≈ 1/8 of the full dimension, confirming that most of the useful momentum information lives in a low‑dimensional subspace.
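
The ~87–88 % memory figures are consistent with simple factor‑size arithmetic: a rank‑r factorisation of an m×n momentum matrix stores r·(m+n) values instead of m·n. For a square layer this leaves a fraction 2r/n of the original state. A quick back‑of‑envelope check (the hidden size 4096 and rank 256 are my illustrative choices, not the paper's configuration):

```python
def momentum_memory_reduction(m, n, r):
    """Fraction of optimizer-state memory saved by storing rank-r factors
    U (m x r) and V (n x r) instead of a full m x n momentum matrix."""
    full = m * n
    factored = r * (m + n)
    return 1.0 - factored / full

# Hypothetical square layer: hidden size 4096, rank 256.
print(momentum_memory_reduction(4096, 4096, 256))  # 0.875, i.e. 87.5 % saved
```

An 87.5 % reduction lands in the range the table reports, suggesting the reported savings are dominated by the first‑moment factorisation.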

Practical Implications

  • Scale‑out on commodity hardware: By cutting optimizer memory to ~12 % of its original size, developers can train larger models on the same GPU memory budget or fit multiple experiments on a single node.
  • Faster iteration cycles: Smaller optimizer states mean less data movement between GPU and host memory, which can translate into modest speed‑ups, especially on multi‑GPU pipelines where optimizer sync is a bottleneck.
  • Parameter‑efficient fine‑tuning: LoRA‑Pre can be combined with existing PEFT (Parameter‑Efficient Fine‑Tuning) frameworks, offering a “double‑efficiency” boost—low‑rank adapter weights plus low‑rank optimizer states.
  • Simplified infrastructure: No need for custom checkpointing tricks (e.g., sharding optimizer states) because the state size is already tiny. This eases deployment on cloud platforms that charge per‑GB of GPU memory.
  • Potential for other domains: The EMA‑as‑online‑regressor insight is not limited to language models; any large‑scale training (vision, speech, reinforcement learning) that uses Adam‑style optimizers could adopt LoRA‑Pre for memory savings.

Limitations & Future Work

  • Rank selection is still heuristic: While the paper provides empirical guidance, an automated method to choose the optimal rank per model or per layer would make the approach more plug‑and‑play.
  • Compatibility with second‑order moments: LoRA‑Pre focuses on the first‑order momentum (EMA). Extending the low‑rank idea to Adam’s variance term (second‑order moment) could yield further savings but is left for future investigation.
  • Benchmarks limited to Llama family: The authors evaluate on Llama models up to 1 B parameters; testing on truly massive models (tens of billions) and on other architectures (e.g., GPT‑NeoX, T5) would strengthen the claim of universal applicability.
  • Potential numerical stability concerns: Low‑rank factorisation may introduce conditioning issues for very deep or sparsely‑updated layers; the paper notes occasional divergence when rank is set too low, suggesting a need for robust safeguards.

If you’re interested in trying LoRA‑Pre yourself, the authors have open‑sourced the code at https://github.com/mrflogs/LoRA-Pre. Plug it into your existing PyTorch training script, set the desired rank, and you should see immediate memory savings without sacrificing model quality.

Authors

  • Zhengbo Wang
  • Jian Liang
  • Ran He
  • Zilei Wang
  • Tieniu Tan

Paper Information

  • arXiv ID: 2602.24283v1
  • Categories: cs.LG, cs.AI, cs.CL
  • Published: February 27, 2026
