[Paper] OptEMA: Adaptive Exponential Moving Average for Stochastic Optimization with Zero-Noise Optimality
Source: arXiv - 2603.09923v1
Overview
The paper introduces OptEMA, a new way to tune the exponential moving average (EMA) that sits at the heart of Adam‑style optimizers. By making the EMA coefficient adaptive and dependent on the optimization trajectory, OptEMA closes the gap between theory and practice—delivering optimal convergence when there is no stochastic noise and strong, noise‑aware guarantees otherwise, all without needing hand‑crafted Lipschitz constants.
Key Contributions
- Adaptive EMA coefficient: Proposes two variants (OptEMA‑M and OptEMA‑V) that dynamically adjust the decay rate of the first‑order or second‑order moment, respectively.
- Closed‑loop, Lipschitz‑free design: Effective stepsizes are computed on‑the‑fly from the observed gradients, eliminating the need for a priori knowledge of smoothness constants.
- Unified convergence theory: Proves a noise‑adaptive bound
[ \widetilde{\mathcal{O}}\big(T^{-1/2}+ \sigma^{1/2} T^{-1/4}\big) ]
for the average gradient norm, which collapses to the near‑optimal deterministic rate (\widetilde{\mathcal{O}}(T^{-1/2})) when the stochastic variance (\sigma = 0).
- Minimal assumptions: Works under the standard stochastic‑gradient assumptions (smoothness, a lower‑bounded objective, unbiased gradients with bounded variance) and requires neither bounded gradients nor a known Lipschitz constant.
- Practical hyperparameter robustness: The adaptive EMA removes the need for manual retuning when the noise level changes (e.g., switching from pre‑training to fine‑tuning).
Methodology
- Problem setting – Minimize a smooth, possibly non‑convex loss (f(x)) using stochastic gradients (g_t) with (\mathbb{E}[g_t]=\nabla f(x_t)) and (\mathbb{E}\|g_t-\nabla f(x_t)\|^2\le\sigma^2).
- EMA backbone – Classic Adam keeps exponential averages of the first moment (m_t) and second moment (v_t):
[ m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t,\qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 . ]
- OptEMA twist – Instead of fixing (\beta_1,\beta_2), OptEMA computes a trajectory‑dependent decay (\beta_t) that shrinks as the algorithm progresses:
- OptEMA‑M: (\beta_t) is applied to the first‑order moment (m_t) while the second‑order decay stays constant.
- OptEMA‑V: The roles are swapped; the adaptive decay is applied to (v_t).
The decay is chosen so that the effective stepsize (\alpha_t = \eta / \sqrt{v_t}) (or its analogue) automatically scales with the observed gradient magnitude, mimicking a closed‑loop controller.
- Analysis technique – The authors bound the expected decrease of a Lyapunov function that couples the objective value with the EMA states. By carefully tracking how the adaptive decay reacts to gradient variance, they derive the unified rate above, with separate terms that dominate in the high‑noise and zero‑noise regimes.
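Schematically, Lyapunov arguments of this kind bound a potential that mixes the suboptimality gap with the EMA tracking error. The sketch below shows only the generic shape of such an argument; the potential, coefficients (a_t, b_t, c_t), and coupling terms are illustrative, not the paper's exact construction:

```latex
% Generic shape of a Lyapunov descent argument for EMA-based methods
% (coefficients a_t, b_t, c_t are schematic placeholders):
\Phi_t \;=\; f(x_t) - f^\star \;+\; c_t \,\bigl\| m_t - \nabla f(x_t) \bigr\|^2,
\qquad
\mathbb{E}\bigl[\Phi_{t+1}\bigr] \;\le\; \Phi_t \;-\; a_t \,\bigl\|\nabla f(x_t)\bigr\|^2 \;+\; b_t \,\sigma^2 .
```

Summing the inequality over (t=1,\dots,T) and averaging then bounds the mean squared gradient norm, with the accumulated (b_t\sigma^2) terms producing the noise‑dependent part of the rate.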
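The moment recursions and the OptEMA‑M twist described above can be sketched in a few lines of NumPy. The Adam backbone is standard; for OptEMA‑M, the paper's closed‑loop, gradient‑driven rule for (\beta_t) is not reproduced here, so the shrinking time‑based schedule in `optema_m_step` is purely an illustrative assumption:

```python
import numpy as np

def adam_moments(m, v, g, beta1=0.9, beta2=0.999):
    """Standard Adam EMA recursions for the first and second moments."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    return m, v

def optema_m_step(x, g, m, v, t, lr=0.05, beta2=0.999, eps=1e-8):
    """One OptEMA-M-style step: the first-moment decay beta_t shrinks over
    time (a placeholder for the paper's trajectory-dependent rule), while
    the second-moment decay stays fixed."""
    beta_t = 0.9 / (1.0 + 0.01 * t)       # illustrative shrinking decay
    m = beta_t * m + (1 - beta_t) * g     # adaptive first-moment EMA
    v = beta2 * v + (1 - beta2) * g ** 2  # fixed second-moment EMA
    x = x - lr * m / (np.sqrt(v) + eps)   # effective stepsize lr / sqrt(v_t)
    return x, m, v

# Toy run on f(x) = x^2 with exact (noise-free) gradients g = 2x.
x, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(300):
    x, m, v = optema_m_step(x, 2 * x, m, v, t)
```

The key structural point is that the effective stepsize (\eta/\sqrt{v_t}) is computed entirely from observed gradients, so no smoothness constant appears anywhere in the update.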
Results & Findings
| Setting | Convergence rate (average gradient norm) |
|---|---|
| General stochastic ((\sigma>0)) | (\widetilde{\mathcal{O}}\big(T^{-1/2}+ \sigma^{1/2} T^{-1/4}\big)) |
| Zero‑noise ((\sigma=0)) | (\widetilde{\mathcal{O}}(T^{-1/2})) (near‑optimal deterministic) |
- Noise adaptivity: When noise is high, the (\sigma^{1/2} T^{-1/4}) term dominates, matching the best known stochastic rates for Adam‑style methods.
- Deterministic optimality: In the absence of noise, OptEMA automatically behaves like a well‑tuned deterministic optimizer, achieving the (\widetilde{\mathcal{O}}(T^{-1/2})) rate without any manual step‑size scaling.
- No Lipschitz constant needed: The algorithm’s internal scaling replaces the usual (\frac{1}{L}) factor (where (L) is the smoothness constant), simplifying deployment across heterogeneous datasets and models.
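A quick arithmetic check makes the trade‑off between the two terms of the bound concrete (the constants and log factors hidden by the tilde are dropped):

```python
import math

def rate(T, sigma):
    """Leading terms of the bound T^{-1/2} + sigma^{1/2} * T^{-1/4},
    with tilde-hidden constants and log factors dropped."""
    return T ** -0.5 + math.sqrt(sigma) * T ** -0.25

# At T = 10,000 the deterministic term is T^{-1/2} = 0.01, while the
# noise term at sigma = 1 is T^{-1/4} = 0.1, an order of magnitude larger.
det = rate(10_000, sigma=0.0)
noisy = rate(10_000, sigma=1.0)
```

So for any fixed (\sigma > 0) the (\sigma^{1/2} T^{-1/4}) term eventually dominates, while at (\sigma = 0) it vanishes and only the faster (T^{-1/2}) decay remains.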
Practical Implications
- Robust training pipelines – OptEMA can serve as a drop‑in replacement for Adam in existing PyTorch/TensorFlow codebases, offering better out‑of‑the‑box performance when the data distribution changes (e.g., curriculum learning, domain shift).
- Reduced hyperparameter search – Since the EMA decay adapts automatically, practitioners can fix a single learning‑rate schedule and avoid the tedious tuning of (\beta_1,\beta_2) that typically depends on batch size or noise level.
- Better fine‑tuning – In transfer learning, the stochastic noise often drops dramatically after the initial pre‑training phase. OptEMA's zero‑noise optimality means the optimizer will automatically tighten its steps, leading to faster convergence and potentially higher final accuracy.
- Edge‑device training – On devices where exact Lipschitz constants are unknown and batch sizes vary, OptEMA's closed‑loop behavior ensures stable learning without hand‑crafted safety margins.
- Research reproducibility – The theoretical guarantees hold under the same mild assumptions used by most SGD analyses, making it easier for researchers to compare OptEMA against baseline optimizers on new benchmarks.
Limitations & Future Work
- Empirical validation – The paper focuses on theoretical guarantees; extensive experiments on large‑scale vision/NLP models would solidify confidence in real‑world settings.
- Memory overhead – Maintaining adaptive decay coefficients adds a small per‑parameter state; while negligible for most models, it could matter for extremely large sparse embeddings.
- Non‑smooth objectives – The analysis assumes smoothness; extending OptEMA to handle non‑smooth regularizers (e.g., (\ell_1) penalties) is an open direction.
- Adaptive second‑order decay – OptEMA‑V swaps the adaptive decay to the second moment, but the paper does not explore hybrid schemes where both moments adapt jointly. Future work could investigate such mixed strategies.
Overall, OptEMA offers a theoretically sound, practically appealing upgrade to the EMA machinery that powers today’s most popular stochastic optimizers. Its ability to automatically bridge the gap between noisy stochastic training and deterministic fine‑tuning makes it a promising tool for developers building robust, production‑grade machine‑learning systems.
Authors
- Ganzhao Yuan
Paper Information
- arXiv ID: 2603.09923v1
- Categories: cs.LG, math.NA, math.OC
- Published: March 10, 2026