[Paper] On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer

Published: March 10, 2026
Source: arXiv - 2603.09952v1

Overview

The paper investigates why many popular neural‑network optimizers (e.g., AdamW, Muon) become unstable when the model’s width (w) grows, and proposes a principled way to make them width‑aware. By recasting these optimizers as steepest‑descent steps under specially designed matrix operator norms, the authors derive learning‑rate scaling rules that stay stable across widths and enable seamless transfer of hyper‑parameters from small to large models.

Key Contributions

  • Operator‑norm reinterpretation of AdamW, Muon, and related optimizers as steepest‑descent under matrix norms.
  • Identification of a composability problem with standard p→q operator norms that prevents width‑independent guarantees in deep nets.
  • Introduction of mean‑normalized operator norms (p_mean → q_mean) that are layer‑wise composable and yield width‑independent smoothness bounds.
  • Derivation of learning‑rate scaling rules that recover the μ‑parameterization (μP) as a special case and support cross‑width hyper‑parameter transfer for a broad optimizer family.
  • Theoretical analysis showing Muon can suffer an O(√w) blow‑up in smoothness, while the new row‑normalized optimizers avoid this.
  • Proposal of MOGA (Matrix Operator Geometry Aware), a practical optimizer based solely on row/column normalization.
  • Empirical validation on GPT‑2 and LLaMA pre‑training, demonstrating that MOGA (especially the row‑normalized variant) matches or exceeds Muon’s performance while being faster in large‑token, low‑loss regimes.
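The summary does not reproduce the paper's exact norm definitions, but a familiar instance of a mean‑normalized operator norm is the RMS→RMS norm used in μP‑style width analyses. A minimal numerical sketch, assuming this norm is representative of the paper's construction (the specific exponents used in the paper may differ):

```python
import numpy as np

def rms(x):
    # Mean-normalized vector norm: ||x||_rms = ||x||_2 / sqrt(dim),
    # so a vector of O(1) entries has O(1) norm at every width.
    return np.linalg.norm(x) / np.sqrt(x.size)

def rms_to_rms_norm(W):
    # Induced operator norm sup_x ||W x||_rms / ||x||_rms, which works out
    # to sqrt(d_in / d_out) * (spectral norm of W).
    d_out, d_in = W.shape
    return np.sqrt(d_in / d_out) * np.linalg.norm(W, 2)

rng = np.random.default_rng(0)
for w in (64, 256, 1024):
    W = rng.standard_normal((w, w)) / np.sqrt(w)  # 1/sqrt(fan_in)-scaled init
    print(w, rms_to_rms_norm(W))  # stays O(1) as w grows
```

Under this normalization the measured norm of a randomly initialized layer stays roughly constant as the width grows, which is the kind of width‑independence the contributions above refer to.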

Methodology

  1. Geometric Lens on Optimizers – The authors view each optimizer step as solving a local steepest‑descent problem with respect to a matrix norm that measures how a parameter change affects the network’s output.
  2. Operator‑Norm Analysis – They first examine classic p→q norms (e.g., spectral, Frobenius) and prove that these norms do not compose nicely across layers, leading to width‑dependent smoothness constants.
  3. Mean‑Normalized Norms – To fix composability, they define p_mean → q_mean norms that average over rows or columns before taking the p‑th or q‑th power. This construction preserves the layer‑wise product structure of neural nets, enabling clean bounds that do not grow with width.
  4. Deriving Scaling Rules – By plugging the new norms into the steepest‑descent formulation, they obtain explicit learning‑rate scaling factors that depend only on the norm’s parameters, not on the width w. These rules naturally reduce to the μP scaling when the mean‑norm parameters are set appropriately.
  5. Optimizer Design (MOGA) – Using the derived norms, they implement a family of optimizers that apply row‑wise or column‑wise normalization to the gradient before the usual Adam‑style moment updates. No extra hyper‑parameters beyond the standard AdamW ones are needed.
  6. Empirical Evaluation – Large‑scale language‑model pre‑training experiments compare MOGA variants against Muon and vanilla AdamW across multiple widths, token budgets, and loss regimes.
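The summary does not spell out the MOGA update itself. As a hedged sketch of step 5, a row‑normalized Adam‑style update might look like the following, where the function name, the placement of the normalization before the moment updates, and the omission of weight decay are all assumptions:

```python
import numpy as np

def row_normalized_step(W, grad, m, v, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    # Hypothetical row-normalized Adam-style update: each gradient row is
    # rescaled to unit RMS before the usual moment accumulation, so step
    # sizes are measured per row rather than per matrix entry.
    row_rms = np.sqrt(np.mean(grad ** 2, axis=1, keepdims=True)) + eps
    g = grad / row_rms
    m = beta1 * m + (1 - beta1) * g          # first moment
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment
    W = W - lr * m / (np.sqrt(v) + eps)      # AdamW-style step (decay omitted)
    return W, m, v
```

Note that, consistent with the summary, this introduces no hyper‑parameters beyond the standard AdamW ones.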

Results & Findings

| Experiment | Metric | Baseline (AdamW / Muon) | MOGA (Row) | MOGA (Col) |
|---|---|---|---|---|
| GPT‑2 pre‑train (varying width) | Final validation loss | Slightly higher (≈ 0.02) | Comparable / lower (≈ 0.018) | Similar to baseline |
| LLaMA pre‑train (large token count) | Tokens to target loss | 1.8 × 10⁹ | 1.5 × 10⁹ (≈ 15 % faster) | 1.6 × 10⁹ |
| Learning‑rate transfer (small → large) | Stability (no divergence) | Frequent divergence as width ↑ | Stable across all widths | Stable, slightly less robust than row |
  • Theoretical guarantee: Row‑normalized optimizers achieve a width‑independent smoothness constant, whereas Muon can incur an O(√w) increase.
  • Speed: Because MOGA avoids the extra per‑parameter scaling cost of Muon, wall‑clock time per training step drops by ~5‑10 % in the large‑model regime.
  • Hyper‑parameter transfer: A learning rate tuned on a narrow model works out‑of‑the‑box on a model 8× wider, confirming the derived scaling rules.
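The paper's exact scaling factors are not reproduced in this summary. For the hidden matrix parameters, the μP‑style rule that the derivation reduces to scales the learning rate inversely with width; a sketch, assuming that rule applies (the function name is illustrative):

```python
def transferred_lr(base_lr, base_width, width):
    # muP-style transfer rule for hidden (matrix) parameters: lr ~ 1/width.
    # Tune base_lr once on a narrow proxy model, then rescale for any
    # target width instead of re-tuning from scratch.
    return base_lr * base_width / width

# e.g. a rate tuned on a width-256 proxy, transferred to an 8x wider model:
print(transferred_lr(3e-4, 256, 2048))
```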

Practical Implications

  • Stable scaling of models: Developers can train wider transformers without re‑tuning learning rates, reducing experimentation cycles.
  • Faster large‑scale pre‑training: MOGA’s lightweight row/column normalization incurs negligible overhead, making it attractive for massive language‑model pipelines.
  • Unified optimizer family: Existing AdamW codebases can be upgraded to MOGA by swapping in a simple pre‑gradient normalization step—no need to rewrite the optimizer core.
  • Better theoretical footing for optimizer design: The mean‑normalized operator norm framework can guide the creation of new optimizers that respect the geometry of deep networks, potentially improving convergence in other domains (e.g., vision, reinforcement learning).
  • Cross‑project reproducibility: Teams can share a single set of hyper‑parameters across models of different widths, simplifying reproducibility and deployment.
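The "swap in a pre‑gradient normalization step" upgrade described above could be as small as a hook run before the existing optimizer step. A sketch under stated assumptions: the function name, the dict‑of‑arrays gradient layout, and the pass‑through rule for non‑matrix parameters are all hypothetical, not the paper's interface:

```python
import numpy as np

def normalize_gradients(grads, eps=1e-8):
    # Hypothetical pre-gradient normalization hook: row-normalize gradients
    # of 2-D weight matrices, leave biases and other vectors untouched,
    # then hand the result to an unmodified AdamW update.
    out = {}
    for name, g in grads.items():
        if g.ndim == 2:
            row_rms = np.sqrt(np.mean(g ** 2, axis=1, keepdims=True)) + eps
            out[name] = g / row_rms
        else:
            out[name] = g
    return out
```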

Limitations & Future Work

  • Assumption of fully‑connected layers: The composability proof relies on matrix‑multiplication layers; extending the theory to convolutional or attention‑style operators requires additional work.
  • Empirical scope: Experiments focus on language models (GPT‑2, LLaMA). Validation on vision transformers, diffusion models, or graph neural networks is still open.
  • Interaction with other tricks: The paper does not explore how MOGA interacts with learning‑rate warm‑up, gradient clipping, or mixed‑precision training—areas that could affect real‑world performance.
  • Potential for adaptive norm parameters: Future research could investigate dynamically adjusting the p_mean and q_mean exponents during training to capture changing curvature.

Overall, the work offers a mathematically grounded, practically viable path to width‑aware optimization, promising smoother scaling of modern deep‑learning systems.

Authors

  • Ruihan Xu
  • Jiajin Li
  • Yiping Lu

Paper Information

  • arXiv ID: 2603.09952v1
  • Categories: cs.LG, eess.SY, math.NA, math.OC, stat.ML
  • Published: March 10, 2026