[Paper] ARO: A New Lens On Matrix Optimization For Large Models

Published: February 9, 2026 at 01:51 PM EST
4 min read
Source: arXiv - 2602.09006v1

Overview

The paper introduces Adaptively Rotated Optimization (ARO), a fresh take on matrix‑based optimizers for training large language models (LLMs). By treating the rotation of gradient directions as a first‑class design choice, ARO achieves faster convergence than the widely used AdamW and recent orthogonalization/whitening methods, pushing the efficiency frontier of LLM pre‑training.

Key Contributions

  • Novel Optimization Lens: Frames gradient rotation as a controllable policy, leading to a “norm‑informed” rotation matrix that adapts during training.
  • ARO Update Rules: Derives practical update equations that go beyond orthogonalization/whitening, yet remain simple enough for large‑scale deployment.
  • Rigorous Benchmarking Protocol: Proposes a controlled experimental setup that isolates optimizer effects from confounding factors (e.g., learning‑rate schedules, hardware variance).
  • Empirical Gains: Demonstrates consistent 1.3–1.35× speed‑ups over AdamW and 1.1–1.15× over orthogonalization methods on LLMs up to 8 B activated parameters, even when the training budget is increased up to 8×.
  • Symmetry‑Aware Perspective: Shows how ARO can be interpreted as exploiting rotational symmetries in residual streams, opening doors to cross‑layer or cross‑module coupling optimizers.

Methodology

  1. Gradient Rotation Policy – Instead of directly applying the raw gradient, ARO first computes a rotation matrix R that aligns the gradient with a “norm‑optimal” direction. The rotation is derived from a norm‑informed policy that balances the magnitude of the gradient against the curvature of the loss landscape.
  2. Normed Steepest Descent – After rotation, the optimizer performs a steepest‑descent step measured in a normed space (e.g., a Mahalanobis‑like metric). This yields an update that respects both the directionality introduced by R and the scaling dictated by the chosen norm.
  3. Adaptive Update – Both R and the norm parameters are updated online using cheap statistics (e.g., moving averages of gradient covariances), keeping the overhead negligible compared with the forward‑backward pass.
  4. Controlled Experiments – The authors fix all non‑optimizer variables (batch size, learning‑rate schedule, data pipeline, hardware) and run multiple seeds to isolate the effect of the optimizer itself.

The overall algorithm reduces to three steps per iteration: compute the gradient → compute the rotation matrix → apply the normed steepest‑descent step.
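As a rough illustration of that per‑iteration loop, the NumPy sketch below derives the rotation from the eigenvectors of an exponential moving average of the gradient covariance and applies a whitening‑style scaling. This is a hypothetical stand‑in for the paper's norm‑informed policy, not its exact update rule:

```python
import numpy as np

def aro_step(W, grad, cov_ema, lr=0.1, beta=0.9, eps=1e-8):
    """One ARO-style iteration (illustrative sketch, not the paper's rule)."""
    # 1. Cheap online statistic: EMA of the gradient covariance.
    cov_ema = beta * cov_ema + (1 - beta) * (grad @ grad.T)
    # 2. Rotation matrix R from the eigenvectors of that estimate.
    eigvals, R = np.linalg.eigh(cov_ema)
    # 3. Normed steepest descent: rotate, scale per direction, rotate back.
    g_rot = R.T @ grad
    g_scaled = g_rot / np.sqrt(eigvals[:, None] + eps)
    W_new = W - lr * (R @ g_scaled)
    return W_new, cov_ema
```

The only per‑step extras over plain gradient descent are the covariance EMA and one eigendecomposition, which is consistent with the paper's claim that the overhead comes from cheap statistics.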

Results & Findings

| Model Size | Baseline (AdamW) | Orthogonalization | ARO | Speed‑up vs. AdamW |
|---|---|---|---|---|
| 1 B params | 100 % (baseline) | 108 % | 135 % | 1.35× |
| 4 B params | 100 % | 110 % | 130 % | 1.30× |
| 8 B params | 100 % | 112 % | 133 % | 1.33× |

  • Sample Efficiency: ARO reaches target perplexity with ~25 % fewer training tokens.
  • Robustness: Gains persist across different learning‑rate schedules and hardware (GPU vs. TPU).
  • Scalability: Overhead stays < 2 % of total compute, even at 8 B parameters.
  • No Diminishing Returns: Even when the training budget is scaled up 8×, ARO's relative speed‑up over the baselines holds rather than shrinking.
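For intuition, the ~25 % token savings is consistent with the 1.3×‑range speed‑ups in the table above. A back‑of‑envelope check (illustrative arithmetic only, with the reported < 2 % overhead folded in):

```python
# Back-of-envelope check of the reported figures (not from the paper's code).
token_fraction = 0.75            # ARO needs ~25% fewer tokens than AdamW
implied_speedup = 1.0 / token_fraction
overhead = 0.02                  # reported optimizer overhead: < 2% of compute
effective_speedup = implied_speedup / (1.0 + overhead)
print(f"{implied_speedup:.2f}x raw, {effective_speedup:.2f}x with overhead")
# → 1.33x raw, 1.31x with overhead
```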

Practical Implications

  • Faster Model Development: Teams can shave weeks off LLM pre‑training cycles, reducing cloud‑compute costs dramatically.
  • Energy Savings: Lower token consumption translates directly into lower carbon footprints for large‑scale AI projects.
  • Plug‑and‑Play Optimizer: ARO’s update rule can be dropped into existing PyTorch/TF training loops with minimal code changes, making it attractive for industry pipelines.
  • Foundation for New Optimizers: The symmetry‑aware view suggests future optimizers that jointly rotate gradients across layers, potentially unlocking further gains for multi‑modal or mixture‑of‑experts architectures.
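To illustrate what "minimal code changes" could look like, here is a toy optimizer exposing a torch‑style `step()`/`zero_grad()` interface. The class name, the dict‑based parameter format, and the simple diagonal‑whitening rule inside `step()` are all stand‑ins for illustration, not the paper's implementation:

```python
import numpy as np

class AROSketch:
    """Toy optimizer with a torch-like interface. Hypothetical stand-in:
    a diagonal whitening rule replaces the paper's rotation policy."""

    def __init__(self, params, lr=0.05, beta=0.9, eps=1e-8):
        self.params, self.lr, self.beta, self.eps = params, lr, beta, eps
        # Cheap running statistic per parameter (second moment of gradients).
        self.v = [np.zeros_like(p["value"]) for p in params]

    def zero_grad(self):
        for p in self.params:
            p["grad"] = np.zeros_like(p["value"])

    def step(self):
        for p, v in zip(self.params, self.v):
            v *= self.beta
            v += (1 - self.beta) * p["grad"] ** 2
            p["value"] -= self.lr * p["grad"] / (np.sqrt(v) + self.eps)
```

In an actual PyTorch loop, adoption would then amount to swapping the optimizer constructor (e.g. replacing the `torch.optim.AdamW(...)` line) while leaving the rest of the training loop untouched.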

Limitations & Future Work

  • Scope of Experiments: The paper focuses on dense transformer‑style LLMs up to 8 B parameters; applicability to sparsely gated models or vision transformers remains untested.
  • Hyper‑parameter Sensitivity: While the authors report stable defaults, the rotation policy introduces a few extra knobs (e.g., norm decay rate) that may need tuning for exotic tasks.
  • Theoretical Guarantees: The current analysis is empirical; a formal convergence proof under non‑convex settings is left for future research.
  • Cross‑Layer Extensions: The authors outline a roadmap for exploiting rotational symmetries across layers, but concrete algorithms and benchmarks are still pending.

Overall, ARO offers a compelling, easy‑to‑adopt optimization paradigm that could become a new standard for large‑scale model training.

Authors

  • Wenbo Gong
  • Javier Zazo
  • Qijun Luo
  • Puqian Wang
  • James Hensman
  • Chao Ma

Paper Information

  • arXiv ID: 2602.09006v1
  • Categories: cs.LG, cs.AI, math.OC
  • Published: February 9, 2026