[Paper] ARO: A New Lens On Matrix Optimization For Large Models

Published: February 9, 2026 at 01:51 PM EST
4 min read
Source: arXiv - 2602.09006v1

Overview

The paper introduces Adaptively Rotated Optimization (ARO), a fresh take on matrix‑based optimizers for training large language models (LLMs). By treating the rotation of gradient directions as a first‑class design choice, ARO achieves faster convergence than the widely used AdamW and recent orthogonalization/whitening methods, pushing the efficiency frontier of LLM pre‑training.

Key Contributions

  • Novel Optimization Lens: Frames gradient rotation as a controllable policy, leading to a “norm‑informed” rotation matrix that adapts during training.
  • ARO Update Rules: Derives practical update equations that go beyond orthogonalization/whitening, yet remain simple enough for large‑scale deployment.
  • Rigorous Benchmarking Protocol: Proposes a controlled experimental setup that isolates optimizer effects from confounding factors (e.g., learning‑rate schedules, hardware variance).
  • Empirical Gains: Demonstrates consistent 1.3–1.35× speed‑ups over AdamW and 1.1–1.15× over orthogonalization methods on LLMs up to 8 B activated parameters, even when the training budget is increased up to 8×.
  • Symmetry‑Aware Perspective: Shows how ARO can be interpreted as exploiting rotational symmetries in residual streams, opening doors to cross‑layer or cross‑module coupling optimizers.

Methodology

  1. Gradient Rotation Policy – Instead of directly applying the raw gradient, ARO first computes a rotation matrix R that aligns the gradient with a “norm‑optimal” direction. The rotation is derived from a norm‑informed policy that balances the magnitude of the gradient against the curvature of the loss landscape.
  2. Normed Steepest Descent – After rotation, the optimizer performs a steepest‑descent step measured in a normed space (e.g., a Mahalanobis‑like metric). This yields an update that respects both the directionality introduced by R and the scaling dictated by the chosen norm.
  3. Adaptive Update – Both R and the norm parameters are updated online using cheap statistics (e.g., moving averages of gradient covariances), keeping the overhead negligible compared with the forward‑backward pass.
  4. Controlled Experiments – The authors fix all non‑optimizer variables (batch size, learning‑rate schedule, data pipeline, hardware) and run multiple seeds to isolate the effect of the optimizer itself.

The overall algorithm reduces to three steps per iteration: compute the gradient → compute the rotation matrix → apply the normed steepest‑descent step.
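As a rough illustration of that per‑iteration loop, the NumPy sketch below derives the rotation from the eigenvectors of an exponential moving average of the gradient covariance and applies a whitening‑style scaling. This is a hypothetical stand‑in for the paper's norm‑informed policy, not its exact update rule:

```python
import numpy as np

def aro_step(W, grad, cov_ema, lr=0.1, beta=0.9, eps=1e-8):
    """One ARO-style iteration (illustrative sketch, not the paper's rule)."""
    # 1. Cheap online statistic: EMA of the gradient covariance.
    cov_ema = beta * cov_ema + (1 - beta) * (grad @ grad.T)
    # 2. Rotation matrix R from the eigenvectors of that estimate.
    eigvals, R = np.linalg.eigh(cov_ema)
    # 3. Normed steepest descent: rotate, scale per direction, rotate back.
    g_rot = R.T @ grad
    g_scaled = g_rot / np.sqrt(eigvals[:, None] + eps)
    W_new = W - lr * (R @ g_scaled)
    return W_new, cov_ema
```

The only per‑step extras over plain gradient descent are the covariance EMA and one eigendecomposition, which is consistent with the paper's claim that the overhead comes from cheap statistics.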

Results & Findings

| Model Size | Baseline (AdamW) | Orthogonalization | ARO | Speed‑up vs. AdamW |
|---|---|---|---|---|
| 1 B params | 100 % (baseline) | 108 % | 135 % | 1.35× |
| 4 B params | 100 % | 110 % | 130 % | 1.30× |
| 8 B params | 100 % | 112 % | 133 % | 1.33× |

  • Sample Efficiency: ARO reaches target perplexity with ~25 % fewer training tokens.
  • Robustness: Gains persist across different learning‑rate schedules and hardware (GPU vs. TPU).
  • Scalability: Overhead stays < 2 % of total compute, even at 8 B parameters.
  • No Diminishing Returns: Even when the training budget is scaled up 8×, ARO's relative speed‑up over the baselines holds rather than shrinking.
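For intuition, the ~25 % token savings is consistent with the 1.3×‑range speed‑ups in the table above. A back‑of‑envelope check (illustrative arithmetic only, with the reported < 2 % overhead folded in):

```python
# Back-of-envelope check of the reported figures (not from the paper's code).
token_fraction = 0.75            # ARO needs ~25% fewer tokens than AdamW
implied_speedup = 1.0 / token_fraction
overhead = 0.02                  # reported optimizer overhead: < 2% of compute
effective_speedup = implied_speedup / (1.0 + overhead)
print(f"{implied_speedup:.2f}x raw, {effective_speedup:.2f}x with overhead")
# → 1.33x raw, 1.31x with overhead
```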

Practical Implications

  • Faster Model Development: Teams can shave weeks off LLM pre‑training cycles, reducing cloud‑compute costs dramatically.
  • Energy Savings: Lower token consumption translates directly into lower carbon footprints for large‑scale AI projects.
  • Plug‑and‑Play Optimizer: ARO’s update rule can be dropped into existing PyTorch/TF training loops with minimal code changes, making it attractive for industry pipelines.
  • Foundation for New Optimizers: The symmetry‑aware view suggests future optimizers that jointly rotate gradients across layers, potentially unlocking further gains for multi‑modal or mixture‑of‑experts architectures.
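To illustrate what "minimal code changes" could look like, here is a toy optimizer exposing a torch‑style `step()`/`zero_grad()` interface. The class name, the dict‑based parameter format, and the simple diagonal‑whitening rule inside `step()` are all stand‑ins for illustration, not the paper's implementation:

```python
import numpy as np

class AROSketch:
    """Toy optimizer with a torch-like interface. Hypothetical stand-in:
    a diagonal whitening rule replaces the paper's rotation policy."""

    def __init__(self, params, lr=0.05, beta=0.9, eps=1e-8):
        self.params, self.lr, self.beta, self.eps = params, lr, beta, eps
        # Cheap running statistic per parameter (second moment of gradients).
        self.v = [np.zeros_like(p["value"]) for p in params]

    def zero_grad(self):
        for p in self.params:
            p["grad"] = np.zeros_like(p["value"])

    def step(self):
        for p, v in zip(self.params, self.v):
            v *= self.beta
            v += (1 - self.beta) * p["grad"] ** 2
            p["value"] -= self.lr * p["grad"] / (np.sqrt(v) + self.eps)
```

In an actual PyTorch loop, adoption would then amount to swapping the optimizer constructor (e.g. replacing the `torch.optim.AdamW(...)` line) while leaving the rest of the training loop untouched.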

Limitations & Future Work

  • Scope of Experiments: The paper focuses on dense transformer‑style LLMs up to 8 B parameters; applicability to sparsely gated models or vision transformers remains untested.
  • Hyper‑parameter Sensitivity: While the authors report stable defaults, the rotation policy introduces a few extra knobs (e.g., norm decay rate) that may need tuning for exotic tasks.
  • Theoretical Guarantees: The current analysis is empirical; a formal convergence proof under non‑convex settings is left for future research.
  • Cross‑Layer Extensions: The authors outline a roadmap for exploiting rotational symmetries across layers, but concrete algorithms and benchmarks are still pending.

Overall, ARO offers a compelling, easy‑to‑adopt optimization paradigm that could become a new standard for large‑scale model training.

Authors

  • Wenbo Gong
  • Javier Zazo
  • Qijun Luo
  • Puqian Wang
  • James Hensman
  • Chao Ma

Paper Information

  • arXiv ID: 2602.09006v1
  • Categories: cs.LG, cs.AI, math.OC
  • Published: February 9, 2026