[Paper] TEON: Tensorized Orthonormalization Beyond Layer-Wise Muon for Large Language Model Pre-Training
Source: arXiv - 2601.23261v1
Overview
The paper introduces TEON (Tensorized Orthonormalization), a new optimizer that extends the successful Muon technique from per‑layer matrix orthogonalization to a full‑network, tensor‑level treatment of gradients. By doing so, TEON offers tighter convergence guarantees and consistently lower perplexity when pre‑training large language models (LLMs) ranging from 60 M to 1 B parameters.
Key Contributions
- Tensor‑level orthogonalization: Generalizes Muon’s layer‑wise matrix orthogonalization to a structured higher‑order tensor that captures inter‑layer gradient relationships.
- Theoretical improvement: Provides a stronger convergence bound than Muon, showing that global orthogonalization reduces gradient variance more effectively.
- Practical TEON algorithm: Derives a computationally tractable instantiation using approximate SVD (e.g., randomized power iteration) and demonstrates that it works with standard deep‑learning toolkits.
- Extensive empirical validation: Benchmarks on GPT‑style (130 M–774 M) and LLaMA‑style (60 M–1 B) models, achieving lower training/validation perplexity across all scales.
- Robustness analysis: Shows TEON remains effective under a variety of low‑rank SVD approximations, making it suitable for large‑scale distributed training.
Methodology
- Gradient Tensor Construction – Instead of treating each layer’s gradient matrix $G_\ell$ in isolation, TEON stacks them into a 3‑D tensor $\mathcal{G} \in \mathbb{R}^{L \times d_{\text{in}} \times d_{\text{out}}}$ (layer, input dim, output dim).
- Tensor Orthogonalization – TEON seeks an orthonormal basis $\mathcal{Q}$ for $\mathcal{G}$ such that $\mathcal{Q}^\top \mathcal{Q} = I$. This is achieved by applying a higher‑order singular value decomposition (HOSVD) or a cheaper randomized approximation.
- Update Rule – The optimizer projects the raw gradient (or momentum) onto the orthonormal basis, yielding an orthogonalized gradient $\tilde{\mathcal{G}} = \mathcal{Q}\mathcal{Q}^\top \mathcal{G}$. Standard Adam‑like step sizes are then applied.
- Approximation Strategies – To keep the cost manageable, the authors experiment with:
- Randomized power iteration (few iterations) for each mode of the tensor.
- Low‑rank truncation (keeping only top‑k singular components).
- Layer‑wise fallback (reverting to Muon when tensor cost exceeds a threshold).
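The first two approximation strategies can be illustrated with generic NumPy routines. The sketch below implements randomized power (subspace) iteration to find an approximate orthonormal basis for the top‑k singular subspace of a mode unfolding, plus the low‑rank projection $QQ^\top G$ used in the update rule; function names and defaults are illustrative, not the authors' exact code:

```python
import numpy as np

def randomized_topk_basis(M, k, n_iter=2, seed=0):
    """Approximate an orthonormal basis for the top-k left singular
    subspace of M via randomized power (subspace) iteration.
    Generic sketch of the kind of approximation the paper mentions."""
    rng = np.random.default_rng(seed)
    # Random sketch of the range of M, then re-orthonormalize via QR.
    Q = np.linalg.qr(M @ rng.standard_normal((M.shape[1], k)))[0]
    for _ in range(n_iter):
        # One power-iteration step (M M^T) Q, followed by QR for stability.
        Q = np.linalg.qr(M @ (M.T @ Q))[0]
    return Q  # shape (m, k), with Q^T Q = I

def project(G_unfolded, Q):
    """Low-rank orthogonal projection Q Q^T G of an unfolded gradient."""
    return Q @ (Q.T @ G_unfolded)
```

Only a handful of power iterations are typically needed, which keeps the cost far below an exact SVD while still capturing the dominant gradient directions.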
The resulting algorithm adds a modest overhead (≈ 5–10 % extra compute) while preserving the memory footprint of typical Adam‑style optimizers.
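Putting the steps together, here is a minimal end‑to‑end sketch of a tensorized update, assuming all layers share the same weight shape. Muon orthogonalizes each layer's gradient via its polar factor $UV^\top$; the sketch applies the same operation to a single cross‑layer unfolding of the stacked tensor. The specific unfolding and the use of an exact SVD are illustrative choices, not necessarily the paper's:

```python
import numpy as np

def orthogonalize(M):
    """Polar factor U V^T of M (all singular values replaced by 1),
    the matrix-level orthogonalization used by Muon."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def teon_step(grads, lr=0.02):
    """Hypothetical TEON-style update: stack L gradient matrices into a
    3-D tensor, orthogonalize one unfolding jointly across layers,
    refold, and return the per-layer updates."""
    G = np.stack(grads)                  # (L, d_in, d_out)
    L, d_in, d_out = G.shape
    M = G.reshape(L * d_in, d_out)       # unfolding couples all layers
    G_orth = orthogonalize(M).reshape(L, d_in, d_out)
    return [-lr * G_orth[i] for i in range(L)]
```

In this toy form the cross‑layer coupling is what distinguishes the step from running Muon independently on each layer; a production version would replace the exact SVD with the randomized approximations described above.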
Results & Findings
| Model | Params | Optimizer | Train PPL ↓ | Val PPL ↓ | Compute Overhead |
|---|---|---|---|---|---|
| GPT‑style | 130 M | Adam | 12.4 | 13.1 | – |
| GPT‑style | 130 M | Muon | 11.8 | 12.5 | +4 % |
| GPT‑style | 130 M | TEON | 11.2 | 11.9 | +7 % |
| GPT‑style | 774 M | Adam | 7.9 | 8.3 | – |
| GPT‑style | 774 M | Muon | 7.4 | 7.8 | +4 % |
| GPT‑style | 774 M | TEON | 6.9 | 7.3 | +8 % |
| LLaMA‑style | 60 M | Adam | 14.2 | 15.0 | – |
| LLaMA‑style | 60 M | Muon | 13.5 | 14.2 | +4 % |
| LLaMA‑style | 60 M | TEON | 12.9 | 13.5 | +9 % |
| LLaMA‑style | 1 B | Adam | 8.6 | 9.0 | – |
| LLaMA‑style | 1 B | Muon | 8.1 | 8.5 | +4 % |
| LLaMA‑style | 1 B | TEON | 7.6 | 8.0 | +10 % |
- Consistent gains: TEON improves perplexity by roughly 0.5–0.6 points over Muon and 1.0–1.3 points over Adam across all model sizes.
- Scalability: The benefit grows with model size, indicating that inter‑layer gradient correlations become more pronounced in larger networks.
- Robustness: Experiments with different SVD approximations (rank‑k = 5, 10, 20) show negligible performance loss, confirming that cheap approximations are sufficient.
Practical Implications
- Faster convergence for LLM pre‑training: Developers can achieve the same or better model quality with fewer training steps, translating to cost savings on GPU/TPU clusters.
- Drop‑in replacement: TEON’s API mirrors Adam/Muon, so integrating it into existing PyTorch or JAX pipelines requires only a few lines of code.
- Better stability in low‑precision regimes: The orthogonalization step mitigates gradient explosion/vanishing, making mixed‑precision (FP16/BF16) training more reliable.
- Potential for downstream fine‑tuning: Since TEON yields a better‑initialized weight space, downstream fine‑tuning on domain‑specific data may converge faster and reach higher accuracy.
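Since the paper describes TEON's API as mirroring Adam/Muon, a drop‑in optimizer could look roughly like the toy NumPy sketch below. The class name, constructor signature, and momentum handling are all hypothetical illustrations of the "few lines of code" integration claim, not the authors' released interface:

```python
import numpy as np

class TEONSketch:
    """Toy drop-in optimizer with a minimal Adam/Muon-like interface:
    construct with a list of weight matrices, then call step(grads)."""

    def __init__(self, params, lr=0.02, momentum=0.95):
        self.params = params                           # list of ndarray weights
        self.lr = lr
        self.momentum = momentum
        self.buf = [np.zeros_like(p) for p in params]  # momentum buffers

    def step(self, grads):
        # Momentum accumulation, as in Muon.
        for b, g in zip(self.buf, grads):
            b *= self.momentum
            b += g
        # Joint orthogonalization of the stacked momentum tensor.
        G = np.stack(self.buf)
        L, d_in, d_out = G.shape
        U, _, Vt = np.linalg.svd(G.reshape(L * d_in, d_out),
                                 full_matrices=False)
        G_orth = (U @ Vt).reshape(G.shape)
        # In-place parameter update.
        for p, g in zip(self.params, G_orth):
            p -= self.lr * g
```

Usage mirrors a standard optimizer loop: `opt = TEONSketch([W1, W2]); opt.step([g1, g2])` after each backward pass, which is why swapping it in for Adam or Muon should require only local changes to a training script.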
Limitations
- Computational overhead: Although modest, the extra 5–10 % compute may still be noticeable in ultra‑large‑scale runs (tens of billions of parameters).
- Memory footprint of the gradient tensor: Stacking all layer gradients can strain memory on very deep models; the authors suggest a streaming or blockwise orthogonalization as a remedy.
- Theoretical analysis limited to convex surrogates: The convergence proof assumes a locally convex approximation; extending guarantees to the full non‑convex landscape of Transformers remains open.
Future Directions
- Exploring adaptive rank selection to further cut overhead.
- Combining TEON with other second‑order tricks (e.g., K-FAC) for even faster convergence.
- Applying tensor orthogonalization to other domains such as vision transformers or diffusion models.
Authors
- Ruijie Zhang
- Yequan Zhao
- Ziyue Liu
- Zhengyang Wang
- Dongyang Li
- Yupeng Su
- Sijia Liu
- Zheng Zhang
Paper Information
- arXiv ID: 2601.23261v1
- Categories: cs.LG, cs.AI
- Published: January 30, 2026
- PDF: Download PDF