[Paper] TEON: Tensorized Orthonormalization Beyond Layer-Wise Muon for Large Language Model Pre-Training
Source: arXiv - 2601.23261v1
Overview
The paper introduces TEON (Tensorized Orthonormalization), a new optimizer that extends the successful Muon technique from per‑layer matrix orthogonalization to a full‑network, tensor‑level treatment of gradients. By doing so, TEON offers tighter convergence guarantees and consistently lower perplexity when pre‑training large language models (LLMs) ranging from 60 M to 1 B parameters.
Key Contributions
- Tensor‑level orthogonalization: Generalizes Muon’s layer‑wise matrix orthogonalization to a structured higher‑order tensor that captures inter‑layer gradient relationships.
- Theoretical improvement: Provides a stronger convergence bound than Muon, showing that global orthogonalization reduces gradient variance more effectively.
- Practical TEON algorithm: Derives a computationally tractable instantiation using approximate SVD (e.g., randomized power iteration) and demonstrates that it works with standard deep‑learning toolkits.
- Extensive empirical validation: Benchmarks on GPT‑style (130 M–774 M) and LLaMA‑style (60 M–1 B) models, achieving lower training/validation perplexity across all scales.
- Robustness analysis: Shows TEON remains effective under a variety of low‑rank SVD approximations, making it suitable for large‑scale distributed training.
Methodology
- Gradient Tensor Construction – Instead of treating each layer’s gradient matrix $G_\ell$ in isolation, TEON stacks them into a 3‑D tensor $\mathcal{G} \in \mathbb{R}^{L \times d_{\text{in}} \times d_{\text{out}}}$ (layer, input dim, output dim).
- Tensor Orthogonalization – TEON seeks an orthonormal basis $\mathcal{Q}$ for $\mathcal{G}$ such that $\mathcal{Q}^\top \mathcal{Q} = I$. This is achieved by applying a higher‑order singular value decomposition (HOSVD) or a cheaper randomized approximation.
- Update Rule – The optimizer projects the raw gradient (or momentum) onto the orthonormal basis, yielding an orthogonalized gradient $\tilde{\mathcal{G}} = \mathcal{Q}\mathcal{Q}^\top \mathcal{G}$. Standard Adam‑like step sizes are then applied.
- Approximation Strategies – To keep the cost manageable, the authors experiment with:
- Randomized power iteration (few iterations) for each mode of the tensor.
- Low‑rank truncation (keeping only top‑k singular components).
- Layer‑wise fallback (reverting to Muon when tensor cost exceeds a threshold).
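The first two approximation strategies can be illustrated with generic NumPy routines. The sketch below implements randomized power (subspace) iteration to find an approximate orthonormal basis for the top‑k singular subspace of a mode unfolding, plus the low‑rank projection $QQ^\top G$ used in the update rule; function names and defaults are illustrative, not the authors' exact code:

```python
import numpy as np

def randomized_topk_basis(M, k, n_iter=2, seed=0):
    """Approximate an orthonormal basis for the top-k left singular
    subspace of M via randomized power (subspace) iteration.
    Generic sketch of the kind of approximation the paper mentions."""
    rng = np.random.default_rng(seed)
    # Random sketch of the range of M, then re-orthonormalize via QR.
    Q = np.linalg.qr(M @ rng.standard_normal((M.shape[1], k)))[0]
    for _ in range(n_iter):
        # One power-iteration step (M M^T) Q, followed by QR for stability.
        Q = np.linalg.qr(M @ (M.T @ Q))[0]
    return Q  # shape (m, k), with Q^T Q = I

def project(G_unfolded, Q):
    """Low-rank orthogonal projection Q Q^T G of an unfolded gradient."""
    return Q @ (Q.T @ G_unfolded)
```

Only a handful of power iterations are typically needed, which keeps the cost far below an exact SVD while still capturing the dominant gradient directions.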
The resulting algorithm adds a modest overhead (≈ 5–10 % extra compute) while preserving the memory footprint of typical Adam‑style optimizers.
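Putting the steps together, here is a minimal end‑to‑end sketch of a tensorized update, assuming all layers share the same weight shape. Muon orthogonalizes each layer's gradient via its polar factor $UV^\top$; the sketch applies the same operation to a single cross‑layer unfolding of the stacked tensor. The specific unfolding and the use of an exact SVD are illustrative choices, not necessarily the paper's:

```python
import numpy as np

def orthogonalize(M):
    """Polar factor U V^T of M (all singular values replaced by 1),
    the matrix-level orthogonalization used by Muon."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def teon_step(grads, lr=0.02):
    """Hypothetical TEON-style update: stack L gradient matrices into a
    3-D tensor, orthogonalize one unfolding jointly across layers,
    refold, and return the per-layer updates."""
    G = np.stack(grads)                  # (L, d_in, d_out)
    L, d_in, d_out = G.shape
    M = G.reshape(L * d_in, d_out)       # unfolding couples all layers
    G_orth = orthogonalize(M).reshape(L, d_in, d_out)
    return [-lr * G_orth[i] for i in range(L)]
```

In this toy form the cross‑layer coupling is what distinguishes the step from running Muon independently on each layer; a production version would replace the exact SVD with the randomized approximations described above.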
Results & Findings
| Model | Params | Optimizer | Train PPL ↓ | Val PPL ↓ | Compute Overhead |
|---|---|---|---|---|---|
| GPT‑style | 130 M | Adam | 12.4 | 13.1 | – |
| GPT‑style | 130 M | Muon | 11.8 | 12.5 | +4 % |
| GPT‑style | 130 M | TEON | 11.2 | 11.9 | +7 % |
| GPT‑style | 774 M | Adam | 7.9 | 8.3 | – |
| GPT‑style | 774 M | Muon | 7.4 | 7.8 | +4 % |
| GPT‑style | 774 M | TEON | 6.9 | 7.3 | +8 % |
| LLaMA‑style | 60 M | Adam | 14.2 | 15.0 | – |
| LLaMA‑style | 60 M | Muon | 13.5 | 14.2 | +4 % |
| LLaMA‑style | 60 M | TEON | 12.9 | 13.5 | +9 % |
| LLaMA‑style | 1 B | Adam | 8.6 | 9.0 | – |
| LLaMA‑style | 1 B | Muon | 8.1 | 8.5 | +4 % |
| LLaMA‑style | 1 B | TEON | 7.6 | 8.0 | +10 % |
- Consistent gains: TEON improves perplexity by roughly 0.5–0.6 points over Muon and 1.0–1.3 points over Adam across all model sizes.
- Scalability: The benefit grows with model size, indicating that inter‑layer gradient correlations become more pronounced in larger networks.
- Robustness: Experiments with different SVD approximations (rank‑k = 5, 10, 20) show negligible performance loss, confirming that cheap approximations are sufficient.
Practical Implications
- Faster convergence for LLM pre‑training: Developers can achieve the same or better model quality with fewer training steps, translating to cost savings on GPU/TPU clusters.
- Drop‑in replacement: TEON’s API mirrors Adam/Muon, so integrating it into existing PyTorch or JAX pipelines requires only a few lines of code.
- Better stability in low‑precision regimes: The orthogonalization step mitigates gradient explosion/vanishing, making mixed‑precision (FP16/BF16) training more reliable.
- Potential for downstream fine‑tuning: Since TEON yields a better‑initialized weight space, downstream fine‑tuning on domain‑specific data may converge faster and reach higher accuracy.
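Since the paper describes TEON's API as mirroring Adam/Muon, a drop‑in optimizer could look roughly like the toy NumPy sketch below. The class name, constructor signature, and momentum handling are all hypothetical illustrations of the "few lines of code" integration claim, not the authors' released interface:

```python
import numpy as np

class TEONSketch:
    """Toy drop-in optimizer with a minimal Adam/Muon-like interface:
    construct with a list of weight matrices, then call step(grads)."""

    def __init__(self, params, lr=0.02, momentum=0.95):
        self.params = params                           # list of ndarray weights
        self.lr = lr
        self.momentum = momentum
        self.buf = [np.zeros_like(p) for p in params]  # momentum buffers

    def step(self, grads):
        # Momentum accumulation, as in Muon.
        for b, g in zip(self.buf, grads):
            b *= self.momentum
            b += g
        # Joint orthogonalization of the stacked momentum tensor.
        G = np.stack(self.buf)
        L, d_in, d_out = G.shape
        U, _, Vt = np.linalg.svd(G.reshape(L * d_in, d_out),
                                 full_matrices=False)
        G_orth = (U @ Vt).reshape(G.shape)
        # In-place parameter update.
        for p, g in zip(self.params, G_orth):
            p -= self.lr * g
```

Usage mirrors a standard optimizer loop: `opt = TEONSketch([W1, W2]); opt.step([g1, g2])` after each backward pass, which is why swapping it in for Adam or Muon should require only local changes to a training script.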
Limitations
- Computational overhead: Although modest, the extra 5–10 % compute may still be noticeable in ultra‑large‑scale runs (tens of billions of parameters).
- Memory footprint of the gradient tensor: Stacking all layer gradients can strain memory on very deep models; the authors suggest a streaming or blockwise orthogonalization as a remedy.
- Theoretical analysis limited to convex surrogates: The convergence proof assumes a locally convex approximation; extending guarantees to the full non‑convex landscape of Transformers remains open.
Future Directions
- Exploring adaptive rank selection to further cut overhead.
- Combining TEON with other second‑order tricks (e.g., K-FAC) for even faster convergence.
- Applying tensor orthogonalization to other domains such as vision transformers or diffusion models.
Authors
- Ruijie Zhang
- Yequan Zhao
- Ziyue Liu
- Zhengyang Wang
- Dongyang Li
- Yupeng Su
- Sijia Liu
- Zheng Zhang
Paper Information
- arXiv ID: 2601.23261v1
- Categories: cs.LG, cs.AI
- Published: January 30, 2026
- PDF: Download PDF