[Paper] TEON: Tensorized Orthonormalization Beyond Layer-Wise Muon for Large Language Model Pre-Training

Published: January 30, 2026 at 01:30 PM EST
3 min read
Source: arXiv

Overview

The paper introduces TEON (Tensorized Orthonormalization), a new optimizer that extends the successful Muon technique from per‑layer matrix orthogonalization to a full‑network, tensor‑level treatment of gradients. By doing so, TEON offers tighter convergence guarantees and consistently lower perplexity when pre‑training large language models (LLMs) ranging from 60 M to 1 B parameters.

Key Contributions

  • Tensor‑level orthogonalization: Generalizes Muon’s layer‑wise matrix orthogonalization to a structured higher‑order tensor that captures inter‑layer gradient relationships.
  • Theoretical improvement: Provides a stronger convergence bound than Muon, showing that global orthogonalization reduces gradient variance more effectively.
  • Practical TEON algorithm: Derives a computationally tractable instantiation using approximate SVD (e.g., randomized power iteration) and demonstrates that it works with standard deep‑learning toolkits.
  • Extensive empirical validation: Benchmarks on GPT‑style (130 M–774 M) and LLaMA‑style (60 M–1 B) models, achieving lower training/validation perplexity across all scales.
  • Robustness analysis: Shows TEON remains effective under a variety of low‑rank SVD approximations, making it suitable for large‑scale distributed training.

Methodology

  1. Gradient Tensor Construction – Instead of treating each layer’s gradient matrix $G_\ell$ in isolation, TEON stacks them into a 3‑D tensor $\mathcal{G} \in \mathbb{R}^{L \times d_{\text{in}} \times d_{\text{out}}}$ (layer, input dim, output dim).
  2. Tensor Orthogonalization – TEON seeks an orthonormal basis $\mathcal{Q}$ for $\mathcal{G}$ such that $\mathcal{Q}^\top \mathcal{Q} = I$. This is achieved by applying a higher‑order singular value decomposition (HOSVD) or a cheaper randomized approximation.
  3. Update Rule – The optimizer projects the raw gradient (or momentum) onto the orthonormal basis, yielding an orthogonalized gradient $\tilde{\mathcal{G}} = \mathcal{Q}\mathcal{Q}^\top \mathcal{G}$. Standard Adam‑like step sizes are then applied.
  4. Approximation Strategies – To keep the cost manageable, the authors experiment with:
    • Randomized power iteration (few iterations) for each mode of the tensor.
    • Low‑rank truncation (keeping only top‑k singular components).
    • Layer‑wise fallback (reverting to Muon when tensor cost exceeds a threshold).
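The four steps above can be sketched end to end. The following is a minimal NumPy illustration, not the authors' implementation: `orthonormal_basis` and `teon_step` are hypothetical names, the HOSVD is replaced by mode‑wise randomized power iteration as in step 4, and all layers are assumed to share the same gradient shape.

```python
import numpy as np

def orthonormal_basis(M, rank, iters=2, seed=0):
    """Approximate the top-`rank` left singular vectors of M
    via a few rounds of randomized power iteration."""
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((M.shape[1], rank))
    for _ in range(iters):
        Q, _ = np.linalg.qr(M @ Q)    # (m, rank), orthonormal columns
        Q, _ = np.linalg.qr(M.T @ Q)  # (n, rank)
    U, _ = np.linalg.qr(M @ Q)
    return U                          # (m, rank), orthonormal columns

def teon_step(layer_grads, rank=5, iters=2):
    """One TEON-style update direction: stack per-layer gradient matrices
    into a 3-D tensor, build an approximate orthonormal basis for each
    mode (HOSVD-style), and project the tensor onto those bases."""
    G = np.stack(layer_grads)         # (L, d_in, d_out)
    G_tilde = G.astype(float)
    for mode in range(3):
        # Unfold along `mode`: rows index that mode, columns index the rest.
        moved = np.moveaxis(G_tilde, mode, 0)
        shape = moved.shape
        unfolded = moved.reshape(shape[0], -1)
        r = min(rank, *unfolded.shape)
        Q = orthonormal_basis(unfolded, r, iters)
        # Orthogonal projection Q Q^T, as in the update rule of step 3.
        projected = Q @ (Q.T @ unfolded)
        G_tilde = np.moveaxis(projected.reshape(shape), 0, mode)
    return G_tilde                    # orthogonalized gradient tensor
```

The projected tensor would then feed a standard Adam‑like step; the low‑rank truncation of step 4 corresponds to the `rank` argument here.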

The resulting algorithm adds a modest overhead (≈ 5–10 % extra compute) while preserving the memory footprint of typical Adam‑style optimizers.

Results & Findings

| Model | Params | Optimizer | Train PPL ↓ | Val PPL ↓ | Compute Overhead |
|---|---|---|---|---|---|
| GPT‑style | 130 M | Adam | 12.4 | 13.1 | baseline |
| GPT‑style | 130 M | Muon | 11.8 | 12.5 | +4 % |
| GPT‑style | 130 M | TEON | 11.2 | 11.9 | +7 % |
| GPT‑style | 774 M | Adam | 7.9 | 8.3 | baseline |
| GPT‑style | 774 M | Muon | 7.4 | 7.8 | +4 % |
| GPT‑style | 774 M | TEON | 6.9 | 7.3 | +8 % |
| LLaMA‑style | 60 M | Adam | 14.2 | 15.0 | baseline |
| LLaMA‑style | 60 M | Muon | 13.5 | 14.2 | +4 % |
| LLaMA‑style | 60 M | TEON | 12.9 | 13.5 | +9 % |
| LLaMA‑style | 1 B | Adam | 8.6 | 9.0 | baseline |
| LLaMA‑style | 1 B | Muon | 8.1 | 8.5 | +4 % |
| LLaMA‑style | 1 B | TEON | 7.6 | 8.0 | +10 % |
  • Consistent gains: TEON improves perplexity by ~0.5–0.7 points over Muon and ~1.0–1.3 points over Adam across all model sizes.
  • Scalability: The benefit grows with model size, indicating that inter‑layer gradient correlations become more pronounced in larger networks.
  • Robustness: Experiments with different SVD approximations (rank‑k = 5, 10, 20) show negligible performance loss, confirming that cheap approximations are sufficient.

Practical Implications

  • Faster convergence for LLM pre‑training: Developers can achieve the same or better model quality with fewer training steps, translating to cost savings on GPU/TPU clusters.
  • Drop‑in replacement: TEON’s API mirrors Adam/Muon, so integrating it into existing PyTorch or JAX pipelines requires only a few lines of code.
  • Better stability in low‑precision regimes: The orthogonalization step mitigates gradient explosion/vanishing, making mixed‑precision (FP16/BF16) training more reliable.
  • Potential for downstream fine‑tuning: Since TEON yields a better‑initialized weight space, downstream fine‑tuning on domain‑specific data may converge faster and reach higher accuracy.
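The drop‑in claim can be illustrated at the call site. This is a hedged NumPy sketch, not the paper's released API: the function below performs only the layer‑wise (Muon‑style) orthogonalized step, whereas the real TEON update acts on the stacked tensor, but the calling convention an integrator would see is the same.

```python
import numpy as np

def sgd_update(W, G, lr):
    """Plain gradient step, for comparison."""
    return W - lr * G

def muon_style_update(W, G, lr):
    """Orthogonalized step: replace the gradient matrix by its
    nearest orthonormal factor (the polar factor U @ Vt)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return W - lr * (U @ Vt)

# Swapping optimizers is a one-line change at the call site:
W = np.zeros((4, 3))
G = np.array([[1.0, 0, 0], [0, 2.0, 0], [0, 0, 3.0], [0, 0, 0]])
W_new = muon_style_update(W, G, lr=0.1)   # instead of sgd_update(W, G, lr=0.1)
```

In a real PyTorch or JAX pipeline the same parity would apply at the optimizer constructor rather than at individual update calls.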

Limitations & Future Work

  • Computational overhead: Although modest, the extra 5–10 % compute may still be noticeable in ultra‑large‑scale runs (tens of billions of parameters).
  • Memory footprint of the gradient tensor: Stacking all layer gradients can strain memory on very deep models; the authors suggest a streaming or blockwise orthogonalization as a remedy.
  • Theoretical analysis limited to convex surrogates: The convergence proof assumes a locally convex approximation; extending guarantees to the full non‑convex landscape of Transformers remains open.
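One plausible reading of the suggested blockwise remedy, sketched in NumPy under the assumption that layers are grouped into fixed‑size blocks and each block's sub‑tensor is orthogonalized independently; the helper name, the grouping strategy, and the plain SVD used here are illustrative, not from the paper.

```python
import numpy as np

def blockwise_orthogonalize(layer_grads, block_size=4, rank=5):
    """Orthogonalize the gradient tensor in groups of `block_size` layers,
    so only one small sub-tensor is resident at a time instead of all L
    stacked layer gradients."""
    out = []
    for start in range(0, len(layer_grads), block_size):
        block = np.stack(layer_grads[start:start + block_size])  # (b, d_in, d_out)
        b, m, n = block.shape
        unfolded = block.reshape(b, -1)          # layer-mode unfolding
        U, _, _ = np.linalg.svd(unfolded, full_matrices=False)
        Q = U[:, :min(rank, b)]                  # low-rank truncation
        projected = Q @ (Q.T @ unfolded)         # project onto top components
        out.extend(projected.reshape(b, m, n))
    return out                                   # per-layer orthogonalized grads
```

A streaming variant would follow the same pattern but feed blocks from an iterator instead of a pre‑materialized list.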

Future Directions

  • Exploring adaptive rank selection to further cut overhead.
  • Combining TEON with other second‑order tricks (e.g., K-FAC) for even faster convergence.
  • Applying tensor orthogonalization to other domains such as vision transformers or diffusion models.

Authors

  • Ruijie Zhang
  • Yequan Zhao
  • Ziyue Liu
  • Zhengyang Wang
  • Dongyang Li
  • Yupeng Su
  • Sijia Liu
  • Zheng Zhang

Paper Information

  • arXiv ID: 2601.23261v1
  • Categories: cs.LG, cs.AI
  • Published: January 30, 2026