[Paper] Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling

Published: 5 days ago (June 5, 2026 at 11:48 AM EDT)

2 min read

Source: arXiv

Source: arXiv - 2606.07404v1

Overview

This paper reports on training a hundred-billion-parameter sparse mixture of experts on a single eight-GPU node, end to end. LightningLM 0.1V is a recurrence-backbone language model family grown in four stages from a small dense seed, through a 5B and a 9B mixture of experts, to a 120B model with 460 routed experts under top-12 routing. Each larger model is grown from the trained weights of the smaller one; active parameters rise monotonically from 1.78B at the dense seed to 5.93B at 120B (about 5% of the 118.67B stored). The full lineage runs on single nodes, the larger stages at 8K context, reaching a released training loss of 1.78 at 120B scale. This is a systems and experience report. It is organized around three disciplines. Reversibility: a reversible recurrence stack reconstructs activations in the backward pass instead of storing them, holding activation memory flat as the model grows. State-preserving growth: each expansion (dense to MoE, shallow to deep, few experts to many) is given as a reproducible principle paired with the failure that results from getting it wrong; several failures are silent. Single-node economics: the 120B trains through TQP, a strategy of quantized base expert weights and trained low-rank adapters that carries optimizer state on 2.26B adapter parameters rather than 100B+ resident in routed experts, cutting expert-path optimizer state by a factor of ~45. What is new is the integration of known primitives, not any primitive in isolation: one grown lineage running end to end on a single node, documented at practitioner level, with per-domain held-out loss as evidence that targeted capabilities (multilingual Indic competence, code) were learned by construction. Model family, tokenizer, and training code are released.

Key Contributions

This paper presents research in the following areas:

cs.LG

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

This research contributes to the advancement of cs.LG.

Authors

Rohan Shravan

Paper Information

arXiv ID: 2606.07404v1
Categories: cs.LG
Published: June 5, 2026
PDF: Download PDF

[Paper] Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling

Overview

Key Contributions

Methodology

Practical Implications

Authors

Paper Information

Related posts

[Paper] How reliable are LLMs when it comes to playing dice?

[Paper] MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

[Paper] Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning

[Paper] Accelerated Decentralized Stochastic Gradient Descent for Strongly Convex Optimization