[Paper] SPARKLING: Balancing Signal Preservation and Symmetry Breaking for Width-Progressive Learning
Source: arXiv - 2602.02472v1
Overview
The paper introduces SPARKLING, a new technique for expanding a neural network’s width mid‑training without the instability that typically plagues such attempts. By carefully preserving activation statistics and deliberately breaking weight symmetry, SPARKLING lets developers grow models on the fly, cutting pre‑training compute by up to 35 % for a 2× width increase, which is especially valuable for large Mixture‑of‑Experts (MoE) systems.
Key Contributions
- Signal‑preserving initialization based on RMS‑scale consistency, keeping activation distributions stable during width expansion.
- Symmetry‑breaking strategy that resets optimizer moments asymmetrically and applies a brief learning‑rate re‑warmup, encouraging diverse feature learning after expansion.
- Comprehensive empirical validation on several MoE architectures, optimizer families (Adam, AdamW, LAMB, etc.), and multiple width‑expansion axes, showing consistent gains over training from scratch.
- Practical cost analysis demonstrating up to 35 % reduction in total training FLOPs for a 2× width increase, with negligible impact on final model quality.
Methodology
- Identify the instability point – When a model’s hidden dimension is doubled mid‑training, naïve random initialization creates a mismatch between the new neurons’ activation magnitude and the already‑trained part, causing loss spikes. Copy‑based initialization (duplicating existing weights) avoids the magnitude issue but introduces gradient symmetry: duplicated neurons receive identical updates, limiting their ability to learn distinct features (a minimal demonstration of this symmetry follows below).
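The gradient symmetry is easy to reproduce. The following is a minimal, self‑contained PyTorch sketch (not taken from the paper; the toy dimensions and loss are arbitrary) showing that when a hidden unit is duplicated by copying its incoming row and outgoing column, the original and its copy receive identical gradients, so a standard optimizer can never pull them apart:

```python
import torch

torch.manual_seed(0)

# Toy two-layer MLP: W1 maps input -> hidden, W2 maps hidden -> output.
d_in, d_hidden, d_out = 8, 4, 3
W1 = torch.randn(d_hidden, d_in)
W2 = torch.randn(d_out, d_hidden)

# Copy-based width expansion: duplicate hidden unit 0
# (row 0 of W1 and column 0 of W2).
W1_big = torch.cat([W1, W1[:1]], dim=0).requires_grad_(True)
W2_big = torch.cat([W2, W2[:, :1]], dim=1).requires_grad_(True)

x = torch.randn(16, d_in)
loss = (torch.relu(x @ W1_big.t()) @ W2_big.t()).pow(2).mean()
loss.backward()

# The duplicated unit and its source receive identical gradients,
# so they would stay identical under any standard optimizer update.
print(torch.allclose(W1_big.grad[0], W1_big.grad[-1]))        # True
print(torch.allclose(W2_big.grad[:, 0], W2_big.grad[:, -1]))  # True
```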
- Signal preservation (RMS‑scale consistency)
  - Compute the root‑mean‑square (RMS) of activations for each layer before expansion.
  - Initialize the new neurons with random weights whose scale matches the RMS of the existing activations, ensuring the forward‑pass statistics stay roughly unchanged (see the sketch below).
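A minimal sketch of such an RMS‑matched initialization for a single linear layer is shown below. It illustrates the idea rather than the paper’s exact recipe: the function name, the use of a small calibration batch, and the rescaling rule are assumptions.

```python
import torch

@torch.no_grad()
def expand_linear_rms_matched(weight: torch.Tensor,
                              x_calib: torch.Tensor,
                              n_new: int) -> torch.Tensor:
    """Append `n_new` output units to a linear layer's weight matrix so that the
    new units' pre-activations match the RMS of the existing units on a
    calibration batch `x_calib` of shape (batch, d_in). Illustrative sketch only."""
    # RMS of the existing units' pre-activations.
    target_rms = (x_calib @ weight.t()).pow(2).mean().sqrt()

    # Random rows, rescaled so their pre-activation RMS matches the target.
    new_rows = torch.randn(n_new, weight.shape[1])
    new_rms = (x_calib @ new_rows.t()).pow(2).mean().sqrt()
    new_rows *= target_rms / new_rms.clamp_min(1e-8)

    return torch.cat([weight, new_rows], dim=0)   # shape (d_out + n_new, d_in)

# Example: double the width of a 4-unit layer using a batch of 32 inputs.
W = 0.02 * torch.randn(4, 8)
W_big = expand_linear_rms_matched(W, torch.randn(32, 8), n_new=4)
```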
- Symmetry breaking
  - Asymmetric optimizer state reset: Instead of copying optimizer moments (e.g., Adam’s m and v) for the new parameters, the method re‑initializes them with small random perturbations.
  - Learning‑rate re‑warmup: After expansion, the learning rate is briefly increased from a low value back to the pre‑expansion schedule, giving the new neurons a “warm‑up” period to diverge from their copies (see the sketch below).
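The sketch below illustrates both pieces for Adam‑style optimizers. It assumes the optimizer state for the expanded parameter has already been resized (e.g., the old rows’ moments carried over); the perturbation scale and the linear warm‑up shape are illustrative choices, not the paper’s exact settings.

```python
import torch

@torch.no_grad()
def reset_new_moments(optimizer: torch.optim.Optimizer,
                      param: torch.Tensor,
                      n_old: int,
                      eps: float = 1e-3) -> None:
    """Asymmetric state reset: give the newly added rows (index >= n_old) small
    random Adam moments instead of copies of the source rows' moments."""
    state = optimizer.state[param]
    state["exp_avg"][n_old:] = eps * torch.randn_like(state["exp_avg"][n_old:])
    # Second moments must stay non-negative, hence uniform noise here.
    state["exp_avg_sq"][n_old:] = eps * torch.rand_like(state["exp_avg_sq"][n_old:])

def rewarmup_lr(base_lr: float, steps_since_expand: int, warmup_steps: int) -> float:
    """Brief linear re-warmup from a low value back to the pre-expansion schedule."""
    return base_lr * min(1.0, (steps_since_expand + 1) / warmup_steps)
```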
- Integration into the training loop
  - The expansion step can be triggered at any epoch (the paper focuses on mid‑stage expansion, e.g., after 30 % of total steps).
  - The same pipeline works for both dense and MoE layers, making it a drop‑in replacement for existing training scripts (an end‑to‑end toy example follows below).
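Putting the pieces together, the toy loop below grows the hidden layer of a small MLP at 30 % of training and re‑warms the learning rate afterwards. It is a self‑contained illustration under simplifying assumptions: the new units use PyTorch’s default random initialization and a freshly built optimizer, whereas in practice one would plug in the RMS‑matched init and asymmetric moment reset sketched above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_hidden, d_out = 8, 16, 1
total_steps, base_lr, rewarmup_steps = 200, 1e-2, 20
expand_at = int(0.3 * total_steps)   # mid-stage trigger, per the paper's guideline

fc1, fc2 = nn.Linear(d_in, d_hidden), nn.Linear(d_hidden, d_out)
opt = torch.optim.Adam([*fc1.parameters(), *fc2.parameters()], lr=base_lr)

for step in range(total_steps):
    if step == expand_at:
        # Double the hidden width, keeping the already-trained units in place.
        new_fc1 = nn.Linear(d_in, 2 * d_hidden)
        new_fc2 = nn.Linear(2 * d_hidden, d_out)
        with torch.no_grad():
            new_fc1.weight[:d_hidden] = fc1.weight
            new_fc1.bias[:d_hidden] = fc1.bias
            new_fc2.weight[:, :d_hidden] = fc2.weight
            new_fc2.bias.copy_(fc2.bias)
        fc1, fc2 = new_fc1, new_fc2
        # Rebuilding the optimizer resets all moments here for brevity; in practice
        # carry over the old units' moments and reset only the new ones.
        opt = torch.optim.Adam([*fc1.parameters(), *fc2.parameters()], lr=base_lr)

    # Brief linear re-warmup after the expansion, otherwise the normal schedule.
    scale = min(1.0, (step - expand_at + 1) / rewarmup_steps) if step >= expand_at else 1.0
    for group in opt.param_groups:
        group["lr"] = base_lr * scale

    x, y = torch.randn(32, d_in), torch.randn(32, d_out)
    loss = nn.functional.mse_loss(fc2(torch.relu(fc1(x))), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```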
Results & Findings
| Model / Setting | Training from Scratch | SPARKLING (2× width) | FLOP Savings |
|---|---|---|---|
| MoE‑BERT (12‑layer) | 76.3 % accuracy | 77.1 % accuracy | ≈35 % |
| MoE‑GPT (24‑layer) | 84.5 perplexity | 84.2 perplexity (slightly better; lower is better) | ≈30 % |
| Dense Transformer (baseline) | 78.0 % accuracy | 78.2 % accuracy | ≈20 % |
- Stability: Loss curves show no spikes after expansion, unlike naïve random or copy‑only baselines.
- Feature diversity: Gradient cosine similarity between duplicated neurons drops sharply after the re‑warmup, confirming effective symmetry breaking.
- Optimizer‑agnostic: Same gains observed with Adam, AdamW, and LAMB, indicating the approach is not tied to a specific optimizer.
Practical Implications
- Cost‑effective scaling – Teams can start training a smaller, cheaper model and double its capacity once early‑stage learning has converged, saving GPU hours and cloud spend.
- Dynamic resource allocation – In environments where GPU memory becomes available mid‑run (e.g., after other jobs finish), SPARKLING lets you “inflate” the model without restarting.
- MoE deployment – Since MoE models often have many expert branches, width expansion can be applied selectively to the most‑used experts, improving throughput for production services.
- Simplified hyper‑parameter tuning – The method works with existing learning‑rate schedules; only a short re‑warmup is needed, reducing the need for extensive retraining experiments.
Limitations & Future Work
- Scope limited to width expansion – The paper does not address simultaneous depth‑and‑width growth, which could be useful for certain architectures.
- Mid‑stage timing heuristics – While the authors provide empirical guidelines (e.g., after 30‑40 % of steps), a more principled criterion for when to expand remains open.
- Memory overhead during expansion – Temporarily storing both old and new weight matrices can double memory usage for the expanded layers, which may be problematic on memory‑constrained hardware.
- Broader architecture validation – Experiments focus on Transformer‑style MoE models; applying SPARKLING to CNNs, GNNs, or vision‑specific architectures is left for future research.
Overall, SPARKLING offers a pragmatic recipe for developers who need to upscale models on the fly while keeping training stable and cost‑effective.
Authors
- Qifan Yu
- Xinyu Ma
- Zhijian Zhuo
- Minrui Wang
- Deyi Liu
- Shiyi Zhan
- Yiyuan Ma
- Liang Xiang
- Xingyan Bin
- Di He
Paper Information
- arXiv ID: 2602.02472v1
- Categories: cs.LG, cs.CL
- Published: February 2, 2026