[Paper] SPARKLING: Balancing Signal Preservation and Symmetry Breaking for Width-Progressive Learning
Source: arXiv - 2602.02472v1
Overview
The paper introduces SPARKLING, a new technique for expanding a neural network’s width mid‑training without the instability that typically plagues such attempts. By carefully preserving activation statistics and deliberately breaking weight symmetry, SPARKLING lets developers grow models on the fly, cutting pre‑training compute by up to 35 % for a 2× width increase, which is especially valuable for large Mixture‑of‑Experts (MoE) systems.
Key Contributions
- Signal‑preserving initialization based on RMS‑scale consistency, keeping activation distributions stable during width expansion.
- Symmetry‑breaking strategy that resets optimizer moments asymmetrically and applies a brief learning‑rate re‑warmup, encouraging diverse feature learning after expansion.
- Comprehensive empirical validation on several MoE architectures, optimizer families (Adam, AdamW, LAMB, etc.), and multiple width‑expansion axes, showing consistent gains over training from scratch.
- Practical cost analysis demonstrating up to 35 % reduction in total training FLOPs for a 2× width increase, with negligible impact on final model quality.
Methodology
- Identify the instability point – When a model’s hidden dimension is doubled mid‑training, naïve random initialization creates a mismatch between the new neurons’ activation magnitude and the already‑trained part, causing loss spikes. Copy‑based initialization (duplicating existing weights) avoids the magnitude issue but introduces gradient symmetry: duplicated neurons receive identical updates, limiting their ability to learn distinct features (a minimal demonstration of this symmetry follows below).
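The gradient symmetry is easy to reproduce. The following is a minimal, self‑contained PyTorch sketch (not taken from the paper; the toy dimensions and loss are arbitrary) showing that when a hidden unit is duplicated by copying its incoming row and outgoing column, the original and its copy receive identical gradients, so a standard optimizer can never pull them apart:

```python
import torch

torch.manual_seed(0)

# Toy two-layer MLP: W1 maps input -> hidden, W2 maps hidden -> output.
d_in, d_hidden, d_out = 8, 4, 3
W1 = torch.randn(d_hidden, d_in)
W2 = torch.randn(d_out, d_hidden)

# Copy-based width expansion: duplicate hidden unit 0
# (row 0 of W1 and column 0 of W2).
W1_big = torch.cat([W1, W1[:1]], dim=0).requires_grad_(True)
W2_big = torch.cat([W2, W2[:, :1]], dim=1).requires_grad_(True)

x = torch.randn(16, d_in)
loss = (torch.relu(x @ W1_big.t()) @ W2_big.t()).pow(2).mean()
loss.backward()

# The duplicated unit and its source receive identical gradients,
# so they would stay identical under any standard optimizer update.
print(torch.allclose(W1_big.grad[0], W1_big.grad[-1]))        # True
print(torch.allclose(W2_big.grad[:, 0], W2_big.grad[:, -1]))  # True
```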
- Signal preservation (RMS‑scale consistency)
  - Compute the root‑mean‑square (RMS) of activations for each layer before expansion.
  - Initialize the new neurons with random weights whose scale matches the RMS of the existing activations, ensuring the forward‑pass statistics stay roughly unchanged (see the sketch below).
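A minimal sketch of such an RMS‑matched initialization for a single linear layer is shown below. It illustrates the idea rather than the paper’s exact recipe: the function name, the use of a small calibration batch, and the rescaling rule are assumptions.

```python
import torch

@torch.no_grad()
def expand_linear_rms_matched(weight: torch.Tensor,
                              x_calib: torch.Tensor,
                              n_new: int) -> torch.Tensor:
    """Append `n_new` output units to a linear layer's weight matrix so that the
    new units' pre-activations match the RMS of the existing units on a
    calibration batch `x_calib` of shape (batch, d_in). Illustrative sketch only."""
    # RMS of the existing units' pre-activations.
    target_rms = (x_calib @ weight.t()).pow(2).mean().sqrt()

    # Random rows, rescaled so their pre-activation RMS matches the target.
    new_rows = torch.randn(n_new, weight.shape[1])
    new_rms = (x_calib @ new_rows.t()).pow(2).mean().sqrt()
    new_rows *= target_rms / new_rms.clamp_min(1e-8)

    return torch.cat([weight, new_rows], dim=0)   # shape (d_out + n_new, d_in)

# Example: double the width of a 4-unit layer using a batch of 32 inputs.
W = 0.02 * torch.randn(4, 8)
W_big = expand_linear_rms_matched(W, torch.randn(32, 8), n_new=4)
```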
- Symmetry breaking
  - Asymmetric optimizer state reset: Instead of copying optimizer moments (e.g., Adam’s m and v) for the new parameters, the method re‑initializes them with small random perturbations.
  - Learning‑rate re‑warmup: After expansion, the learning rate is briefly increased from a low value back to the pre‑expansion schedule, giving the new neurons a “warm‑up” period to diverge from their copies (see the sketch below).
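The sketch below illustrates both pieces for Adam‑style optimizers. It assumes the optimizer state for the expanded parameter has already been resized (e.g., the old rows’ moments carried over); the perturbation scale and the linear warm‑up shape are illustrative choices, not the paper’s exact settings.

```python
import torch

@torch.no_grad()
def reset_new_moments(optimizer: torch.optim.Optimizer,
                      param: torch.Tensor,
                      n_old: int,
                      eps: float = 1e-3) -> None:
    """Asymmetric state reset: give the newly added rows (index >= n_old) small
    random Adam moments instead of copies of the source rows' moments."""
    state = optimizer.state[param]
    state["exp_avg"][n_old:] = eps * torch.randn_like(state["exp_avg"][n_old:])
    # Second moments must stay non-negative, hence uniform noise here.
    state["exp_avg_sq"][n_old:] = eps * torch.rand_like(state["exp_avg_sq"][n_old:])

def rewarmup_lr(base_lr: float, steps_since_expand: int, warmup_steps: int) -> float:
    """Brief linear re-warmup from a low value back to the pre-expansion schedule."""
    return base_lr * min(1.0, (steps_since_expand + 1) / warmup_steps)
```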
- Integration into the training loop
  - The expansion step can be triggered at any epoch (the paper focuses on mid‑stage expansion, e.g., after 30 % of total steps).
  - The same pipeline works for both dense and MoE layers, making it a drop‑in replacement for existing training scripts (an end‑to‑end toy example follows below).
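Putting the pieces together, the toy loop below grows the hidden layer of a small MLP at 30 % of training and re‑warms the learning rate afterwards. It is a self‑contained illustration under simplifying assumptions: the new units use PyTorch’s default random initialization and a freshly built optimizer, whereas in practice one would plug in the RMS‑matched init and asymmetric moment reset sketched above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_hidden, d_out = 8, 16, 1
total_steps, base_lr, rewarmup_steps = 200, 1e-2, 20
expand_at = int(0.3 * total_steps)   # mid-stage trigger, per the paper's guideline

fc1, fc2 = nn.Linear(d_in, d_hidden), nn.Linear(d_hidden, d_out)
opt = torch.optim.Adam([*fc1.parameters(), *fc2.parameters()], lr=base_lr)

for step in range(total_steps):
    if step == expand_at:
        # Double the hidden width, keeping the already-trained units in place.
        new_fc1 = nn.Linear(d_in, 2 * d_hidden)
        new_fc2 = nn.Linear(2 * d_hidden, d_out)
        with torch.no_grad():
            new_fc1.weight[:d_hidden] = fc1.weight
            new_fc1.bias[:d_hidden] = fc1.bias
            new_fc2.weight[:, :d_hidden] = fc2.weight
            new_fc2.bias.copy_(fc2.bias)
        fc1, fc2 = new_fc1, new_fc2
        # Rebuilding the optimizer resets all moments here for brevity; in practice
        # carry over the old units' moments and reset only the new ones.
        opt = torch.optim.Adam([*fc1.parameters(), *fc2.parameters()], lr=base_lr)

    # Brief linear re-warmup after the expansion, otherwise the normal schedule.
    scale = min(1.0, (step - expand_at + 1) / rewarmup_steps) if step >= expand_at else 1.0
    for group in opt.param_groups:
        group["lr"] = base_lr * scale

    x, y = torch.randn(32, d_in), torch.randn(32, d_out)
    loss = nn.functional.mse_loss(fc2(torch.relu(fc1(x))), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```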
Results & Findings
| Model / Setting | Training from Scratch | SPARKLING (2× width) | FLOP Savings |
|---|---|---|---|
| MoE‑BERT (12‑layer) | 76.3 % accuracy | 77.1 % accuracy | ≈35 % |
| MoE‑GPT (24‑layer) | 84.5 perplexity | 84.2 perplexity (slightly better; lower is better) | ≈30 % |
| Dense Transformer (baseline) | 78.0 % accuracy | 78.2 % accuracy | ≈20 % |
- Stability: Loss curves show no spikes after expansion, unlike naïve random or copy‑only baselines.
- Feature diversity: Gradient cosine similarity between duplicated neurons drops sharply after the re‑warmup, confirming effective symmetry breaking.
- Optimizer‑agnostic: Same gains observed with Adam, AdamW, and LAMB, indicating the approach is not tied to a specific optimizer.
Practical Implications
- Cost‑effective scaling – Teams can start training a smaller, cheaper model and double its capacity once early‑stage learning has converged, saving GPU hours and cloud spend.
- Dynamic resource allocation – In environments where GPU memory becomes available mid‑run (e.g., after other jobs finish), SPARKLING lets you “inflate” the model without restarting.
- MoE deployment – Since MoE models often have many expert branches, width expansion can be applied selectively to the most‑used experts, improving throughput for production services.
- Simplified hyper‑parameter tuning – The method works with existing learning‑rate schedules; only a short re‑warmup is needed, reducing the need for extensive retraining experiments.
Limitations & Future Work
- Scope limited to width expansion – The paper does not address simultaneous depth‑and‑width growth, which could be useful for certain architectures.
- Mid‑stage timing heuristics – While the authors provide empirical guidelines (e.g., after 30‑40 % of steps), a more principled criterion for when to expand remains open.
- Memory overhead during expansion – Temporarily storing both old and new weight matrices can double memory usage for the expanded layers, which may be problematic on memory‑constrained hardware.
- Broader architecture validation – Experiments focus on Transformer‑style MoE models; applying SPARKLING to CNNs, GNNs, or vision‑specific architectures is left for future research.
Overall, SPARKLING offers a pragmatic recipe for developers who need to upscale models on the fly while keeping training stable and cost‑effective.
Authors
- Qifan Yu
- Xinyu Ma
- Zhijian Zhuo
- Minrui Wang
- Deyi Liu
- Shiyi Zhan
- Yiyuan Ma
- Liang Xiang
- Xingyan Bin
- Di He
Paper Information
- arXiv ID: 2602.02472v1
- Categories: cs.LG, cs.CL
- Published: February 2, 2026