[Paper] Weight Decay Improves Language Model Plasticity
Source: arXiv - 2602.11137v1
Overview
The paper “Weight Decay Improves Language Model Plasticity” challenges the common practice of optimizing large language models (LLMs) solely for pre‑training loss. By treating plasticity—the model’s ability to adapt quickly and effectively during fine‑tuning—as a first‑class metric, the authors reveal that a simple regularizer, weight decay, can dramatically boost downstream performance, even when it slightly harms the raw pre‑training loss.
Key Contributions
- Plasticity‑focused evaluation: Introduces model plasticity as a quantitative metric for hyper‑parameter search, shifting the focus from pre‑training loss alone.
- Weight decay as a plasticity lever: Demonstrates empirically that larger weight‑decay values during pre‑training consistently yield higher fine‑tuning gains across diverse downstream tasks.
- Counter‑intuitive trade‑off analysis: Shows cases where a model with worse pre‑training perplexity outperforms a lower‑decay counterpart after fine‑tuning.
- Mechanistic insights: Provides three complementary explanations—more linearly separable representations, regularized attention matrices, and reduced over‑fitting—that together account for the observed plasticity boost.
- Practical recommendation: Suggests incorporating plasticity‑aware metrics into the hyper‑parameter optimization loop for LLM development pipelines.
Methodology
- Pre‑training regime: The authors train a family of transformer‑based language models (varying sizes from ~125 M to ~1 B parameters) on the same corpus, systematically sweeping weight‑decay values (e.g., 0.0, 0.01, 0.1). All other hyper‑parameters (learning rate, batch size, optimizer) are held constant.
- Plasticity measurement: After pre‑training, each model is fine‑tuned on a suite of downstream benchmarks (e.g., GLUE, SuperGLUE, SQuAD, and a few domain‑specific classification tasks). Plasticity is defined as the delta between the fine‑tuned performance and the base model’s zero‑shot performance, averaged across tasks.
- Analysis tools:
- Linear probing: Train a simple linear classifier on frozen hidden states to gauge linear separability.
- Attention entropy & spectral analysis: Quantify how weight decay shapes attention weight distributions.
- Training‑set memorization tests: Measure over‑fitting by checking how well the model reproduces exact training sentences after fine‑tuning.
- Statistical rigor: Each experiment is repeated with multiple random seeds; results are reported with confidence intervals and significance testing.
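Since the study centers on the weight-decay coefficient, it helps to see exactly where that coefficient enters the optimizer update. The following is a minimal, hypothetical sketch of one decoupled-weight-decay (AdamW-style) step in pure Python; the paper does not publish its training code, so the function name and signature are illustrative only.

```python
import math

def adamw_step(w, g, m, v, t, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.1):
    """One decoupled-weight-decay (AdamW-style) update over a list of weights.

    Hypothetical sketch: the paper sweeps `weight_decay` (e.g., 0.0, 0.01,
    0.1) during pre-training; this shows where the coefficient acts.
    """
    b1, b2 = betas
    for i in range(len(w)):
        m[i] = b1 * m[i] + (1 - b1) * g[i]
        v[i] = b2 * v[i] + (1 - b2) * g[i] ** 2
        m_hat = m[i] / (1 - b1 ** t)   # bias-corrected first moment
        v_hat = v[i] / (1 - b2 ** t)   # bias-corrected second moment
        # Decoupled: the decay term is applied to the weight itself,
        # not folded into the gradient as in classic L2 regularization.
        w[i] -= lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w[i])
    return w, m, v
```

The key design point is the decoupling: even with a zero gradient, weights are pulled toward the origin at a rate proportional to `weight_decay`, which is the knob the authors vary.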
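The plasticity metric itself is simple to state in code. Below is a small sketch of the averaged per-task delta described above; the function name and the dict-based interface are assumptions, not the authors' implementation.

```python
def plasticity(zero_shot, fine_tuned):
    """Average per-task gain from fine-tuning over zero-shot performance.

    Both arguments map task name -> score (e.g., accuracy in percent).
    Mirrors the paper's definition: plasticity is the mean over tasks of
    (fine-tuned score - zero-shot score).
    """
    assert zero_shot.keys() == fine_tuned.keys(), "task sets must match"
    deltas = [fine_tuned[task] - zero_shot[task] for task in zero_shot]
    return sum(deltas) / len(deltas)
```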
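The linear-probing diagnostic can likewise be sketched compactly. The version below fits a perceptron-style linear classifier on frozen feature vectors and reports training accuracy as a separability proxy; it is a pure-Python stand-in for whatever probe the authors actually trained, and all names here are illustrative.

```python
import random

def linear_probe_accuracy(feats, labels, epochs=200, lr=0.1, seed=0):
    """Fit a perceptron-style linear probe on frozen hidden states.

    `feats`: list of feature vectors (frozen hidden states); `labels`: 0/1.
    Returns training accuracy, used here as a linear-separability proxy.
    """
    rng = random.Random(seed)
    dim = len(feats[0])
    w, b = [0.0] * dim, 0.0
    idx = list(range(len(feats)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            score = sum(wj * xj for wj, xj in zip(w, feats[i])) + b
            err = labels[i] - (1 if score > 0 else 0)  # perceptron update
            if err:
                w = [wj + lr * err * xj for wj, xj in zip(w, feats[i])]
                b += lr * err
    correct = sum(
        (1 if sum(wj * xj for wj, xj in zip(w, f)) + b > 0 else 0) == y
        for f, y in zip(feats, labels)
    )
    return correct / len(feats)
```

On the paper's reading, high-decay models should yield representations on which such a probe reaches higher accuracy.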
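The attention-entropy diagnostic reduces to Shannon entropy over each attention row. A minimal sketch, assuming row-stochastic attention matrices (each row sums to 1):

```python
import math

def attention_entropy(attn_row):
    """Shannon entropy (in nats) of one attention distribution.

    Lower entropy means a more peaked (less diffuse) attention pattern.
    """
    return -sum(p * math.log(p) for p in attn_row if p > 0)

def mean_head_entropy(attn):
    """Average row entropy over a full attention matrix for one head."""
    return sum(attention_entropy(row) for row in attn) / len(attn)
```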
Results & Findings
| Weight Decay | Pre‑training Perplexity (↓ better) | Avg. Fine‑tuned Accuracy (↑ better) | Plasticity Δ (↑ better) |
|---|---|---|---|
| 0.0 | 12.3 | 78.1 % | +3.2 % |
| 0.01 | 12.9 | 80.5 % | +5.8 % |
| 0.1 | 13.7 | 81.9 % | +8.4 % |
Key take‑aways
- Higher weight decay consistently improves plasticity, even though it modestly worsens raw perplexity.
- Linear probes achieve higher accuracy on high‑decay models, indicating more linearly separable internal representations.
- Attention matrices become more concentrated (lower entropy) with tighter singular‑value spectra, suggesting less noisy, more reusable attention patterns.
- Memorization tests show a ~30 % reduction in exact‑training‑sentence recall for high‑decay models, confirming reduced over‑fitting.
Overall, the authors conclude that weight decay reshapes the representation space into a “more adaptable” form, making downstream fine‑tuning more efficient.
Practical Implications
- Hyper‑parameter tuning pipelines: Teams building LLMs should add a plasticity checkpoint (e.g., a quick fine‑tune on a small validation task) to the hyper‑parameter search, rather than relying exclusively on pre‑training loss.
- Model selection for downstream products: When the end goal is a fine‑tuned model (e.g., domain‑specific chatbots, code assistants), opting for a slightly higher weight‑decay setting can yield better final performance without extra compute.
- Resource allocation: Since higher weight decay can reduce the need for extensive fine‑tuning epochs (the model adapts faster), developers may save on GPU hours in downstream training.
- Regularization strategy: The findings encourage revisiting other regularizers (e.g., dropout, label smoothing) through the plasticity lens, potentially uncovering similar hidden benefits.
- Interpretability & safety: More linearly separable representations and less memorization may translate to models that are easier to audit and less prone to unintentionally leaking training data.
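The plasticity-checkpoint idea above amounts to changing the selection criterion in a hyper-parameter search. A hypothetical sketch, using the summary table's numbers (weight decay, pre-training perplexity, plasticity delta) as candidate records; the function and tuple layout are illustrative, not the paper's tooling:

```python
def select_by_plasticity(candidates):
    """Pick the pre-training config with the best quick-fine-tune gain.

    Each candidate is (config, pretrain_perplexity, plasticity_delta).
    Selection keys on the plasticity delta, not on perplexity alone.
    """
    return max(candidates, key=lambda c: c[2])

# Candidate records built from the results table above.
runs = [
    ({"weight_decay": 0.0},  12.3, 3.2),
    ({"weight_decay": 0.01}, 12.9, 5.8),
    ({"weight_decay": 0.1},  13.7, 8.4),
]
best = select_by_plasticity(runs)
# Note: a perplexity-only criterion would instead pick weight_decay=0.0.
```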
Limitations & Future Work
- Scope of architectures: Experiments focus on standard decoder‑only transformers; it remains unclear how the results transfer to encoder‑only or encoder‑decoder models.
- Task diversity: While the benchmark suite is broad, it leans heavily on NLP classification and QA; other modalities (e.g., code generation, multimodal tasks) need evaluation.
- Weight‑decay range: Extremely high decay values (>0.1) were not explored due to instability; the optimal trade‑off may be dataset‑dependent.
- Theoretical grounding: The paper offers empirical mechanistic hypotheses but stops short of a formal theory linking weight decay to representation geometry.
Future research directions include extending plasticity‑aware hyper‑parameter optimization to other regularizers, studying the interaction with optimizer choice (AdamW vs. SGD), and formalizing the geometric effects of weight decay on transformer latent spaces.
Authors
- Tessa Han
- Sebastian Bordt
- Hanlin Zhang
- Sham Kakade
Paper Information
- arXiv ID: 2602.11137v1
- Categories: cs.LG, cs.AI, cs.CL
- Published: February 11, 2026