[Paper] It's not a Lottery, it's a Race: Understanding How Gradient Descent Adapts the Network's Capacity to the Task

Published: February 4, 2026 at 01:22 PM EST
4 min read
Source: arXiv

Overview

Hannah Pinson’s paper tackles a puzzling question that many of us have observed in practice: why does gradient descent seem to “shrink” a neural network’s capacity to just what the task needs? By zooming in on the dynamics of individual ReLU neurons in a single‑hidden‑layer network, the work uncovers three simple principles—mutual alignment, unlocking, and racing—that explain how training automatically organizes and prunes the model. The findings also shed light on the famous Lottery Ticket Hypothesis, showing why a few lucky initializations end up dominating after training.

Key Contributions

  • Three dynamical principles (mutual alignment, unlocking, racing) that together describe how gradient descent reallocates capacity among neurons.
  • Analytical proof that these principles cause redundant neurons to merge or become negligible, providing a theoretical basis for post‑training pruning.
  • Mechanistic explanation of the Lottery Ticket Hypothesis, linking high‑norm weight growth to favorable initial conditions identified by the three principles.
  • Empirical validation on synthetic and real‑world datasets (including MNIST and CIFAR‑10) demonstrating the predicted neuron‑level behavior.
  • Practical guidelines for designing initialization schemes and pruning strategies that align with the identified dynamics.

Methodology

  1. Model Setup – The study focuses on a single hidden‑layer network with ReLU activations, a setting that is mathematically tractable yet expressive enough to capture key phenomena.
  2. Neuron‑Level Dynamics – By writing the gradient descent update for each hidden neuron’s weight vector, the author isolates three interacting forces:
    • Mutual Alignment: neurons with similar input‑space directions gradually align, reducing redundancy.
    • Unlocking: once a neuron’s direction aligns, its magnitude can increase (“unlock”) without destabilizing the loss.
    • Racing: neurons compete for the same feature; the one that first reaches a critical norm dominates, while the others are suppressed.
  3. Theoretical Analysis – Using tools from dynamical systems and convex geometry, the paper proves that under mild assumptions these forces drive the network toward a low‑effective‑capacity configuration.
  4. Experiments – Simulations track weight norms, pairwise cosine similarities, and loss trajectories. The author also prunes low‑norm neurons after training to verify that performance remains unchanged, confirming the “capacity reduction” effect.

The approach stays at a level that developers can follow: think of each neuron as a “player” in a game where alignment, unlocking, and racing dictate who stays active.
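To make the game concrete, here is a minimal NumPy sketch of the paper's setting: a single-hidden-layer ReLU network trained by plain gradient descent on a toy regression task whose target is a single ReLU feature. All hyperparameters (dimensions, learning rate, step count) are illustrative choices, not values from the paper; the point is only to expose the per-neuron quantities the analysis tracks, namely weight norms and alignment with the target direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative; hyperparameters are not from the paper):
# a one-hidden-layer ReLU net y_hat = a . relu(W x), trained on a target
# that is a single ReLU feature along the direction w_star.
d, h, n = 5, 8, 256                   # input dim, hidden width, samples
X = rng.normal(size=(n, d))
w_star = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # ground-truth direction
y = np.maximum(X @ w_star, 0.0)                # target: one ReLU feature

W = 0.1 * rng.normal(size=(h, d))     # hidden weights, small init
a = 0.1 * rng.normal(size=h)          # output weights
lr = 0.05
losses = []

for step in range(2000):
    pre = X @ W.T                     # (n, h) pre-activations
    act = np.maximum(pre, 0.0)        # ReLU
    err = act @ a - y                 # residual per sample
    losses.append(0.5 * np.mean(err ** 2))
    # Full-batch gradients of the mean-squared error
    grad_a = act.T @ err / n
    grad_W = ((err[:, None] * a[None, :]) * (pre > 0)).T @ X / n
    a -= lr * grad_a
    W -= lr * grad_W

norms = np.linalg.norm(W, axis=1)
# Alignment of each hidden neuron with the target direction
cos = (W @ w_star) / (norms + 1e-12)
print("final loss:", losses[-1])
print("weight norms:", np.round(norms, 3))
print("alignment with w_star:", np.round(cos, 2))
```

Plotting `norms` and `cos` over training steps (rather than just at the end, as here) is how one would watch the alignment and racing phases play out.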

Results & Findings

| Observation | What the paper shows |
| --- | --- |
| Neuron alignment | Cosine similarity between many hidden units rises sharply early in training, indicating they are learning the same feature direction. |
| Weight norm divergence | A small subset of neurons quickly attains much larger norms (the “racing” winners), while the others stay near zero. |
| Effective capacity drop | Pruning neurons whose norms fall below a tiny threshold (e.g., 1e-4) does not hurt test accuracy, confirming that the network has already “compressed” itself. |
| Lottery ticket link | Neurons that start with favorable initial alignment (i.e., close to the optimal direction) are the ones that win the race, providing a concrete mechanism for why certain random seeds produce “winning tickets.” |
| Generalization | Networks that undergo stronger alignment (e.g., with higher learning rates) tend to generalize better, suggesting that controlled capacity reduction is beneficial. |

Overall, the experiments validate the three principles across both synthetic tasks (where ground truth is known) and standard vision benchmarks.
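The “effective capacity drop” check is easy to reproduce in spirit. The sketch below (an illustration, not the paper's code) builds a one-hidden-layer ReLU net in which half the neurons have collapsed to near-zero norm, as the racing dynamics predict after training, then prunes everything below the 1e-4 magnitude threshold and confirms the outputs are essentially unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative pruning check: simulate the post-training state of a
# single-hidden-layer ReLU net where half the neurons "lost the race".
d, h = 5, 8
W = rng.normal(size=(h, d))           # hidden weights
a = rng.normal(size=h)                # output weights
W[4:] *= 1e-6                         # losers: norms collapse toward zero

def forward(X, W, a):
    return np.maximum(X @ W.T, 0.0) @ a

X = rng.normal(size=(100, d))
keep = np.linalg.norm(W, axis=1) > 1e-4   # magnitude threshold from the paper
full_out = forward(X, W, a)
pruned_out = forward(X, W[keep], a[keep])
print("kept", keep.sum(), "of", h, "neurons;",
      "max output change:", np.abs(full_out - pruned_out).max())
```

In a real experiment the near-zero norms would come from training rather than being injected by hand, but the pruning-and-compare step is the same.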

Practical Implications

  • Smarter Pruning Pipelines – Instead of heuristic magnitude‑based pruning, developers can monitor alignment and norm‑racing during training to identify truly redundant neurons early on.
  • Initialization Strategies – Seeding weights with a slight bias toward diverse directions (e.g., orthogonal initialization) can reduce the number of “racing” collisions, leading to more balanced networks and potentially better robustness.
  • Learning‑Rate Schedules – Aggressive early learning rates amplify mutual alignment, which may be a cheap way to encourage capacity reduction before fine‑tuning.
  • Model Compression – The theory justifies aggressive post‑training compression (e.g., weight‑sharing or neuron merging) because the network has already collapsed equivalent units.
  • Lottery Ticket Search – Instead of exhaustive rewinding, one could track early‑phase norm growth to spot promising “tickets” on‑the‑fly, cutting down the compute needed for lottery‑ticket experiments.

For engineers building edge‑AI or resource‑constrained services, these insights translate into lighter models with little or no loss in accuracy, and training recipes that naturally produce compressible networks.
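The on-the-fly ticket-spotting idea can be sketched as a simple scoring helper. Everything here is hypothetical (the function name, the growth-ratio score, and the fabricated snapshots are this sketch's own choices, not an API or result from the paper): given weight snapshots from early training, rank neurons by relative norm growth, on the premise that fast-growing neurons are the likely race winners.

```python
import numpy as np

# Hypothetical early-phase "ticket" scoring (an illustration of the idea,
# not the paper's method): rank hidden neurons by the relative growth of
# their weight norms over the first few training steps.
def ticket_scores(snapshots):
    """snapshots: list of (h, d) hidden-weight matrices from early training."""
    first = np.linalg.norm(snapshots[0], axis=1)
    last = np.linalg.norm(snapshots[-1], axis=1)
    return last / (first + 1e-12)     # norm-growth ratio per neuron

rng = np.random.default_rng(2)
W0 = 0.1 * rng.normal(size=(6, 4))
# Fabricated snapshots for the demo: neurons 0 and 3 "win the race".
growth = np.array([3.0, 1.0, 1.1, 2.5, 0.9, 1.0])
W1 = W0 * growth[:, None]
scores = ticket_scores([W0, W1])
top = np.argsort(scores)[::-1][:2]
print("top ticket candidates:", top)
```

In practice the snapshots would be logged during the first few epochs of a real run, and the top-scoring neurons kept as the candidate subnetwork instead of rewinding and retraining.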

Limitations & Future Work

  • Single‑Layer Focus – The analysis is limited to one hidden layer; extending the principles to deep, multi‑layer architectures remains an open challenge.
  • ReLU Specificity – While ReLU is ubiquitous, it’s unclear how the dynamics change with other activations (e.g., Swish, GELU).
  • Assumption of Small Learning Rates – Some proofs rely on infinitesimal step sizes; real‑world training often uses larger, adaptive rates.
  • Empirical Scope – Experiments cover vision benchmarks; testing on NLP or reinforcement‑learning tasks would strengthen the claim of universality.
  • Interaction with Regularization – The paper does not fully explore how dropout, weight decay, or batch norm interact with the three principles.

Future work could aim to generalize the theory to deep nets, investigate activation‑agnostic dynamics, and integrate the principles into automated model‑compression toolchains.

Authors

  • Hannah Pinson

Paper Information

  • arXiv ID: 2602.04832v1
  • Categories: cs.LG, cs.AI, cs.CV, cs.NE
  • Published: February 4, 2026