[Paper] Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures

Published: December 23, 2025 at 01:55 PM EST
4 min read

Source: arXiv - 2512.20607v1

Overview

A new theoretical paper uncovers why deep networks tend to learn “simple” solutions first and only later move to more complex ones—a phenomenon known as simplicity bias. By modeling the training trajectory as a sequence of saddle‑to‑saddle transitions, the authors provide a unified explanation that works for fully‑connected, convolutional, and attention‑based models.

Key Contributions

  • Unified saddle‑to‑saddle framework that captures simplicity bias across a broad family of architectures (FC, CNN, Transformers).
  • Concrete interpretation of “simplicity” for each architecture (one illustrative metric is sketched after this list):
    • Linear nets → low‑rank weight matrices.
    • ReLU nets → few activation “kinks”.
    • ConvNets → small number of active convolutional kernels.
    • Self‑attention → few attention heads.
  • Mathematical analysis of gradient‑descent dynamics using fixed points, invariant manifolds, and plateaus, showing how training repeatedly lingers near saddles before jumping to a new manifold.
  • Insights into data distribution & initialization, explaining why certain datasets or weight scales produce longer or more numerous learning plateaus.
  • Predictive formulas for the duration of each plateau as a function of network width, learning rate, and data statistics.
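
For concreteness, the “low‑rank” notion above can be tracked with a simple effective‑rank count of a weight matrix. The sketch below is our own illustration, not code from the paper; the threshold `tol` and the metric itself are illustrative choices.

```python
import numpy as np

def effective_rank(W, tol=1e-3):
    """Effective rank: number of singular values above tol times the largest.

    A rough proxy for the 'low-rank weight matrices' complexity measure
    attributed to linear networks above.
    """
    s = np.linalg.svd(W, compute_uv=False)   # singular values, descending
    if s[0] == 0.0:
        return 0
    return int(np.sum(s > tol * s[0]))

# Example: a 10x10 matrix that is rank-2 up to tiny noise
rng = np.random.default_rng(0)
W = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 10))
W += 1e-6 * rng.standard_normal((10, 10))
print(effective_rank(W))  # -> 2
```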

Methodology

  1. Model class – The authors consider a generic feed‑forward network expressed as a composition of linear maps and element‑wise nonlinearities, covering FC, convolutional, and multi‑head attention layers.
  2. Gradient‑descent dynamics – They write the continuous‑time gradient flow (ODE) for the parameters and identify saddle points (unstable equilibria) that correspond to low‑complexity solutions.
  3. Invariant manifolds – By linearizing around each saddle, they derive low‑dimensional subspaces (manifolds) that the trajectory follows for a long time, creating a “plateau”.
  4. Saddle‑to‑saddle transition – When the gradient component orthogonal to the current manifold grows strong enough, the trajectory escapes the neighborhood of the current saddle and moves toward the next, higher‑complexity saddle (a toy numerical illustration follows this list).
  5. Architecture‑specific mapping – They map the abstract notion of “dimension of the manifold” to concrete architectural quantities (rank, number of kinks, kernels, heads).
  6. Empirical validation – Small‑scale experiments on synthetic and real datasets illustrate the predicted plateaus and the progressive increase in the measured complexity metrics.
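
To make steps 2–4 concrete, here is a toy gradient‑descent simulation of a two‑layer linear network fitting a diagonal target with well‑separated singular values. It is not the authors’ code; the target, learning rate, and initialization scale are our own choices. With a small initialization, the loss should drop in discrete waves separated by long plateaus, which is the saddle‑to‑saddle picture in miniature.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
Sigma = np.diag([3.0, 1.0, 0.3])            # target map with well-separated singular values
W1 = 1e-4 * rng.standard_normal((d, d))     # small initialization -> pronounced plateaus
W2 = 1e-4 * rng.standard_normal((d, d))
lr = 1e-2                                   # small step size, approximating gradient flow

for step in range(4001):
    E = W2 @ W1 - Sigma                     # residual of L = 0.5 * ||W2 W1 - Sigma||_F^2
    g1, g2 = W2.T @ E, E @ W1.T             # dL/dW1, dL/dW2
    W1 -= lr * g1
    W2 -= lr * g2
    if step % 500 == 0:
        sv = np.linalg.svd(W2 @ W1, compute_uv=False)
        print(step, round(0.5 * float(np.sum(E ** 2)), 3), np.round(sv, 2))
```

On a typical run, the singular values of the product W2 @ W1 approach 3.0, 1.0, and 0.3 at successively later times, with the loss sitting nearly flat between those jumps.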

Results & Findings

  • Linear networks: Training first discovers the lowest‑rank solution that fits the data, then gradually adds rank‑1 components, matching the classic “rank‑increasing” behavior.
  • ReLU networks: The number of activation kinks (points where the piecewise‑linear function changes slope) grows step‑wise, mirroring the observed increase in model capacity during training (a toy kink‑counting routine is sketched after this list).
  • Convolutional nets: Early epochs use only a few effective kernels; additional kernels become active only after a plateau, explaining why early filters often look generic (e.g., edge detectors).
  • Self‑attention models: The number of heads that contribute non‑trivially to the output rises over time, providing a theoretical basis for the empirical observation that attention heads “specialize” later in training.
  • Plateau duration: The theory predicts that the length of each plateau scales logarithmically with the ratio of the learning rate to the eigenvalue gap of the data covariance, and linearly with network width. Experiments confirm these scaling laws.
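
The “number of kinks” metric can be made concrete for a one‑hidden‑layer ReLU network on scalar inputs, f(x) = sum_i v[i] * relu(w[i] * x + b[i]): each unit with nonzero w[i] and v[i] contributes a kink at x = -b[i] / w[i]. The routine below is our own simplified illustration; the tolerance for deciding that a unit is “effectively active” is an arbitrary choice.

```python
import numpy as np

def count_kinks(w, b, v, x_min=-5.0, x_max=5.0, tol=1e-8):
    """Count kinks of f(x) = sum_i v[i] * relu(w[i] * x + b[i]) on [x_min, x_max].

    Unit i contributes a kink at x = -b[i] / w[i] (where its pre-activation
    changes sign), provided it actually changes the slope (|w[i] * v[i]| > tol)
    and the kink lies inside the interval.
    """
    kinks = []
    for wi, bi, vi in zip(w, b, v):
        if abs(wi * vi) > tol:                  # unit is effectively active
            x0 = -bi / wi
            if x_min < x0 < x_max:
                kinks.append(x0)
    return len(np.unique(np.round(kinks, 6)))   # merge coincident kinks

# A 5-unit network in which only two units are effectively active
w = np.array([1.0, -2.0, 0.0, 1e-12, 3.0])
b = np.array([0.5,  1.0, 0.3, 0.1,  -0.6])
v = np.array([2.0,  1.5, 4.0, 1.0,   0.0])
print(count_kinks(w, b, v))  # -> 2
```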

Practical Implications

  • Curriculum design – Knowing that networks naturally progress from low to high complexity suggests that data can be staged to align with these plateaus (e.g., start with coarse labels, add fine‑grained details later).
  • Early‑stopping heuristics – Monitoring the identified complexity metrics (rank, active kernels, heads) can signal when the model is still on a low‑complexity plateau, helping avoid premature stopping (a naive plateau detector is sketched after this list).
  • Architecture selection – If a task demands rapid acquisition of high‑complexity features (e.g., fine‑grained image details), designers might increase learning rates or use initialization schemes that shrink the early plateaus.
  • Debugging training stalls – Plateaus that are longer than predicted may indicate data distribution issues (e.g., highly correlated features) or suboptimal hyper‑parameters, guiding targeted interventions.
  • Resource allocation – Understanding that additional compute primarily pays off during the transitions between saddles can inform budgeting for large‑scale training runs (e.g., allocate more GPU hours around expected transition points).
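
As a sketch of the early‑stopping heuristic above: log a complexity metric (effective rank, active kernels, active heads) or simply the loss at each step, and treat a long flat stretch as a sign that training is lingering near a saddle rather than having converged. The window and tolerance below are placeholders, not values from the paper.

```python
def on_plateau(history, window=200, rel_tol=1e-3):
    """Return True if the tracked quantity has changed by less than rel_tol
    (relative to its magnitude) over the last `window` recorded values --
    a naive signal that training is sitting on a plateau."""
    if len(history) < window:
        return False
    recent = history[-window:]
    lo, hi = min(recent), max(recent)
    scale = max(abs(hi), abs(lo), 1e-12)
    return (hi - lo) / scale < rel_tol

# Example: the loss has been flat since the first transition -- likely a
# plateau, so a further drop in loss (and rise in complexity) may still come.
history = [5.045] * 50 + [0.545] * 300
print(on_plateau(history))  # -> True
```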

Limitations & Future Work

  • The analysis assumes continuous‑time gradient flow and small learning rates; discrete‑step optimizers with momentum or adaptive schedules may deviate from the predicted dynamics.
  • Experiments are limited to relatively small models and synthetic datasets; scaling the framework to billions‑parameter Transformers remains an open challenge.
  • The current theory treats data distribution as static; extending it to non‑stationary or streaming data scenarios could broaden its applicability.
  • Future work could explore regularization effects (dropout, weight decay) on saddle‑to‑saddle transitions and investigate whether explicit architectural constraints can deliberately shape the simplicity‑bias trajectory.

Authors

  • Yedi Zhang
  • Andrew Saxe
  • Peter E. Latham

Paper Information

  • arXiv ID: 2512.20607v1
  • Categories: cs.LG
  • Published: December 23, 2025