[Paper] Generalization at the Edge of Stability

Published: April 21, 2026 at 01:59 PM EDT
5 min read
Source: arXiv


Overview

Modern deep‑learning practitioners have noticed that training with large learning rates—pushing the optimizer to the “edge of stability”—often yields surprisingly good test performance. This paper reframes that chaotic training regime as a random dynamical system that settles onto a low‑dimensional fractal attractor, and it derives a new sharpness dimension‑based generalization bound that explains why such instability can be beneficial.

Key Contributions

  • Random‑dynamical‑system view: Shows stochastic optimizers (SGD, Adam, etc.) behave like random dynamical systems whose long‑run states form fractal attractors rather than single points.
  • Sharpness dimension: Introduces a novel complexity measure based on the Lyapunov (fractal) dimension of the attractor, capturing the full Hessian spectrum instead of just its trace or spectral norm.
  • Generalization bound: Proves that test error scales with the sharpness dimension, linking chaotic dynamics directly to generalization performance.
  • Empirical validation: Demonstrates the theory on multilayer perceptrons and transformer models, reproducing the “edge‑of‑stability” learning curves and shedding light on the grokking phenomenon.
  • Practical diagnostics: Provides tools to estimate the sharpness dimension from training logs, enabling developers to monitor when a model is entering the beneficial chaotic regime.
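For intuition, the Lyapunov (fractal) dimension referenced above is commonly computed with the Kaplan–Yorke formula, \(d_L = k + \sum_{i=1}^{k}\lambda_i / |\lambda_{k+1}|\), where \(k\) is the largest index at which the partial sum of the sorted Lyapunov exponents is still non-negative. Below is a minimal sketch on a made-up exponent spectrum; the paper's exact estimator may differ.

```python
import numpy as np

def kaplan_yorke_dimension(exponents):
    """Kaplan-Yorke (Lyapunov) dimension of an attractor.

    exponents: Lyapunov exponents, in any order.
    Returns k + (sum of first k exponents) / |lambda_{k+1}|, where k is
    the largest index whose cumulative sum is still non-negative.
    """
    lam = np.sort(np.asarray(exponents, dtype=float))[::-1]  # descending
    partial = np.cumsum(lam)
    nonneg = np.nonzero(partial >= 0)[0]
    if len(nonneg) == 0:
        return 0.0  # contracting in every direction: point attractor
    k = nonneg[-1] + 1
    if k == len(lam):
        return float(len(lam))  # expansion never exhausted
    return k + partial[k - 1] / abs(lam[k])

# Hypothetical spectrum: two expanding, three contracting directions
print(kaplan_yorke_dimension([0.4, 0.1, -0.2, -0.6, -1.0]))  # 3.5
```

The fractional part (here 0.5) is what distinguishes a fractal attractor from a fixed point or a smooth manifold of integer dimension.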

Methodology

  1. Model the optimizer as a random dynamical system (RDS).
    • Each SGD/Adam update is treated as a stochastic map \(x_{t+1} = f_{\theta_t}(x_t) + \xi_t\), where \(\xi_t\) captures gradient noise.
  2. Analyze the long‑term attractor.
    • Using concepts from Lyapunov theory, the authors show that under large learning rates the RDS does not converge to a fixed point but to a fractal attractor with intrinsic (Lyapunov) dimension \(d_L\).
  3. Define sharpness dimension.
    • They compute \(d_L\) from the full Hessian spectrum and the determinants of its principal sub‑matrices, yielding a scalar that reflects how “sharp” or “flat” the loss landscape is in the chaotic regime.
  4. Derive a generalization bound.
    • By extending PAC‑Bayesian arguments to fractal attractors, they prove that the expected test loss is bounded by a term proportional to \(\sqrt{d_L / n}\), where \(n\) is the number of training samples.
  5. Experimental pipeline.
    • Train MLPs on MNIST/Fashion‑MNIST and transformers on language modeling tasks with a sweep of learning rates.
    • Estimate the sharpness dimension via stochastic Lanczos quadrature on the Hessian and compare against validation accuracy and grokking curves.
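The update map in step 1 can be made concrete with a deliberately tiny example (not from the paper): noisy gradient descent on a one‑dimensional quadratic loss \(L(x) = \tfrac{1}{2}\lambda x^2\), whose deterministic stability threshold is \(\eta = 2/\lambda\). A quadratic cannot produce a fractal attractor, but it does exhibit the stable/edge/unstable transition the analysis builds on.

```python
import numpy as np

def run_sgd(eta, steps=200, lam=1.0, noise=1e-3, seed=0):
    """Iterate the stochastic map x_{t+1} = x_t - eta * lam * x_t + xi_t,
    i.e. noisy gradient descent on the toy loss L(x) = 0.5 * lam * x**2."""
    rng = np.random.default_rng(seed)
    x = 1.0
    for _ in range(steps):
        x = x - eta * lam * x + noise * rng.standard_normal()
    return x

# Deterministic stability threshold for this loss: eta = 2 / lam.
x_stable = run_sgd(eta=0.5)     # |1 - eta*lam| = 0.5 < 1: contracts to the noise floor
x_edge = run_sgd(eta=1.9)       # |1 - eta*lam| = 0.9: slow, oscillatory contraction
x_unstable = run_sgd(eta=2.1)   # |1 - eta*lam| = 1.1 > 1: iterates blow up
print(abs(x_stable), abs(x_edge), abs(x_unstable))
```

In a real network the loss is non‑quadratic, so iterates just past the threshold can remain bounded on a chaotic attractor instead of diverging; that bounded chaotic regime is where the paper's analysis applies.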

Results & Findings

| Model | Learning‑rate regime | Observed behavior | Sharpness dimension (≈) | Test accuracy |
|---|---|---|---|---|
| MLP (2‑layer) | Small LR (< 0.01) | Stable convergence, modest accuracy | 0.9 | 96% |
| MLP (2‑layer) | Edge‑of‑stability (≈ 0.1) | Oscillatory loss, higher accuracy | 2.3 | 98.5% |
| Transformer (GPT‑small) | Edge‑of‑stability (≈ 5e‑4) | Periodic spikes in loss, grokking after many epochs | 4.7 | 92% (vs 88% at small LR) |
  • Fractal attractors appear only when the learning rate exceeds a critical threshold, matching the “edge of stability” identified in prior empirical work.
  • Sharpness dimension correlates strongly (Pearson ≈ 0.85) with final test performance across all experiments, outperforming traditional sharpness metrics (trace of Hessian, spectral norm).
  • In grokking experiments, the sharpness dimension drops sharply right before the sudden jump in test accuracy, suggesting that the model’s dynamics transition to a lower‑dimensional attractor that encodes a more robust solution.

Practical Implications

  • Learning‑rate tuning: Instead of treating large learning rates as risky, developers can deliberately push into the chaotic regime and monitor the sharpness dimension to ensure they stay on the “good” side of the edge.
  • Training diagnostics: The sharpness dimension can be estimated on‑the‑fly (e.g., every few hundred steps) using cheap Hessian‑vector products, giving an early warning if the optimizer is drifting into an overly chaotic region that harms generalization.
  • Model selection for limited data: Since the bound scales with \(\sqrt{d_L/n}\), models that naturally settle onto low‑dimensional attractors (e.g., certain transformer architectures) may be preferable when training data is scarce.
  • Understanding grokking: The theory provides a concrete explanation for why models sometimes memorize training data for many epochs before suddenly generalizing—this corresponds to a dimensional collapse of the attractor. Practitioners can exploit this by scheduling learning‑rate decay to trigger the collapse at a desired time.
  • Regularization alternatives: Traditional weight decay or batch‑norm aim to flatten the loss landscape; the sharpness dimension suggests that controlled chaos can be an alternative regularizer, potentially reducing the need for heavy explicit penalties.
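As a sketch of the “cheap Hessian‑vector products” idea from the diagnostics bullet: the top Hessian eigenvalue can be estimated with finite‑difference HVPs plus power iteration, using only gradient calls. The paper's stochastic Lanczos quadrature recovers more of the spectrum, but the HVP primitive is the same. The function names and the toy quadratic below are illustrative, not from the paper.

```python
import numpy as np

def hvp(grad_fn, w, v, eps=1e-5):
    """Finite-difference Hessian-vector product:
    H @ v ~= (grad(w + eps*v) - grad(w - eps*v)) / (2*eps)."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)

def top_eigenvalue(grad_fn, w, iters=100, seed=0):
    """Power iteration on the Hessian at w, using only gradient calls."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, w, v)
        lam = float(v @ hv)          # Rayleigh quotient estimate
        v = hv / np.linalg.norm(hv)  # re-normalize for the next step
    return lam

# Toy check on a quadratic loss with known Hessian diag(1, 2, 5):
H = np.diag([1.0, 2.0, 5.0])
grad = lambda w: H @ w
print(top_eigenvalue(grad, np.zeros(3)))  # ≈ 5
```

Each iteration costs two gradient evaluations, so running this every few hundred training steps adds little overhead relative to training itself.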

Limitations & Future Work

  • Hessian estimation overhead: Accurate computation of the full Hessian spectrum remains expensive for very large models; the current approach relies on stochastic approximations that may be noisy.
  • Assumption of stationary noise: The RDS analysis assumes gradient noise is i.i.d., which may not hold for highly non‑stationary data streams or curriculum learning.
  • Scope of architectures: Experiments focus on relatively small MLPs and transformer‑style language models; extending the theory to convolutional nets, graph neural networks, or reinforcement‑learning agents is an open question.
  • Theoretical tightness: The derived bound, while insightful, is still loose compared to empirical gaps; refining the constants and exploring tighter fractal‑dimension‑based bounds is a promising direction.

Bottom line: By viewing large‑learning‑rate training through the lens of random dynamical systems and introducing the sharpness dimension, this work equips developers with a new, theoretically grounded tool to harness the edge‑of‑stability regime for better generalization.

Authors

  • Mario Tuci
  • Caner Korkmaz
  • Umut Şimşekli
  • Tolga Birdal

Paper Information

  • arXiv ID: 2604.19740v1
  • Categories: cs.LG, cs.AI, cs.CV, stat.ML
  • Published: April 21, 2026
