[Paper] On the implicit regularization of Langevin dynamics with projected noise

Published: February 12, 2026
Source: arXiv (2602.12257v1)

Overview

The paper investigates how symmetry in over‑parameterized models shapes the behavior of stochastic gradient descent (SGD) when it is modeled by Langevin dynamics. By projecting the random noise onto directions orthogonal to a group of symmetries, the authors uncover a new form of implicit regularization that arises purely from the geometry of the symmetry group, offering fresh insight into why SGD often finds good solutions in deep learning.

Key Contributions

  • Projected‑noise Langevin dynamics: Introduces a mathematically rigorous version of SGD where stochastic perturbations are confined to directions that do not move the parameters along symmetry orbits.
  • Equivalence to isotropic diffusion with extra drift: Shows that, when both the initial distribution and the target distribution respect the symmetry, the projected‑noise process has the same law as a standard Langevin diffusion plus a deterministic drift term.
  • Geometric interpretation of the drift: Identifies the extra drift as the negative gradient of the log volume of the group orbit, i.e., the mean curvature vector of the orbit manifold.
  • Coupling construction: Provides an explicit coupling between the projected‑noise process, the isotropic process, and a third process evolving on the symmetry group itself, establishing the equivalence in law.
  • Implications for over‑parameterized models: Offers a concrete mechanism by which symmetry‑induced regularization can bias SGD toward “simpler” solutions without any explicit penalty.

Methodology

  1. Model setup:

    • Consider a smooth parameter space \( \Theta \) on which a compact Lie group \( G \) acts by isometries (e.g., permutations of neurons, weight‑scaling symmetries).
    • Define the standard overdamped Langevin SDE:
      \[ d\theta_t = -\nabla V(\theta_t)\,dt + \sqrt{2\beta^{-1}}\,dW_t, \]
      where \( V \) is the loss (potential) and \( W_t \) is standard Brownian motion.
    • Project the noise onto the orthogonal complement of the tangent space to the group orbit, yielding
      \[ d\theta_t = -\nabla V(\theta_t)\,dt + \sqrt{2\beta^{-1}}\,\Pi_{\theta_t}^\perp\,dW_t, \]
      where \( \Pi_{\theta}^\perp \) removes the components along symmetry directions.
  2. Coupling via a group process:

    • Introduce a stochastic process \( g_t \in G \) that evolves on the group itself, driven by the same Brownian motion but projected onto the tangent space of the orbit.
    • Show that the pair \( (\theta_t, g_t) \) evolves as a joint diffusion whose marginal on \( \theta_t \) matches the projected‑noise dynamics.
  3. Deriving the extra drift:

    • Applying Itô’s formula to the change of variables that “undoes” the group action, the authors isolate a deterministic term that depends on the Jacobian determinant of the orbit map.
    • This term simplifies to \( -\nabla \log \operatorname{vol}(G \cdot \theta) \), i.e., the negative gradient of the log orbit volume, which is precisely the mean curvature vector of the orbit.
  4. Equivalence proof:

    • Demonstrate that the projected‑noise SDE and the standard isotropic Langevin SDE with the additional drift have identical finite‑dimensional distributions, establishing the claimed equivalence in law.
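The two SDEs above are easy to simulate directly. The sketch below is an illustrative toy, not the authors' code: it takes the parameter space R^2 with G = SO(2) acting by rotation, the invariant loss V(theta) = |theta|^2 / 2, and beta = 1, so the orbit of theta is the circle of radius r = |theta| and the stationary radial law predicted by the equivalence can be written in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

def project_perp(theta, xi):
    """Project the noise xi onto the orthogonal complement of the orbit.

    For G = SO(2) the orbit tangent at theta is the rotation direction
    (perpendicular to theta), so the complement is the radial line and
    the projector is theta theta^T / |theta|^2, applied per particle.
    """
    r2 = np.sum(theta**2, axis=1, keepdims=True)
    return (np.sum(theta * xi, axis=1, keepdims=True) / r2) * theta

n, dt, steps = 4000, 1e-3, 5000
theta = rng.standard_normal((n, 2))  # rotation-invariant initial law

# Euler-Maruyama for d theta = -grad V dt + sqrt(2) Pi_perp dW,
# with the G-invariant toy loss V(theta) = |theta|^2 / 2 and beta = 1.
for _ in range(steps):
    xi = rng.standard_normal((n, 2))
    theta += -theta * dt + np.sqrt(2 * dt) * project_perp(theta, xi)

# Predicted equivalent process: isotropic Langevin plus the drift
# -grad log vol(G.theta) = -grad log(2 pi r) = -theta / r^2.  Its
# stationary radial marginal is proportional to exp(-r^2/2), a
# half-normal with mean sqrt(2/pi); plain isotropic Langevin would
# instead give the marginal r exp(-r^2/2), with mean sqrt(pi/2).
mean_r = np.linalg.norm(theta, axis=1).mean()
print(f"empirical mean |theta| = {mean_r:.3f}; predicted {np.sqrt(2/np.pi):.3f}")
```

With the seed fixed, the empirical mean radius lands close to the half-normal prediction rather than the plain-Langevin value, which is the qualitative content of the equivalence in this toy case.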

Results & Findings

  • Theorem (Implicit regularization): If the initial density \( \rho_0 \) and the target (stationary) density \( \rho_\infty \propto e^{-\beta V} \) are invariant under \( G \), then the projected‑noise Langevin dynamics is statistically indistinguishable from a standard Langevin diffusion with the extra drift term \( -\nabla \log \operatorname{vol}(G \cdot \theta) \).
  • Geometric insight: The extra drift pushes the trajectory toward regions where the symmetry orbit has smaller volume, effectively preferring parameter configurations that are “less redundant” under the group action.
  • Mean curvature connection: The drift equals the mean curvature vector of the orbit manifold, linking stochastic optimization to classic differential geometry.
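The mean-curvature identity can be checked numerically in a simple case. The snippet below is a toy verification, not the paper's general proof: it assumes G = SO(n) acting on R^n, where the orbit of theta is the sphere of radius r = |theta| with volume proportional to r^(n-1), and compares a finite-difference gradient of the log orbit volume against the sphere's mean curvature vector -(n-1) theta / r^2.

```python
import numpy as np

def log_orbit_vol(theta):
    # For G = SO(n) on R^n the orbit is a sphere of radius |theta|,
    # whose volume is proportional to |theta|^(n-1); constants drop
    # out of the gradient, so we omit them.
    n = theta.size
    return (n - 1) * np.log(np.linalg.norm(theta))

def numerical_grad(f, theta, h=1e-6):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return g

theta = np.array([1.0, -2.0, 0.5])
r2 = theta @ theta
drift = -numerical_grad(log_orbit_vol, theta)       # -grad log vol(G.theta)
mean_curvature = -(theta.size - 1) * theta / r2     # sphere of radius |theta|
print(np.allclose(drift, mean_curvature, atol=1e-6))
```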

Practical Implications

  • Understanding SGD’s bias: In deep networks with weight‑sharing, permutation, or scaling symmetries, the noise injected by minibatch SGD naturally aligns with the projected‑noise model. The derived drift suggests that SGD implicitly penalizes highly symmetric (high‑volume) solutions, which may explain its tendency to find flatter minima.
  • Designing better optimizers: By explicitly adding a curvature‑based regularizer (e.g., the negative log orbit volume) or by shaping the noise to respect model symmetries, practitioners could steer training toward more generalizable solutions without hand‑crafted penalties.
  • Model compression & pruning: Since the drift favors low‑orbit‑volume regions, it may naturally encourage parameter configurations that are easier to compress (fewer redundant degrees of freedom). This insight could guide new compression‑aware training regimes.
  • Robustness to over‑parameterization: The theory provides a principled reason why heavily over‑parameterized models still generalize: symmetry‑induced regularization acts as an invisible “Occam’s razor” during training.
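The optimizer-design idea above can be sketched deterministically: add the log orbit volume as an explicit penalty and run plain gradient descent. Everything below is hypothetical; the SO(2) rotation symmetry, the radial double-well loss, and the penalty weight `lam` are illustrative assumptions, not the paper's recipe.

```python
import numpy as np

# Toy SO(2)-invariant loss with minima on the circle of radius sqrt(2).
def grad_loss(theta):
    return (theta @ theta - 2.0) * theta

# For G = SO(2), vol(G.theta) = 2 pi |theta|, so the penalty is
# log |theta| (up to a constant) and its gradient is theta / |theta|^2.
def grad_log_orbit_vol(theta):
    return theta / (theta @ theta)

lam, lr = 0.1, 0.05          # penalty weight and step size (assumptions)
theta = np.array([2.0, 1.0])
for _ in range(2000):
    # Gradient descent on V(theta) + lam * log vol(G.theta): the penalty
    # plays the role of the implicit drift, pushing toward orbits of
    # smaller volume, i.e., slightly smaller radius here.
    theta -= lr * (grad_loss(theta) + lam * grad_log_orbit_vol(theta))

# Stationarity condition along the radius: (r^2 - 2) r + lam / r = 0,
# whose solution sits just below the unregularized radius sqrt(2).
print(np.linalg.norm(theta))
```

The converged radius is strictly below sqrt(2), illustrating how an explicit orbit-volume penalty biases the solution away from high-volume (more redundant) configurations.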

Limitations & Future Work

  • Assumption of exact symmetry: The analysis requires a perfectly isometric group action and invariant initial/target densities. Real‑world networks often have only approximate symmetries (e.g., due to batch‑norm or dropout).
  • Compact Lie groups: Results are proved for compact groups; extending to non‑compact or discrete symmetry groups (e.g., ReLU activation patterns) remains open.
  • Discrete SGD vs. continuous Langevin: While Langevin dynamics is a useful proxy, minibatch SGD introduces additional discretization effects and non‑Gaussian noise that are not captured here.
  • Computational tractability: Computing the orbit volume or its gradient in high‑dimensional neural nets is non‑trivial; future work could explore efficient estimators or surrogate regularizers.

Authors

  • Govind Menon
  • Austin J. Stromme
  • Adrien Vacher

Paper Information

  • arXiv ID: 2602.12257v1
  • Categories: math.PR, cs.AI
  • Published: February 12, 2026