[Paper] Why Smooth Stability Assumptions Fail for ReLU Learning

Published: December 26, 2025 at 10:17 AM EST
5 min read
Source: arXiv


Overview

The paper Why Smooth Stability Assumptions Fail for ReLU Learning examines a hidden pitfall in many modern analyses of deep learning: they rely on smoothness assumptions (e.g., Lipschitz‑continuous gradients or bounded Hessians) that simply do not hold for networks built with ReLU activations. By constructing a minimal counterexample, the author shows that classic “smooth‑based” stability guarantees break down even when training looks perfectly well‑behaved in practice. The work also proposes the weakest possible nonsmooth condition that still lets us reason about stability, opening the door to more realistic theory for the ReLU‑dominated deep‑learning landscape.

Key Contributions

  • Formal impossibility result: Proves that no global uniform smoothness proxy (gradient Lipschitzness, Hessian bounds, etc.) can hold for ReLU networks, even in low‑dimensional, convex‑loss settings.
  • Concrete counterexample: Provides an explicit, easy‑to‑visualize ReLU network and loss where classic smooth‑based stability bounds are violated despite empirically stable training trajectories.
  • Minimal generalized‑derivative condition: Identifies a “generalized derivative” (Clarke subgradient) requirement that is both necessary and sufficient for restoring meaningful stability statements in nonsmooth settings.
  • Theoretical clarification: Shows why smooth approximations of ReLU (e.g., Softplus) can give misleading guarantees that do not transfer to the true ReLU model.
  • Framework suggestion: Outlines a roadmap for building stability analyses that respect the intrinsic nonsmooth nature of ReLU networks.

Methodology

  1. Problem framing: The author starts from the standard supervised learning setup (parameter vector θ, loss ℓ(θ)) and recalls the common smoothness assumptions used in stability proofs (e.g., ‖∇²ℓ(θ)‖ ≤ L).
  2. Construction of a minimal network: A single‑layer ReLU network with two neurons is paired with a simple quadratic loss. By carefully choosing the data point and initialization, the loss surface exhibits a “kink” where the gradient jumps discontinuously.
  3. Analytical breakdown: The paper derives the exact gradient and Hessian expressions on either side of the kink, showing that any global Lipschitz constant for the gradient would have to be infinite.
  4. Counterexample verification: Numerical simulations trace the gradient descent trajectory, confirming that the optimizer never crosses the kink in practice, which explains why empirical stability is observed despite the theoretical violation.
  5. Generalized derivative analysis: Using Clarke’s subdifferential, the author defines a relaxed smoothness condition (boundedness of the Clarke Jacobian) and proves that under this condition, standard stability arguments (e.g., bounded perturbation response) can be recovered.

The approach stays at a level that developers can follow: it relies on elementary calculus, a tiny network, and standard concepts from nonsmooth analysis rather than heavy functional‑analysis machinery.
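The two‑sided gradient computation in steps 2–3 can be sketched numerically. The data point, weights, and loss below are illustrative choices, not the paper’s exact construction:

```python
import numpy as np

# Illustrative minimal setup: one data point (x, t), two ReLU units with
# fixed output weights a, and a quadratic loss on the scalar prediction.
x, t = 1.0, 2.0
a = np.array([1.0, 1.0])

def relu(z):
    return np.maximum(z, 0.0)

def loss(w):
    y = a @ relu(w * x)            # scalar prediction
    return 0.5 * (y - t) ** 2

def grad(w):
    y = a @ relu(w * x)
    active = (w * x > 0).astype(float)   # indicator of each unit's linear region
    return (y - t) * a * x * active

# The gradient jumps discontinuously as w1 crosses 0 (the "kink"):
eps = 1e-8
g_left  = grad(np.array([-eps, 1.0]))   # w1 just below the kink -> unit inactive
g_right = grad(np.array([ eps, 1.0]))   # w1 just above the kink -> unit active
print(g_left[0], g_right[0])            # approximately 0.0 vs -1.0
```

A gradient jump of size ≈ 1 over a parameter change of 2ε means any gradient‑Lipschitz constant would have to exceed 1/(2ε), i.e., grow without bound as ε → 0, which is the blow‑up described in step 3.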

Results & Findings

  • Impossibility of global smoothness: For the constructed ReLU network, the gradient jumps discontinuously at the activation kink, so no finite global Lipschitz constant for the gradient exists, and the Hessian is undefined there. Consequently, any theorem that assumes a finite global smoothness constant is inapplicable.
  • Empirical‑theoretical mismatch: Even though gradient descent never experiences the nondifferentiable point during training (so the loss curve looks smooth), the underlying theory cannot guarantee stability because the guarantee must hold uniformly over the entire parameter space.
  • Restored stability via Clarke subgradients: By bounding the norm of the Clarke Jacobian (a set‑valued generalization of the gradient), the author proves a version of the classic stability bound: small perturbations in the data or initialization lead to proportionally small changes in the final parameters.
  • Implication for smooth approximations: Replacing ReLU with a smooth surrogate (e.g., Softplus) yields a model that satisfies the smoothness assumptions, but the surrogate’s dynamics can diverge significantly from the true ReLU network, especially near activation boundaries.
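The last point can be made concrete by comparing ReLU and Softplus gradients near the activation boundary (an illustrative sketch; β = 10 is an arbitrary choice):

```python
import numpy as np

def relu_grad(z):
    # Conventional subgradient choice: derivative 0 at the kink itself
    return (z > 0).astype(float)

def softplus_grad(z, beta=10.0):
    # Derivative of (1/beta) * log(1 + exp(beta*z)) is sigmoid(beta*z)
    return 1.0 / (1.0 + np.exp(-beta * np.asarray(z, dtype=float)))

z = np.array([-0.5, -0.01, 0.0, 0.01, 0.5])
print(relu_grad(z))        # exact ReLU: values in {0, 1} with a jump at 0
print(softplus_grad(z))    # smooth surrogate: ~0.5 right at the boundary
```

Softplus with sharpness β has gradient‑Lipschitz constant β/4, so it satisfies the smoothness assumptions, but making the surrogate faithful to ReLU (β → ∞) sends that constant to infinity, which illustrates the mismatch the paper describes.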

Practical Implications

  • Robustness & certification tools: Many robustness verification frameworks (e.g., Lipschitz‑based certifiers) assume gradient Lipschitzness. This paper warns that such tools may give overly optimistic guarantees for ReLU models unless they incorporate nonsmooth analysis.
  • Optimizer design: Adaptive methods that rely on curvature estimates (e.g., L‑BFGS, second‑order Newton steps) need to handle the fact that the Hessian can be undefined or arbitrarily large. Practitioners might prefer first‑order methods or explicitly smooth the loss only where needed.
  • Model compression & pruning: Techniques that prune neurons based on gradient magnitude assume smooth gradients. Understanding the nondifferentiable “kink” structure can lead to more reliable pruning criteria that avoid inadvertently destabilizing the network.
  • Framework updates: Libraries such as PyTorch or JAX could expose Clarke‑subgradient utilities, enabling developers to write stability‑aware training loops that respect ReLU’s nonsmoothness.
  • Guidance for research‑to‑product pipelines: When translating theoretical guarantees (e.g., convergence rates) into production systems, engineers should verify whether the underlying assumptions hold for the actual ReLU architecture, not just for a smoothed proxy.
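As a sketch of what such a Clarke‑subgradient utility might look like, here is a hypothetical helper for ReLU. Neither PyTorch nor JAX ships anything like this today; the function name and interval representation are invented for illustration:

```python
def relu_clarke_subdiff(z, tol=1e-12):
    """Clarke subdifferential of ReLU at z, returned as an interval (lo, hi).

    Away from 0 it is the singleton {relu'(z)}; at the kink it is the full
    interval [0, 1], the convex hull of the one-sided derivatives.
    Hypothetical utility for illustration only.
    """
    if z > tol:
        return (1.0, 1.0)
    if z < -tol:
        return (0.0, 0.0)
    return (0.0, 1.0)

# A stability-aware training loop could bound the worst case over the interval:
lo, hi = relu_clarke_subdiff(0.0)
worst_case_slope = max(abs(lo), abs(hi))   # 1.0 for ReLU at the kink
```

Bounding this set‑valued quantity, rather than a (nonexistent) gradient Lipschitz constant, is the kind of check the paper’s relaxed condition would support.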

Limitations & Future Work

  • Scope of the counterexample: The impossibility proof is demonstrated on a minimal two‑neuron network; while the argument scales conceptually, extending it to deep, highly over‑parameterized networks may require additional technical work.
  • Clarke‑based bounds are still coarse: Bounding the Clarke Jacobian provides a theoretical fix, but the resulting constants can be pessimistic for large‑scale models, limiting practical tightness.
  • Empirical validation missing: The paper focuses on analytical arguments; systematic experiments on modern architectures (ResNets, Transformers) to measure how often training trajectories encounter nondifferentiable regions would strengthen the claim.
  • Tooling gap: No ready‑made software implementation of the proposed nonsmooth stability checks is provided, leaving a gap for immediate adoption.

Future research could explore tighter nonsmooth condition numbers, develop automated detection of “dangerous” activation boundaries during training, and integrate Clarke‑subgradient calculations into mainstream deep‑learning frameworks.

Authors

  • Ronald Katende

Paper Information

  • arXiv ID: 2512.22055v1
  • Categories: cs.LG, math.OC
  • Published: December 26, 2025
