[Paper] Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

Published: (May 7, 2026 at 09:26 PM EDT)
4 min read
Source: arXiv

Source: arXiv - 2605.07097v1

Overview

A new theoretical breakthrough shows that any fixed‑size feedforward neural network whose layers are built from “tame” mathematical operations (formally, those definable in an o‑minimal structure) is guaranteed to learn from a finite amount of data in the agnostic PAC framework. This result covers the most common modern architectures—MLPs, CNNs, GNNs, and even transformers with fixed sequence length—without imposing artificial bounds on the network’s weights.

Key Contributions

  • General PAC learnability theorem for all feedforward networks whose layers are o‑minimal definable, regardless of activation function or parameter magnitude.
  • Unified treatment of diverse architectures (MLPs, CNNs, GNNs, transformers) and typical building blocks (linear maps, residual links, attention, pooling, normalization, positional encodings).
  • Proof that finite‑sample learnability is a baseline property of tame feedforward computation, not a special case of specific activations or VC‑dimension tricks.
  • Conceptual shift: moves the research focus from “does this architecture learn?” to “what inductive biases, symmetries, and optimization properties does it bring?”

Methodology

  1. O‑minimal structures – a framework from model theory that captures “well‑behaved” sets and functions (e.g., semi‑algebraic, sub‑analytic). The authors show that most layers used in practice belong to such structures.
  2. Agnostic PAC analysis – they work in the most general learning setting where no assumption is made about the data‑generating distribution.
  3. Finite‑sample complexity bound – by leveraging the tame geometry of o‑minimal definable functions, they derive a uniform convergence guarantee that yields a sample‑size bound depending only on the network’s architecture (depth, width, number of definable operations) and not on the magnitude of its parameters.
  4. Layer‑wise composition – the proof proceeds by showing that the composition of o‑minimal definable layers remains o‑minimal, preserving the learnability property throughout the network.

Results & Findings

  • Finite sample complexity: For any fixed architecture, there exists a polynomial‑in‑(1/\epsilon) and (\log(1/\delta)) bound on the number of training examples needed to achieve error (\epsilon) with confidence (1-\delta).
  • Parameter‑agnostic: The bound holds even when weights are allowed to be arbitrarily large, eliminating the need for norm‑based regularization in the theoretical guarantee.
  • Broad applicability: The theorem subsumes existing VC‑dimension results for specific activations (e.g., ReLU) and extends them to a much larger class of functions, including smooth activations, rational functions, and many custom layers used in industry.

Practical Implications

  • Design freedom: Engineers can experiment with exotic activation functions or normalization schemes without fearing a loss of theoretical learnability, as long as the operations stay within an o‑minimal structure.
  • Benchmarking shift: Since learnability is now a given for any reasonable feedforward design, performance comparisons can focus on inductive bias (e.g., equivariance in GNNs), scalability, and optimization dynamics rather than on “does this architecture even learn?”.
  • Safety & verification: The tame‑geometry perspective aligns with formal verification tools that already rely on semi‑algebraic reasoning, potentially easing the integration of provable guarantees into production pipelines.
  • Curriculum for AutoML: Automated architecture search can safely explore a larger search space (any o‑minimal layer) without needing to embed separate learnability checks.

Limitations & Future Work

  • Fixed architecture size: The guarantee applies only to networks with a pre‑specified number of layers and neurons; it does not address architectures that grow dynamically (e.g., neural architecture search that adds layers on the fly).
  • Sequence length restriction for transformers: The result holds for transformers with a bounded input length; extending to unbounded or variable‑length sequences remains open.
  • Optimization not covered: While sample complexity is finite, the theorem does not guarantee that stochastic gradient descent (or any practical optimizer) will find a good solution within reasonable time.
  • Beyond feedforward: Recurrent networks, continuous‑time models, and other non‑feedforward structures fall outside the current o‑minimal framework and are slated for future investigation.

Bottom line: For developers building modern deep‑learning systems, this work assures that as long as you stay within the “tame” toolbox of layers and keep the network size fixed, you have a solid PAC‑learning foundation. The real battle now shifts to how you shape the network’s inductive biases and how you train it efficiently.

Authors

  • Anastasis Kratsios
  • Gregory Cousins
  • Haitz Sáez de Ocáriz Borde
  • Bum Jun Kim
  • Simone Brugiapaglia

Paper Information

  • arXiv ID: 2605.07097v1
  • Categories: stat.ML, cs.LG, cs.NE, math.LO, math.ST
  • Published: May 8, 2026
  • PDF: Download PDF
0 views
Back to Blog

Related posts

Read more »

[Paper] Normalizing Trajectory Models

Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coar...