[Paper] Necessary and sufficient conditions for universality of Kolmogorov-Arnold networks
Source: arXiv - 2604.23765v1
Overview
The paper investigates when Kolmogorov‑Arnold Networks (KANs) can approximate any continuous function on a compact domain. By focusing on the edge functions (the scalar functions applied along each network edge), the author pinpoints the minimal non‑linear ingredient needed for universal approximation. The results tighten the theoretical guarantees for KANs and give developers concrete guidance on how to build expressive, yet lightweight, KAN‑based models.
Key Contributions
- Single non‑affine edge suffices for deep KANs – the class of deep KANs whose edge functions are either affine or a single fixed continuous non‑affine function σ is dense in C(K) (a minimal sketch follows this list).
- Two‑layer KANs need a non‑polynomial σ – for networks with exactly two hidden layers, universality holds iff σ is not a polynomial.
- Finite affine set replaces the whole affine family – a small, fixed collection of affine functions (as few as five) together with σ already guarantees universality, regardless of depth.
- Constructive affine families – for every non‑affine σ there exists a concrete finite set A_σ of affine maps such that KANs using only functions from A_σ ∪ {σ} remain universal.
- Spline‑based edge parameterization is universal – the recent spline‑edge formulation (Liu et al., 2024) retains universal approximation even when spline degree and knot locations are fixed beforehand.
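To make the first contribution concrete, here is a minimal sketch (assuming NumPy) of a KAN layer in which every edge carries its own univariate function: all edges are affine, and a single designated edge in the whole network applies a non‑affine σ (ReLU here). The class name KANLayer, the widths, and the placement of the σ edge are illustrative choices, not the paper's construction; the theorem concerns density of the architecture class, not any fixed instance.

```python
import numpy as np

def relu(x):
    """One continuous, non-affine choice for sigma."""
    return np.maximum(x, 0.0)

class KANLayer:
    """Minimal KAN layer: each edge (i, j) applies its own univariate
    function to input i, and output node j sums its incoming edges.
    Every edge is affine, t -> w*t + b, except an optional single
    edge that applies sigma instead."""

    def __init__(self, n_in, n_out, sigma_edge=None, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=(n_in, n_out))  # per-edge slopes
        self.b = rng.normal(size=(n_in, n_out))  # per-edge offsets
        self.sigma_edge = sigma_edge             # (i, j) or None

    def __call__(self, x):
        # edge_out[k, i, j] = value of edge (i, j) on sample k
        edge_out = self.w * x[:, :, None] + self.b
        if self.sigma_edge is not None:
            i, j = self.sigma_edge
            edge_out[:, i, j] = relu(x[:, i])    # the lone sigma edge
        return edge_out.sum(axis=1)              # node j sums its in-edges

# Three stacked layers; only one edge in the entire network is non-affine.
layers = [KANLayer(2, 8, sigma_edge=(0, 0)),  # the single sigma edge
          KANLayer(8, 8, seed=1),             # purely affine
          KANLayer(8, 1, seed=2)]             # purely affine
y = np.random.default_rng(3).normal(size=(4, 2))
for layer in layers:
    y = layer(y)
print(y.shape)  # -> (4, 1)
```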
Methodology
- Theoretical framework – The analysis builds on the classical Kolmogorov superposition theorem, which states that every continuous multivariate function can be written exactly as a finite superposition of continuous univariate functions and addition (see the formula displayed after this list).
- Edge‑function classification – Edge functions are split into two categories: affine (linear + bias) and a single “special” continuous function σ.
- Constructive approximation proofs
  - For deep networks, the author shows how to embed the Kolmogorov representation using only σ for the non‑linear part, while the affine edges handle the required linear combinations.
  - For two‑layer networks, a polynomial σ fails to generate the needed richness: with the architecture fixed, sums and compositions of affine maps and a degree‑d polynomial yield only polynomials of bounded degree, which are never dense in C(K). A non‑polynomial σ, by contrast, can reproduce the required basis functions.
- Finite affine families – By exploiting linear independence of a small set of affine maps, the paper proves that any additional affine function can be expressed as a linear combination of a fixed finite basis, preserving universality.
- Spline edge analysis – The spline parameterization is treated as a specific σ (a piecewise‑polynomial). The author shows that, even with a predetermined knot sequence and degree, the spline family satisfies the non‑polynomial condition, thus guaranteeing universality.
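For reference, a standard statement of the superposition theorem invoked in the first item above is the following (the paper may work with a refined variant):

```latex
% Kolmogorov superposition theorem: every continuous
% f : [0,1]^n -> R has an exact representation using
% 2n+1 outer and n(2n+1) inner continuous univariate functions.
f(x_1,\dots,x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\Bigl(\sum_{p=1}^{n} \phi_{q,p}(x_p)\Bigr)
```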
Results & Findings
| Setting | Condition on σ | Universality? | Remarks |
|---|---|---|---|
| Deep KAN (≥ 3 hidden layers) | σ non‑affine (any continuous function that is not affine) | Yes | One non‑affine edge is enough; all other edges may be purely affine. |
| Two‑layer KAN | σ non‑polynomial (e.g., ReLU, tanh, spline) | Yes | Polynomial σ (including quadratic) cannot achieve density. |
| Affine edge set | Replace the full affine family by a fixed finite set (as few as five maps) | Yes (with any suitable σ) | Reduces the design space without sacrificing expressive power. |
| Spline‑edge KAN | Fixed degree & knot sequence, σ = spline basis | Yes | Confirms Liu et al.’s empirical success on theoretical grounds. |
In plain terms, the paper proves that adding just one well‑chosen non‑linear scalar function to the edge repertoire unlocks the full approximation power of KANs, and that the rest of the network can be built from a tiny, predetermined toolbox of linear maps.
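As a toy illustration of that toolbox view, the sketch below (assuming NumPy) builds node outputs from five fixed affine maps plus one σ. The particular maps are placeholders, not the paper's constructed family A_σ.

```python
import numpy as np

# Hypothetical fixed affine toolbox: the paper shows some finite
# family like this (as few as five maps), together with one
# non-affine sigma, already suffices; these maps are illustrative.
AFFINE_TOOLBOX = [
    lambda t: t,              # identity
    lambda t: t + 1.0,        # shift
    lambda t: 2.0 * t,        # dilation
    lambda t: -t,             # reflection
    lambda t: 0.5 * t - 1.0,  # contraction plus shift
]

def sigma(t):
    return np.tanh(t)  # any continuous non-affine choice works

def node(inputs, edge_fns):
    # A KAN node sums its edge functions applied to its inputs.
    return sum(f(x) for f, x in zip(edge_fns, inputs))

x1, x2 = 0.3, -0.7
y = node([x1, x2], [AFFINE_TOOLBOX[2], sigma])
print(y)  # 2*x1 + tanh(x2)
```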
Practical Implications
- Simplified architecture design – When building a KAN, you don’t need a zoo of activation functions. Choose any continuous non‑affine σ (e.g., ReLU, sigmoid, a spline) and pair it with a handful of fixed affine transforms; the network will be universal.
- Parameter‑budget friendly models – Because only a finite set of affine maps is required, you can pre‑compute or hard‑code these linear transformations, reducing runtime overhead and memory footprint.
- Guidance for activation selection – For shallow (two‑layer) KANs, avoid polynomial activations (e.g., pure quadratic) and opt for non‑polynomial ones. This explains why ReLU‑based KANs work well in practice.
- Confidence in spline‑based KANs – Developers using the spline edge parameterization can now rely on a solid universal approximation guarantee even if they fix the spline degree and knot positions, simplifying hyper‑parameter tuning (see the sketch after this list).
- Potential for hardware acceleration – Since the affine part can be limited to a small, known set, custom ASIC/FPGA implementations can cache these linear maps, leaving only the evaluation of a single σ to be accelerated (often already supported for ReLU or piecewise‑linear splines).
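To make the spline guarantee tangible, here is a minimal sketch (assuming SciPy) of one spline edge function with the degree and knot sequence fixed up front, so that only the coefficients would be trained. The helper make_spline_edge and the clamped uniform knot layout are illustrative choices, not the exact parameterization of Liu et al.

```python
import numpy as np
from scipy.interpolate import BSpline

def make_spline_edge(coeffs, degree=3, x_min=-1.0, x_max=1.0):
    """Build a univariate spline edge with a fixed degree and a fixed,
    clamped uniform knot sequence; only `coeffs` (one weight per
    B-spline basis function) would be learned during training."""
    n_basis = len(coeffs)
    n_interior = n_basis - degree - 1
    # Clamped knot vector: boundary knots repeated degree+1 times.
    interior = np.linspace(x_min, x_max, n_interior + 2)[1:-1]
    knots = np.concatenate([[x_min] * (degree + 1),
                            interior,
                            [x_max] * (degree + 1)])
    return BSpline(knots, np.asarray(coeffs, dtype=float), degree)

# Eight trainable coefficients on a cubic spline with frozen knots.
edge = make_spline_edge(coeffs=np.random.default_rng(0).normal(size=8))
xs = np.linspace(-1.0, 1.0, 5)
print(edge(xs))  # the edge function evaluated at five points
```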
Limitations & Future Work
- Constructive constants are not quantified – The proofs guarantee existence of approximations but do not provide explicit bounds on network depth or width needed for a given error tolerance.
- Focus on continuous functions – The universality results apply to C(K) (continuous functions on compact sets). Extending the analysis to Lᵖ spaces or to functions with discontinuities (e.g., classification boundaries) remains open.
- Empirical validation – While the theoretical claims are solid, systematic experiments comparing shallow vs. deep KANs with different σ choices would help translate the results into practical design rules.
- Optimization landscape – The paper does not address how the restricted edge function set influences training dynamics, convergence speed, or susceptibility to local minima.
- Extension to stochastic or adaptive σ – Investigating whether a learnable σ (e.g., a small neural sub‑module) can further reduce the required affine set or improve sample efficiency is a promising direction.
Bottom line for developers: If you’re experimenting with Kolmogorov‑Arnold Networks, pick any non‑affine activation (ReLU, tanh, a spline, etc.), sprinkle in a handful of fixed linear maps, and you have a theoretically guaranteed universal approximator—no need for elaborate activation libraries or massive hidden‑layer designs. This insight can streamline model prototyping, reduce memory usage, and open the door to efficient hardware implementations.
Authors
- Vugar Ismailov
Paper Information
- arXiv ID: 2604.23765v1
- Categories: cs.LG, cs.NE, math.FA
- Published: April 26, 2026