[Paper] Merging of Kolmogorov-Arnold networks trained on disjoint datasets

Published: December 21, 2025 at 06:41 PM EST
4 min read
Source: arXiv - 2512.18921v1

Overview

The paper “Merging of Kolmogorov‑Arnold networks trained on disjoint datasets” shows that Kolmogorov‑Arnold Networks (KANs) can be trained in parallel on separate data shards, then merged with a simple averaging step—while still preserving the speed‑up benefits of the Newton‑Kaczmarz optimizer and piecewise‑linear basis functions. This makes KANs a strong candidate for fast, privacy‑preserving federated learning and for scaling up training pipelines that need to crunch massive, distributed data.

Key Contributions

  • Demonstrated that KANs trained on disjoint subsets can be merged by naïve parameter averaging without loss of accuracy.
  • Identified the Newton‑Kaczmarz optimizer combined with piecewise‑linear basis functions as the currently fastest training recipe for KANs.
  • Provided empirical evidence that splitting the training set and training in parallel yields additional wall‑clock speed‑ups beyond what the optimizer alone offers.
  • Released a full open‑source codebase (training scripts, merging utilities, and benchmark notebooks) for reproducibility.

Methodology

  1. Model choice – Kolmogorov‑Arnold Networks:
    KANs are a recent class of neural‑style models that replace the dense weight matrices and fixed activations of ordinary layers with sums of learnable univariate functions (the “basis functions”), one applied to each input coordinate. Because every output is a plain sum of these functions, the parameters enter the model additively, which is why simple element‑wise averaging works when merging separately trained copies (a minimal layer sketch appears after this list).

  2. Optimization – Newton‑Kaczmarz:
    The authors adopt a hybrid Newton‑Kaczmarz scheme. The Kaczmarz part solves linear sub‑problems by iteratively projecting onto hyperplanes (think of a stochastic, row‑wise relative of gradient descent), while the Newton correction refines the solution using second‑order information, yielding dramatically faster convergence for the piecewise‑linear basis (a bare‑bones Kaczmarz sketch follows the list).

  3. Training on disjoint data:

    • The full training set is split into k non‑overlapping shards (either different datasets or random partitions).
    • Each shard is used to train an independent KAN instance with the Newton‑Kaczmarz optimizer.
    • After a fixed number of epochs (or once each shard reaches a local convergence criterion), the model parameters are averaged element‑wise to produce a global model (see the merge sketch after this list).
  4. Evaluation:
    Benchmarks are run on several public regression and classification tasks (e.g., UCI Energy, CIFAR‑10 with a flattened feature representation). The authors compare three baselines: (i) single‑node training with Adam, (ii) single‑node training with Newton‑Kaczmarz, and (iii) the proposed distributed‑training‑plus‑averaging pipeline.
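
To make step 1 concrete, here is a minimal sketch of a piecewise‑linear KAN‑style layer in plain NumPy. It is illustrative only, not the authors' implementation; the names piecewise_linear, kan_layer, grid and values are hypothetical. The point to notice is that, for a fixed knot grid, the layer output is linear in the coefficient array, which is the kind of additive structure the merging argument appeals to.

```python
import numpy as np

# Illustrative sketch (not the paper's code): a KAN-style "edge" is a learnable
# univariate function; here it is piecewise-linear on a fixed knot grid, so the
# output is linear in the coefficient array `values`.
def piecewise_linear(x, grid, values):
    """Evaluate the piecewise-linear function defined by (grid, values) at x."""
    return np.interp(x, grid, values)

def kan_layer(x, grid, values):
    """
    One KAN-style layer: output i is a sum of univariate functions applied to
    the individual inputs.
      x:      (n_in,)                input vector
      grid:   (n_knots,)             shared, increasing knot positions
      values: (n_out, n_in, n_knots) learnable coefficients
    """
    n_out, n_in, _ = values.shape
    out = np.zeros(n_out)
    for i in range(n_out):
        for j in range(n_in):
            out[i] += piecewise_linear(x[j], grid, values[i, j])
    return out
```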
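
Step 2 builds on the classical Kaczmarz method. The sketch below shows only that classical part, i.e. picking a row at random and projecting the iterate onto the corresponding hyperplane; the second‑order Newton correction used in the paper is not reproduced here, and kaczmarz, n_steps and seed are hypothetical names.

```python
import numpy as np

# Classical (randomized) Kaczmarz iteration for a linear system A x = b:
# each step orthogonally projects the current iterate onto the hyperplane
# defined by one randomly chosen row.  Rows are assumed to be nonzero.
def kaczmarz(A, b, n_steps=10_000, x0=None, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n) if x0 is None else x0.astype(float)
    row_norms_sq = np.einsum("ij,ij->i", A, A)    # ||a_i||^2 for every row
    for _ in range(n_steps):
        i = rng.integers(m)                       # pick a random row
        residual = b[i] - A[i] @ x                # signed gap to that hyperplane
        x += (residual / row_norms_sq[i]) * A[i]  # project onto the hyperplane
    return x
```

For a consistent system, x = kaczmarz(A, b) converges toward a solution, and each step touches only a single row of A, which is what keeps the per‑iteration cost low.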
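
Step 3 is where the parallelism comes from: every shard is trained completely independently, and the merge is a single element‑wise mean. A skeleton of that pipeline is sketched below, under the same assumptions as the layer sketch; train_kan_on_shard is a hypothetical stand‑in for one full Newton‑Kaczmarz training run that returns a parameter array.

```python
import numpy as np

# Skeleton of the split / train / average pipeline (illustrative only).
def train_and_merge(X, y, k, train_kan_on_shard, seed=0):
    rng = np.random.default_rng(seed)
    shards = np.array_split(rng.permutation(len(X)), k)    # k disjoint index sets
    shard_params = [train_kan_on_shard(X[idx], y[idx])     # independent training
                    for idx in shards]
    # Element-wise average of identically shaped parameter arrays -> global model
    return np.mean(np.stack(shard_params, axis=0), axis=0)
```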

Results & Findings

| Setting | Test Accuracy / RMSE | Wall‑clock Time (relative) |
| --- | --- | --- |
| Adam (single node) | 92.1 % / 0.34 | 1.0× |
| Newton‑Kaczmarz (single node) | 92.4 % / 0.32 | 0.58× |
| 4‑shard training + averaging (Newton‑Kaczmarz) | 92.3 % / 0.33 | 0.31× |

  • Accuracy stays within 0.1 % of the best single‑node baseline, confirming that averaging does not degrade performance.
  • Splitting the data yields additional wall‑clock gains beyond the optimizer alone: the 4‑shard pipeline above runs at 0.31× the Adam baseline, roughly half the time of a single Newton‑Kaczmarz run, consistent with the near‑linear speed‑up expected when shards are trained fully in parallel.
  • The method also shows robustness to heterogeneous data distributions: even when shards are drawn from different domains (e.g., sensor data vs. image features), the merged model still converges to a comparable optimum.

Practical Implications

  • Federated learning made easy: Companies can deploy KAN‑based clients on edge devices, train locally on private data, and simply average the resulting parameters on a central server—no complex secure aggregation protocols needed.
  • Accelerated model development: Data‑engineering pipelines that split massive logs across compute nodes can now train KANs in parallel without rewriting the training loop; the only extra step is a final torch.mean‑style merge (a short sketch follows this list).
  • Resource‑constrained environments: Because the Newton‑Kaczmarz optimizer converges in far fewer epochs than Adam, developers can reduce GPU/TPU usage and lower cloud costs.
  • Rapid prototyping for tabular and piecewise‑linear problems: KANs excel on regression tasks with sharp regime changes (e.g., finance, IoT sensor calibration). The presented approach lets teams iterate faster by leveraging existing distributed compute clusters.
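
As a rough illustration of that final merge, here is a short sketch (not the authors' code) that averages the state_dicts of independently trained copies of the same architecture; it assumes every entry is a floating‑point parameter tensor.

```python
import torch

# "torch.mean-style" merge: average identically shaped parameter tensors
# across independently trained copies of the same model.
def average_state_dicts(models):
    state_dicts = [m.state_dict() for m in models]
    return {key: torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
            for key in state_dicts[0]}

# Usage (hypothetical): merged.load_state_dict(average_state_dicts(trained_models))
```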

Limitations & Future Work

  • Model class restriction: The averaging property hinges on the additive nature of KANs; it does not directly transfer to conventional deep CNNs or transformers.
  • Scalability of the Newton‑Kaczmarz step: While fast for modest‑size KANs, the per‑iteration cost grows with the number of basis functions, potentially limiting very large‑scale deployments.
  • Heterogeneity handling: The paper’s experiments use relatively balanced shard sizes; future work could explore weighted averaging or adaptive learning rates when shards differ dramatically in size or label distribution.
  • Privacy guarantees: Simple averaging does not provide formal differential‑privacy protection. Integrating noise‑addition mechanisms or secure multi‑party computation would be a natural next step for truly privacy‑preserving federated learning.

If you’re curious to try it yourself, the authors have published a ready‑to‑run Docker image and a set of Jupyter notebooks that walk you through data splitting, training with Newton‑Kaczmarz, and model merging.

Authors

  • Andrew Polar
  • Michael Poluektov

Paper Information

  • arXiv ID: 2512.18921v1
  • Categories: cs.LG
  • Published: December 21, 2025