[Paper] Merging of Kolmogorov-Arnold networks trained on disjoint datasets

Published: December 21, 2025 at 06:41 PM EST
4 min read
Source: arXiv - 2512.18921v1

Overview

The paper “Merging of Kolmogorov‑Arnold networks trained on disjoint datasets” shows that Kolmogorov‑Arnold Networks (KANs) can be trained in parallel on separate data shards, then merged with a simple averaging step—while still preserving the speed‑up benefits of the Newton‑Kaczmarz optimizer and piecewise‑linear basis functions. This makes KANs a strong candidate for fast, privacy‑preserving federated learning and for scaling up training pipelines that need to crunch massive, distributed data.

Key Contributions

  • Demonstrated that KANs trained on disjoint subsets can be merged by naïve parameter averaging without loss of accuracy.
  • Identified the Newton‑Kaczmarz optimizer combined with piecewise‑linear basis functions as the currently fastest training recipe for KANs.
  • Provided empirical evidence that splitting the training set and training in parallel yields additional wall‑clock speed‑ups beyond what the optimizer alone offers.
  • Released a full open‑source codebase (training scripts, merging utilities, and benchmark notebooks) for reproducibility.

Methodology

  1. Model choice – Kolmogorov‑Arnold Networks:
    KANs are a recent class of neural‑style models that replace the dense weight matrices and fixed activations of ordinary layers with sums of learnable univariate functions (the “basis functions”), one applied to each input coordinate. Because every output is a plain sum of these functions, the parameters enter the model additively, which is why simple element‑wise averaging works when merging separately trained copies (a minimal layer sketch appears after this list).

  2. Optimization – Newton‑Kaczmarz:
    The authors adopt a hybrid Newton‑Kaczmarz scheme. The Kaczmarz part solves linear sub‑problems by iteratively projecting onto hyperplanes (think of a stochastic, row‑wise relative of gradient descent), while the Newton correction refines the solution using second‑order information, yielding dramatically faster convergence for the piecewise‑linear basis (a bare‑bones Kaczmarz sketch follows the list).

  3. Training on disjoint data:

    • The full training set is split into k non‑overlapping shards (either different datasets or random partitions).
    • Each shard is used to train an independent KAN instance with the Newton‑Kaczmarz optimizer.
    • After a fixed number of epochs (or once each shard reaches a local convergence criterion), the model parameters are averaged element‑wise to produce a global model (see the merge sketch after this list).
  4. Evaluation:
    Benchmarks are run on several public regression and classification tasks (e.g., UCI Energy, CIFAR‑10 with a flattened feature representation). The authors compare three baselines: (i) single‑node training with Adam, (ii) single‑node training with Newton‑Kaczmarz, and (iii) the proposed distributed‑training‑plus‑averaging pipeline.
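
To make step 1 concrete, here is a minimal sketch of a piecewise‑linear KAN‑style layer in plain NumPy. It is illustrative only, not the authors' implementation; the names piecewise_linear, kan_layer, grid and values are hypothetical. The point to notice is that, for a fixed knot grid, the layer output is linear in the coefficient array, which is the kind of additive structure the merging argument appeals to.

```python
import numpy as np

# Illustrative sketch (not the paper's code): a KAN-style "edge" is a learnable
# univariate function; here it is piecewise-linear on a fixed knot grid, so the
# output is linear in the coefficient array `values`.
def piecewise_linear(x, grid, values):
    """Evaluate the piecewise-linear function defined by (grid, values) at x."""
    return np.interp(x, grid, values)

def kan_layer(x, grid, values):
    """
    One KAN-style layer: output i is a sum of univariate functions applied to
    the individual inputs.
      x:      (n_in,)                input vector
      grid:   (n_knots,)             shared, increasing knot positions
      values: (n_out, n_in, n_knots) learnable coefficients
    """
    n_out, n_in, _ = values.shape
    out = np.zeros(n_out)
    for i in range(n_out):
        for j in range(n_in):
            out[i] += piecewise_linear(x[j], grid, values[i, j])
    return out
```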
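
Step 2 builds on the classical Kaczmarz method. The sketch below shows only that classical part, i.e. picking a row at random and projecting the iterate onto the corresponding hyperplane; the second‑order Newton correction used in the paper is not reproduced here, and kaczmarz, n_steps and seed are hypothetical names.

```python
import numpy as np

# Classical (randomized) Kaczmarz iteration for a linear system A x = b:
# each step orthogonally projects the current iterate onto the hyperplane
# defined by one randomly chosen row.  Rows are assumed to be nonzero.
def kaczmarz(A, b, n_steps=10_000, x0=None, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n) if x0 is None else x0.astype(float)
    row_norms_sq = np.einsum("ij,ij->i", A, A)    # ||a_i||^2 for every row
    for _ in range(n_steps):
        i = rng.integers(m)                       # pick a random row
        residual = b[i] - A[i] @ x                # signed gap to that hyperplane
        x += (residual / row_norms_sq[i]) * A[i]  # project onto the hyperplane
    return x
```

For a consistent system, x = kaczmarz(A, b) converges toward a solution, and each step touches only a single row of A, which is what keeps the per‑iteration cost low.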
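
Step 3 is where the parallelism comes from: every shard is trained completely independently, and the merge is a single element‑wise mean. A skeleton of that pipeline is sketched below, under the same assumptions as the layer sketch; train_kan_on_shard is a hypothetical stand‑in for one full Newton‑Kaczmarz training run that returns a parameter array.

```python
import numpy as np

# Skeleton of the split / train / average pipeline (illustrative only).
def train_and_merge(X, y, k, train_kan_on_shard, seed=0):
    rng = np.random.default_rng(seed)
    shards = np.array_split(rng.permutation(len(X)), k)    # k disjoint index sets
    shard_params = [train_kan_on_shard(X[idx], y[idx])     # independent training
                    for idx in shards]
    # Element-wise average of identically shaped parameter arrays -> global model
    return np.mean(np.stack(shard_params, axis=0), axis=0)
```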

Results & Findings

| Setting | Test Accuracy / RMSE | Wall‑clock Time (relative) |
| --- | --- | --- |
| Adam (single node) | 92.1 % / 0.34 | 1.0× |
| Newton‑Kaczmarz (single node) | 92.4 % / 0.32 | 0.58× |
| 4‑shard training + averaging (Newton‑Kaczmarz) | 92.3 % / 0.33 | 0.31× |

  • Accuracy stays within 0.1 % of the best single‑node baseline, confirming that averaging does not degrade performance.
  • Splitting the data yields additional wall‑clock gains beyond the optimizer alone: the 4‑shard pipeline above runs at 0.31× the Adam baseline, roughly half the time of a single Newton‑Kaczmarz run, consistent with the near‑linear speed‑up expected when shards are trained fully in parallel.
  • The method also shows robustness to heterogeneous data distributions: even when shards are drawn from different domains (e.g., sensor data vs. image features), the merged model still converges to a comparable optimum.

Practical Implications

  • Federated learning made easy: Companies can deploy KAN‑based clients on edge devices, train locally on private data, and simply average the resulting parameters on a central server—no complex secure aggregation protocols needed.
  • Accelerated model development: Data‑engineering pipelines that split massive logs across compute nodes can now train KANs in parallel without rewriting the training loop; the only extra step is a final torch.mean‑style merge (a short sketch follows this list).
  • Resource‑constrained environments: Because the Newton‑Kaczmarz optimizer converges in far fewer epochs than Adam, developers can reduce GPU/TPU usage and lower cloud costs.
  • Rapid prototyping for tabular and piecewise‑linear problems: KANs excel on regression tasks with sharp regime changes (e.g., finance, IoT sensor calibration). The presented approach lets teams iterate faster by leveraging existing distributed compute clusters.
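
As a rough illustration of that final merge, here is a short sketch (not the authors' code) that averages the state_dicts of independently trained copies of the same architecture; it assumes every entry is a floating‑point parameter tensor.

```python
import torch

# "torch.mean-style" merge: average identically shaped parameter tensors
# across independently trained copies of the same model.
def average_state_dicts(models):
    state_dicts = [m.state_dict() for m in models]
    return {key: torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
            for key in state_dicts[0]}

# Usage (hypothetical): merged.load_state_dict(average_state_dicts(trained_models))
```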

Limitations & Future Work

  • Model class restriction: The averaging property hinges on the additive nature of KANs; it does not directly transfer to conventional deep CNNs or transformers.
  • Scalability of the Newton‑Kaczmarz step: While fast for modest‑size KANs, the per‑iteration cost grows with the number of basis functions, potentially limiting very large‑scale deployments.
  • Heterogeneity handling: The paper’s experiments use relatively balanced shard sizes; future work could explore weighted averaging or adaptive learning rates when shards differ dramatically in size or label distribution.
  • Privacy guarantees: Simple averaging does not provide formal differential‑privacy protection. Integrating noise‑addition mechanisms or secure multi‑party computation would be a natural next step for truly privacy‑preserving federated learning.

If you’re curious to try it yourself, the authors have published a ready‑to‑run Docker image and a set of Jupyter notebooks that walk you through data splitting, training with Newton‑Kaczmarz, and model merging.

Authors

  • Andrew Polar
  • Michael Poluektov

Paper Information

  • arXiv ID: 2512.18921v1
  • Categories: cs.LG
  • Published: December 21, 2025