[Paper] Multiple Additive Neural Networks for Structured and Unstructured Data
Source: arXiv - 2604.26888v1
Overview
The paper introduces Multiple Additive Neural Networks (MANN), a fresh take on the classic gradient-boosting paradigm that swaps out decision-tree base learners for very shallow neural nets. By doing so, the authors can boost not only tabular (structured) data but also unstructured signals such as images, audio, and video, while keeping training stable and less sensitive to hyper-parameter choices.
Key Contributions
- Boosting with shallow neural nets: Replaces tree‑based base learners with tiny CNNs or capsule nets, preserving the additive boosting logic.
- Unified framework for structured & unstructured data: Demonstrates that the same MANN pipeline can ingest tabular features and raw pixel/audio streams.
- Continuous‑learning architecture: Introduces a “partial‑fit” mechanism that lets new learners be added without retraining the whole ensemble.
- Built‑in over‑fitting safeguards: Combines early‑stopping heuristics, dropout‑style regularisation, and adaptive learning‑rate scaling inside the boosting loop.
- Empirical superiority: Shows consistent accuracy gains over XGBoost, LightGBM, and standard deep‑learning baselines on several public benchmarks.
Methodology
- Base Learner Design – Each weak learner is a shallow neural network (typically 2–3 convolutional layers for images or a capsule encoder for tabular data). The networks are deliberately lightweight to keep training fast and to prevent any single learner from dominating the ensemble.
- Additive Boosting Loop – MANN follows the classic gradient-boosting recipe (a minimal code sketch follows this list):
- Compute the residual (negative gradient) of the current ensemble’s predictions.
- Train a new shallow net to fit these residuals.
- Add the trained net to the ensemble with a small step‑size (learning rate).
- Continuous Learning Extension – Instead of a fixed number of rounds, MANN can keep appending learners on the fly as new data arrives, making it suitable for streaming or online scenarios (the `partial_fit` method in the sketch below illustrates this).
- Regularisation & Heuristics –
- Adaptive learning‑rate decay based on validation loss trends.
- Dropout/BatchNorm inside each shallow net to curb over‑fitting.
- Early‑stop per learner when residual improvement stalls.
- Structured‑Data Path – For tabular inputs, a capsule network extracts high‑level feature vectors that are then fed to the boosting layer, preserving the relational information that trees usually capture.
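
To make the boosting loop concrete, here is a minimal sketch of the additive recipe with shallow neural-net base learners, written in plain NumPy against a squared loss. The class and method names (`ShallowNet`, `MANNRegressor`, `partial_fit`), the one-hidden-layer architecture, and the simple residual-stall early stop are illustrative assumptions for this summary, not the authors' reference implementation.

```python
# Minimal sketch of MANN-style boosting: shallow nets fit to residuals, added
# with a small step size. Names and architecture are illustrative assumptions.
import numpy as np


class ShallowNet:
    """One-hidden-layer MLP trained with full-batch gradient descent."""

    def __init__(self, n_in, n_hidden=16, lr=0.05, epochs=200, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 1 / np.sqrt(n_in), (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 1 / np.sqrt(n_hidden), (n_hidden, 1))
        self.b2 = np.zeros(1)
        self.lr, self.epochs = lr, epochs

    def predict(self, X):
        self.h = np.tanh(X @ self.W1 + self.b1)        # cache hidden activations
        return (self.h @ self.W2 + self.b2).ravel()

    def fit(self, X, residual):
        for _ in range(self.epochs):
            err = (self.predict(X) - residual)[:, None] / len(X)   # dMSE/dpred
            dW2, db2 = self.h.T @ err, err.sum(0)
            dh = err @ self.W2.T * (1 - self.h ** 2)                # backprop through tanh
            dW1, db1 = X.T @ dh, dh.sum(0)
            self.W2 -= self.lr * dW2; self.b2 -= self.lr * db2
            self.W1 -= self.lr * dW1; self.b1 -= self.lr * db1
        return self


class MANNRegressor:
    """Additive ensemble: each round fits a shallow net to the current residuals."""

    def __init__(self, n_rounds=20, shrinkage=0.1, tol=1e-4):
        self.n_rounds, self.shrinkage, self.tol = n_rounds, shrinkage, tol
        self.learners, self.init_ = [], 0.0

    def predict(self, X):
        pred = np.full(len(X), self.init_)
        for net in self.learners:
            pred += self.shrinkage * net.predict(X)    # small step size per learner
        return pred

    def _boost_once(self, X, y):
        residual = y - self.predict(X)                 # negative gradient of MSE
        net = ShallowNet(X.shape[1], seed=len(self.learners)).fit(X, residual)
        self.learners.append(net)
        return np.mean(residual ** 2)

    def fit(self, X, y):
        self.init_ = y.mean()                          # constant initial model
        prev = np.inf
        for _ in range(self.n_rounds):
            mse = self._boost_once(X, y)
            if prev - mse < self.tol:                  # stop when residual improvement stalls
                break
            prev = mse
        return self

    def partial_fit(self, X, y, n_new=1):
        """Continuous learning: append learners for freshly arrived data."""
        for _ in range(n_new):
            self._boost_once(X, y)
        return self


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 5))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=500)
    model = MANNRegressor(n_rounds=30).fit(X, y)
    print("train MSE:", np.mean((model.predict(X) - y) ** 2))
```

The `shrinkage` parameter plays the same role as the boosting learning rate in tree-based gradient boosting, and `partial_fit` mirrors the continuous-learning extension described above: new learners are appended against the residuals of the existing ensemble without retraining it.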
Results & Findings
| Dataset (type) | Baseline | MANN | Absolute Δ | Remarks |
|---|---|---|---|---|
| Adult (tabular) | 86.3 % | 88.1 % | +1.8 % | Faster convergence, fewer trees needed |
| CIFAR‑10 (image) | 93.2 % (ResNet‑18) | 94.0 % | +0.8 % | Same compute budget, shallower nets |
| Speech Commands (audio) | 95.1 % (CNN) | 96.3 % | +1.2 % | Boosted residual learning improves noisy classes |
| Higgs (large tabular) | 71.5 % (XGB) | 73.2 % | +1.7 % | Demonstrates scalability to millions of rows |
- Training time: Because each learner is tiny, the total wall‑clock time is comparable to a single deep net of similar capacity, yet the ensemble yields higher accuracy.
- Robustness to hyper‑parameters: Experiments varying the learning rate (0.01–0.3) and the number of boosting rounds (10–200) showed only modest performance swings, unlike XGBoost, whose accuracy can degrade sharply; a sketch of such a sweep follows.
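
The robustness claim can be checked with a simple grid sweep. The sketch below reuses the illustrative `MANNRegressor` class from the methodology section and records test MSE for each setting; the grid values mirror the ranges quoted above, and the function name is hypothetical.

```python
# A small sketch of the hyper-parameter robustness sweep described above,
# reusing the illustrative MANNRegressor class from the methodology sketch.
from itertools import product

import numpy as np


def robustness_sweep(X_tr, y_tr, X_te, y_te,
                     shrinkages=(0.01, 0.1, 0.3), rounds=(10, 50, 200)):
    """Train one ensemble per (shrinkage, rounds) pair and record test MSE."""
    scores = {}
    for s, r in product(shrinkages, rounds):
        model = MANNRegressor(n_rounds=r, shrinkage=s).fit(X_tr, y_tr)
        scores[(s, r)] = np.mean((model.predict(X_te) - y_te) ** 2)
    return scores
```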
Practical Implications
- Unified pipelines: Data engineers can now use a single MANN‑based library for both classic tabular features (e.g., fraud detection) and raw sensor streams (e.g., video surveillance) without swapping toolkits.
- Online learning: The continuous‑learning mode fits real‑time use‑cases such as recommendation engines that must adapt to fresh click‑stream data on the fly.
- Reduced hyper‑parameter tuning burden: Teams can ship models faster because MANN tolerates a broader range of learning‑rate and iteration settings, cutting down on costly grid‑search cycles.
- Edge deployment: Shallow learners have a small memory footprint, making it feasible to ship an entire boosted ensemble to edge devices (IoT gateways, mobile phones) while still benefiting from the ensemble’s accuracy.
- Interpretability boost: Since each learner is a simple net, feature‑importance techniques (e.g., integrated gradients) can be applied per‑learner, offering a more granular view than a monolithic deep model.
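
As a concrete illustration of that last point, the sketch below computes integrated-gradients attributions separately for each boosted learner. It assumes the `ShallowNet`/`MANNRegressor` classes from the earlier sketch; the helper names and the zero baseline are illustrative choices, not part of the paper.

```python
# Per-learner integrated gradients for the illustrative shallow-net ensemble.
import numpy as np


def input_gradient(net, x):
    """Gradient of one ShallowNet's scalar output w.r.t. a single input vector x."""
    h = np.tanh(x @ net.W1 + net.b1)
    return net.W1 @ ((1.0 - h ** 2) * net.W2.ravel())


def integrated_gradients(net, x, baseline=None, steps=50):
    """Integrated gradients along the straight-line path from baseline to x."""
    baseline = np.zeros_like(x) if baseline is None else baseline
    alphas = np.linspace(0.0, 1.0, steps)
    grads = np.stack([input_gradient(net, baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)


def per_learner_attributions(model, x):
    """One attribution vector per boosted learner, scaled by the ensemble shrinkage."""
    return [model.shrinkage * integrated_gradients(net, x) for net in model.learners]
```

Stacking these per-learner vectors shows which features each round of boosting relied on, which is finer-grained than a single attribution computed over the whole ensemble.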
Limitations & Future Work
- Depth vs. expressiveness trade‑off: While shallow nets keep training cheap, they may struggle on extremely complex vision tasks where deep residual networks still dominate.
- Memory growth in long‑running streams: Continuously adding learners can eventually hit memory limits; the authors suggest periodic pruning or knowledge‑distillation as mitigation.
- Limited ablation on capsule networks: The paper shows promising results but does not fully isolate the contribution of capsule encoders versus plain CNNs for structured data.
- Future directions include exploring dynamic learner‑size adaptation, integrating transformer‑style encoders as base learners, and formalising theoretical guarantees on convergence and generalisation.
Authors
- Janis Mohr
- Jörg Frochte
Paper Information
- arXiv ID: 2604.26888v1
- Categories: cs.LG
- Published: April 29, 2026