[Paper] Intelligent Neural Networks: From Layered Architectures to Graph-Organized Intelligence

Published: November 27, 2025 at 06:59 PM EST
4 min read
Source: arXiv - 2511.22813v1

Overview

Antoine Salomon’s paper proposes Intelligent Neural Networks (INN) – a new class of models where each artificial neuron is an autonomous, memory‑equipped unit that decides when to fire and where to send its output. Instead of the classic stacked‑layer design, INNs are organized as a fully‑connected graph, allowing dynamic, learned routing of information. On the Text8 character‑level language modeling benchmark, INNs beat a comparable Transformer and match a heavily tuned LSTM, while a parameter‑matched Mamba baseline fails to train, highlighting the stability benefits of the graph topology.

Key Contributions

  • Neuron‑centric abstraction: Introduces “Intelligent Neurons” that combine internal state dynamics with attention‑based communication.
  • Graph‑organized architecture: Replaces rigid hierarchical layers with a complete graph, enabling flexible, learned routing between neurons.
  • Training stability proof‑point: Shows that a graph‑based INN converges where a similarly sized stacked Mamba model diverges (>3.4 BPC), attributing stability to the graph topology.
  • Empirical performance: Achieves 1.705 BPC on Text8, surpassing a Transformer baseline (2.055 BPC) and matching a state‑of‑the‑art LSTM.
  • Ablation analysis: Demonstrates that removing inter‑neuron communication degrades accuracy or causes training collapse, confirming the necessity of learned routing.

Methodology

  1. Intelligent Neuron design – Each neuron maintains a hidden state (similar to an RNN cell) and contains two learned modules:
    • Activation gate decides whether the neuron should emit a signal at a given timestep.
    • Routing attention computes a soft‑max over all other neurons, producing a weighted message‑passing vector.
  2. Graph construction – All neurons are connected, forming a complete directed graph. The routing attention dynamically re‑weights edges at each step, so the effective computation graph changes over time (see the sketch after this list).
  3. Training loop – The model is trained end‑to‑end with standard cross‑entropy loss on the next‑character prediction task. Gradient flow passes through both the internal dynamics and the routing attention, allowing the network to discover efficient communication patterns.
  4. Baselines & controls – For a fair comparison, the authors match total parameter counts across INN, a Transformer, and a stacked Mamba configuration, and run identical optimization schedules (AdamW, cosine decay). Ablation experiments systematically disable the activation gate or the routing attention to isolate their contributions.
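
To make steps 1–3 concrete, here is a minimal, illustrative PyTorch sketch. It is not the authors' implementation: the class name IntelligentNeuronGraph, the shared‑weight GRU update, and all tensor shapes are assumptions chosen for readability.

```python
# Illustrative sketch only: per-neuron hidden state, an activation gate, and
# softmax routing attention over a complete graph of neurons. Names, shapes,
# and the shared-weight recurrent cell are assumptions, not the paper's exact
# parameterization.
import math
import torch
import torch.nn as nn


class IntelligentNeuronGraph(nn.Module):
    """A pool of N neurons, each with its own hidden state, that exchange
    messages through data-dependent (softmax) routing at every timestep."""

    def __init__(self, vocab_size: int, n_neurons: int = 64, d: int = 128):
        super().__init__()
        self.n, self.d = n_neurons, d
        self.embed = nn.Embedding(vocab_size, d)
        self.cell = nn.GRUCell(2 * d, d)       # input = [char embedding, routed message]
        self.q = nn.Linear(d, d, bias=False)   # routing attention: queries
        self.k = nn.Linear(d, d, bias=False)   # routing attention: keys
        self.v = nn.Linear(d, d, bias=False)   # messages sent along edges
        self.gate = nn.Linear(d, 1)            # activation gate: fire or stay silent
        self.readout = nn.Linear(d, vocab_size)

    def forward(self, chars: torch.Tensor):
        # chars: (batch, seq_len) integer character ids
        B, T = chars.shape
        h = chars.new_zeros(B, self.n, self.d, dtype=torch.float)  # neuron states
        logits = []
        for t in range(T):
            x = self.embed(chars[:, t]).unsqueeze(1).expand(B, self.n, self.d)
            # --- routing attention over the complete directed graph ---
            q, k, v = self.q(h), self.k(h), self.v(h)
            attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(self.d), dim=-1)
            routed = attn @ v                                  # (B, n, d) incoming messages
            # --- activation gate decides which neurons emit this step ---
            g = torch.sigmoid(self.gate(h))                    # (B, n, 1)
            routed = g * routed
            # --- per-neuron recurrent state update ---
            inp = torch.cat([x, routed], dim=-1).reshape(B * self.n, 2 * self.d)
            h = self.cell(inp, h.reshape(B * self.n, self.d)).reshape(B, self.n, self.d)
            # --- readout: pool gated neuron states to predict the next char ---
            logits.append(self.readout((g * h).mean(dim=1)))
        return torch.stack(logits, dim=1)                      # (B, T, vocab)
```

A correspondingly minimal training setup in the spirit of steps 3–4 (cross‑entropy on next‑character prediction, AdamW with cosine decay); the hyper‑parameter values below are placeholders, not the paper's.

```python
# Usage sketch, continuing from the IntelligentNeuronGraph class above.
import torch
import torch.nn.functional as F

model = IntelligentNeuronGraph(vocab_size=27)    # Text8: 'a'-'z' plus space
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10_000)

def train_step(batch: torch.Tensor) -> float:
    # batch: (B, T+1) character ids; predict each next character
    logits = model(batch[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           batch[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()
    return loss.item()
```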

Results & Findings

| Model | Params (M) | BPC (Text8) |
| --- | --- | --- |
| INN (proposed) | ≈ 30 | 1.705 |
| Transformer (matched) | ≈ 30 | 2.055 |
| Optimized LSTM | ≈ 30 | ≈ 1.70 |
| Stacked Mamba (matched) | ≈ 30 | > 3.4 (non‑convergent) |
  • Performance: INN matches the best LSTM results while beating the Transformer by ~0.35 BPC, a substantial gain on a character‑level benchmark (see the BPC note after these bullets).
  • Stability: The Mamba baseline collapses under the same training regime, suggesting that the graph‑based routing mitigates gradient explosion/vanishing that plagues deep sequential stacks.
  • Ablations: Removing the routing attention raises BPC to ~2.2; disabling the activation gate leads to divergence, confirming that both components are essential.
  • Interpretability hint: Visualizing the learned attention weights reveals clusters of neurons that specialize in particular character patterns (e.g., punctuation, common digrams), hinting at modular behavior.
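
For context on the metric: bits per character (BPC) is the model's average next‑character cross‑entropy expressed in base 2, so a loss reported in nats converts by dividing by ln 2. A quick sanity check (the 1.182‑nat figure below is simply the nats equivalent of the reported 1.705 BPC, not a separate number from the paper):

```python
import math

# BPC is the next-character cross-entropy re-expressed in base 2.
bpc = lambda loss_nats: loss_nats / math.log(2)
print(bpc(1.182))   # ≈ 1.705 BPC, matching the INN result above
```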

Practical Implications

  • Modular AI components: Developers can think of each neuron as a plug‑and‑play module with its own memory, making it easier to isolate, debug, or replace parts of a model.
  • Dynamic compute allocation: Because routing is data‑dependent, INNs can allocate more resources to “hard” inputs and fewer to easy ones, opening avenues for adaptive inference budgets.
  • Robustness to depth: The graph topology sidesteps many depth‑related training issues, potentially simplifying the design of very deep or wide models for tasks like long‑range language modeling, graph processing, or reinforcement learning.
  • Interpretability tools: The attention matrix over neurons offers a natural hook for visual analytics, enabling developers to trace which sub‑networks are active for a given input (a toy visualization sketch follows this list).
  • Hardware friendliness: Since communication is soft‑max‑weighted rather than hard‑wired, the architecture maps well to modern accelerators that support sparse or dynamic tensor operations (e.g., NVIDIA’s sparse attention kernels, Graphcore IPUs).
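
As a concrete but purely illustrative starting point, the snippet below plots one timestep of routing weights from the IntelligentNeuronGraph sketch in the Methodology section. It uses a random, untrained model and a random state snapshot, so the pattern it shows carries no meaning beyond demonstrating the hook.

```python
# Toy visualization of neuron-to-neuron routing weights, reusing the
# illustrative IntelligentNeuronGraph sketch above (not the authors' code).
import torch
import matplotlib.pyplot as plt

model = IntelligentNeuronGraph(vocab_size=27)
h = torch.randn(1, model.n, model.d)              # a random neuron-state snapshot
with torch.no_grad():
    q, k = model.q(h), model.k(h)
    attn = torch.softmax(q @ k.transpose(1, 2) / model.d ** 0.5, dim=-1)

plt.imshow(attn[0], cmap="viridis")               # rows: receivers, cols: senders
plt.xlabel("sender neuron")
plt.ylabel("receiver neuron")
plt.title("Routing attention (one timestep)")
plt.show()
```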

Limitations & Future Work

  • Scalability of full graphs: A complete graph scales quadratically with neuron count, which may become prohibitive for very large models; the paper suggests exploring sparsified routing or hierarchical graph partitions (a toy top‑k sketch follows this list).
  • Benchmark breadth: Evaluation is limited to a single character‑level language modeling task; broader tests (e.g., image classification, speech, RL) are needed to confirm generality.
  • Interpretability depth: While initial visualizations are promising, systematic methods for extracting human‑readable rules from the learned routing remain an open challenge.
  • Hardware optimization: Current implementations rely on dense matrix multiplications; future work could integrate custom kernels or hardware primitives to fully exploit the dynamic routing paradigm.
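
To illustrate what "sparsified routing" could look like, here is a toy top‑k pruning of a dense routing matrix. This is an assumption about one possible direction, not a method from the paper, and a real implementation would avoid materializing the dense n×n matrix in the first place.

```python
# One possible sparsification (an assumption, not the paper's method): keep
# only the top-k outgoing edges per neuron and renormalize, turning the O(n^2)
# complete graph into an O(n*k) one at inference time.
import torch

def topk_route(attn: torch.Tensor, k: int = 8) -> torch.Tensor:
    # attn: (batch, n, n) dense routing weights, rows summing to 1
    vals, idx = attn.topk(k, dim=-1)
    sparse = torch.zeros_like(attn).scatter_(-1, idx, vals)
    return sparse / sparse.sum(dim=-1, keepdim=True)   # renormalize rows
```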

Authors

  • Antoine Salomon

Paper Information

  • arXiv ID: 2511.22813v1
  • Categories: cs.LG, cs.CL, cs.NE
  • Published: November 27, 2025
  • PDF: https://arxiv.org/pdf/2511.22813v1