[Paper] On the Infinite Width and Depth Limits of Predictive Coding Networks
Source: arXiv - 2602.07697v1
Overview
Predictive Coding Networks (PCNs) offer a biologically‑inspired alternative to back‑propagation (BP) by first settling neural activities to an energy minimum before updating weights. This paper investigates whether PCNs can scale to the same massive widths and depths that modern deep‑learning models enjoy, and whether their learning dynamics ultimately match those of BP.
Key Contributions
- Theoretical equivalence: Proves that, for linear residual architectures, the set of width‑ and depth‑stable parameterisations that make PCNs trainable is identical to the set for standard BP.
- Infinite‑width/depth analysis: Shows that when the network width vastly exceeds its depth, the PC energy at activity equilibrium converges to the usual BP loss, meaning PCNs compute the same gradients as BP in this regime.
- Unified view of prior work: Bridges earlier empirical tricks (BP‑inspired re‑parameterisations) and recent theoretical results under a single framework.
- Empirical validation on nonlinear nets: Demonstrates that the theoretical predictions hold for deep nonlinear PCNs, provided the activity dynamics reach equilibrium.
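As a quick numerical illustration of what width‑stable means here, the $1/\sqrt{n}$ weight scaling keeps the per‑unit activation scale O(1) as the width grows. A minimal NumPy sketch, using a plain linear chain rather than the paper's residual architecture; the widths and depth are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_rms(n, depth):
    """Root-mean-square activation after `depth` linear layers of width n,
    with weights drawn N(0, 1/n) -- i.e., the 1/sqrt(n) width scaling."""
    x = rng.normal(size=n)
    for _ in range(depth):
        W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
        x = W @ x
    return np.sqrt(np.mean(x ** 2))

# Per-unit activation scale stays O(1) across very different widths.
scales = {n: forward_rms(n, depth=4) for n in (64, 256, 1024)}
```

Without the $1/\sqrt{n}$ factor the same forward pass would blow up exponentially with depth, which is the kind of instability the admissible parameterisations rule out.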
Methodology
- Model choice: The authors start with linear residual networks because they are analytically tractable yet capture the essence of deep architectures.
- Parameterisation analysis: They examine how scaling the weights with width ($1/\sqrt{n}$) and depth ($1/L$) affects the stability of both the activity dynamics (energy minimisation) and the weight‑update dynamics.
- Infinite‑limit calculus: By letting the hidden dimension $n \to \infty$ while keeping the depth $L$ finite (or much smaller than $n$), they derive the limiting form of the PC energy and show it matches the BP loss.
- Extension to non‑linear nets: Using the same scaling rules, they train deep convolutional PCNs on standard vision benchmarks, monitoring whether the activity dynamics converge (i.e., the “equilibrium” condition).
- Comparative experiments: Gradient norms, training curves, and final test accuracies are compared between PCNs and BP‑trained counterparts.
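The inference phase described above can be sketched in a few lines of NumPy. This is a hedged toy, not the paper's exact setup: a plain (non‑residual) linear chain with $1/\sqrt{n}$ Gaussian initialisation, activities relaxed by simple gradient descent while the output is clamped to the target. Initialising the activities with a forward pass makes the initial energy equal the BP loss, which inference then decreases:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 128, 3                                  # width much larger than depth
Ws = [rng.normal(0.0, 1.0 / np.sqrt(n), (n, n)) for _ in range(depth)]
x0, y = rng.normal(size=n), rng.normal(size=n)     # toy input/target pair

def energy(Ws, xs):
    """PC energy: half the sum of squared local prediction errors."""
    return 0.5 * sum(np.sum((xs[l + 1] - Ws[l] @ xs[l]) ** 2)
                     for l in range(len(Ws)))

# Activities start at the forward pass; the output is then clamped to the target.
xs = [x0]
for W in Ws:
    xs.append(W @ xs[-1])
bp_loss = 0.5 * np.sum((xs[-1] - y) ** 2)          # BP loss of the forward pass
xs[-1] = y
e_init = energy(Ws, xs)                            # equals bp_loss by construction

# Inference: gradient descent on the energy w.r.t. the hidden activities only.
for _ in range(300):
    for l in range(1, depth):
        eps_here = xs[l] - Ws[l - 1] @ xs[l - 1]   # error in predicting layer l
        eps_next = xs[l + 1] - Ws[l] @ xs[l]       # error layer l passes upward
        xs[l] -= 0.1 * (eps_here - Ws[l].T @ eps_next)
e_eq = energy(Ws, xs)                              # energy at ~equilibrium
```

Here `e_eq` is the energy at (approximate) activity equilibrium; the paper's analysis concerns precisely how this equilibrium energy relates to the BP loss as width and depth grow.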
Results & Findings
- Stability region matches BP: The admissible scaling rules that keep training stable for PCNs are exactly those known for BP (e.g., He‑type initialization).
- Energy → loss convergence: In the wide‑over‑deep regime, the PC energy after activity equilibration becomes mathematically indistinguishable from the BP loss, implying identical gradient signals.
- Empirical parity: On CIFAR‑10/100 and ImageNet‑subset experiments, deep PCNs trained with the derived scaling achieve comparable accuracy and convergence speed to BP, as long as the iterative activity updates are run until convergence.
- Equilibrium matters: When the activity dynamics are stopped early (i.e., before reaching equilibrium), the PC weight updates increasingly deviate from the BP gradients, leading to slower or unstable training.
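The claimed gradient agreement can also be probed directly in a toy model: relax a small linear residual network to activity equilibrium, then compare the first‑layer PC weight gradient with the BP gradient of the squared loss. A hedged sketch; the 0.05 branch scale and all sizes are illustrative choices, not the paper's parameterisation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, depth = 128, 3
I = np.eye(n)
# Residual-style layers: identity plus a small random branch.
Ws = [I + 0.05 * rng.normal(0.0, 1.0 / np.sqrt(n), (n, n)) for _ in range(depth)]
x0, y = rng.normal(size=n), rng.normal(size=n)

# Forward pass, then clamp the output and relax hidden activities to equilibrium.
xs = [x0]
for W in Ws:
    xs.append(W @ xs[-1])
resid = xs[-1] - y                                 # BP residual f(x) - y
xs[-1] = y
for _ in range(500):
    for l in range(1, depth):
        eps_here = xs[l] - Ws[l - 1] @ xs[l - 1]
        eps_next = xs[l + 1] - Ws[l] @ xs[l]
        xs[l] -= 0.1 * (eps_here - Ws[l].T @ eps_next)

# First-layer gradients: PC (local prediction error) vs. BP (chain rule).
g_pc = -np.outer(xs[1] - Ws[0] @ x0, x0)           # dE/dW_1 at equilibrium
back = resid.copy()
for W in reversed(Ws[1:]):
    back = W.T @ back
g_bp = np.outer(back, x0)                          # d(0.5*||f(x)-y||^2)/dW_1

cosine = np.sum(g_pc * g_bp) / (np.linalg.norm(g_pc) * np.linalg.norm(g_bp))
```

In this well‑conditioned regime the cosine similarity between `g_pc` and `g_bp` comes out close to 1; stopping the relaxation loop after only a handful of sweeps degrades the alignment, matching the equilibrium caveat above.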
Practical Implications
- Scalable PCNs: Developers can now build PCNs that are as wide and deep as modern transformers or ResNets, using familiar initialization schemes.
- Hardware‑friendly training: Since PCNs separate activity inference (a fixed‑point iteration) from weight updates, they open the door to asynchronous or neuromorphic hardware where inference can be run continuously while learning proceeds more slowly.
- Energy‑based regularisation: The explicit energy function provides a natural way to incorporate additional constraints (e.g., sparsity, robustness) without redesigning the loss.
- Hybrid training pipelines: One could start training with BP for speed, then switch to PC inference‑only mode for continual learning or on‑device adaptation, leveraging the proven gradient equivalence.
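As a concrete instance of the regularisation point above, an extra penalty can be added directly to the energy that inference minimises, leaving the weight‑update rule untouched. A hedged NumPy sketch; the L1 sparsity penalty and its strength `lam` are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth, lam = 64, 3, 0.5
Ws = [rng.normal(0.0, 1.0 / np.sqrt(n), (n, n)) for _ in range(depth)]
x0, y = rng.normal(size=n), rng.normal(size=n)

def relax(lam):
    """Minimise PC energy + lam * L1(hidden activities) by (sub)gradient
    descent on the hidden activities, with the output clamped to the target."""
    xs = [x0]
    for W in Ws:
        xs.append(W @ xs[-1])
    xs[-1] = y
    for _ in range(300):
        for l in range(1, depth):
            eps_here = xs[l] - Ws[l - 1] @ xs[l - 1]
            eps_next = xs[l + 1] - Ws[l] @ xs[l]
            xs[l] -= 0.1 * (eps_here - Ws[l].T @ eps_next
                            + lam * np.sign(xs[l]))     # sparsity subgradient
    return xs

# Total L1 norm of the hidden activities, with and without the penalty.
l1_plain = sum(np.abs(x).sum() for x in relax(0.0)[1:-1])
l1_sparse = sum(np.abs(x).sum() for x in relax(lam)[1:-1])
```

The penalised relaxation settles on hidden activities with a noticeably smaller L1 norm, while the weight updates still follow the same local error signals.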
Limitations & Future Work
- Equilibrium requirement: The theoretical guarantees hinge on reaching a true activity equilibrium, which may be costly for very deep or recurrent structures.
- Non‑linear proof missing: The rigorous equivalence is shown only for linear residual nets; extending the proof to arbitrary nonlinearities remains an open challenge.
- Memory and compute overhead: Iterative activity updates add runtime and memory overhead compared to a single forward pass in BP.
- Future directions: The authors suggest exploring approximate equilibrium schemes (e.g., truncated iterations, learned solvers), extending the analysis to transformer‑style attention layers, and investigating how the energy formulation can be exploited for continual or meta‑learning scenarios.
Authors
- Francesco Innocenti
- El Mehdi Achour
- Rafal Bogacz
Paper Information
- arXiv ID: 2602.07697v1
- Categories: cs.LG, cs.AI, cs.NE
- Published: February 7, 2026