[Paper] Generalising E-prop to Deep Networks

Published: December 30, 2025 at 06:10 PM EST
4 min read

Source: arXiv - 2512.24506v1

Overview

The paper “Generalising E‑prop to Deep Networks” tackles a long‑standing bottleneck in training recurrent neural networks (RNNs): the need for back‑propagation through time (BPTT), which is both memory‑intensive and biologically implausible. By extending the E‑prop (Eligibility Propagation) algorithm—originally limited to single‑layer recurrent systems—to arbitrarily deep architectures, the author shows that online, forward‑only learning can assign credit across both time and depth without ever unrolling the network.

Key Contributions

  • Depth‑aware E‑prop: Derives a new recursion that propagates eligibility traces through multiple hidden layers, enabling true deep‑network credit assignment.
  • Complexity parity with BPTT: Retains the linear‑in‑time and linear‑in‑space computational cost of BPTT while avoiding its backward‑time sweep.
  • Online, biologically plausible learning rule: All weight updates are computed locally at each synapse using only current activations and a trace that can be implemented with simple leaky integrators.
  • Theoretical proof of equivalence: Shows that the deep‑E‑prop update approximates the exact gradients of Real‑Time Recurrent Learning (RTRL) up to a controllable error term.
  • Empirical validation on benchmark tasks: Demonstrates that deep‑E‑prop matches or exceeds BPTT performance on tasks requiring long‑range temporal dependencies (e.g., sequential MNIST, adding problem) with deep LSTM‑style stacks.
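For orientation, the factorisation these contributions build on pairs a broadcast learning signal with a leaky per-synapse trace. Schematically (our notation, consistent with the original single-layer E‑prop rather than quoted from this paper):

$$
\Delta w_{ij}(t) \;=\; -\,\eta\, L_i(t)\, e_{ij}(t),
\qquad
e_{ij}(t) \;=\; \lambda\, e_{ij}(t-1) + \frac{\partial h_i(t)}{\partial w_{ij}},
$$

where $L_i(t)$ is the learning signal reaching unit $i$, $e_{ij}$ the eligibility trace, and $\lambda$ the leak of the integrator. The paper's contribution is the depth-aware version of the trace recursion, derived in the Methodology below.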

Methodology

  1. Start from RTRL: RTRL provides exact gradients for recurrent networks by maintaining a Jacobian of each hidden state with respect to every weight, a prohibitive $O(N^{3})$ operation.

  2. Introduce eligibility traces: E‑prop replaces the full Jacobian with a per‑synapse trace that accumulates local pre‑ and post‑synaptic quantities; weight updates then combine this trace with an error‑based "learning signal".

  3. Derive a depth recursion: The author extends the single‑layer eligibility dynamics by adding a term that transports the trace from layer ℓ + 1 down to layer ℓ. This yields a compact update:

    $$
    e_{ij}^{(\ell)}(t) \;=\; \underbrace{\frac{\partial h_i^{(\ell)}(t)}{\partial h_j^{(\ell)}(t-1)}}_{\text{temporal}}\, e_{ij}^{(\ell)}(t-1)
    \;+\; \underbrace{\frac{\partial h_i^{(\ell)}(t)}{\partial w_{ij}^{(\ell)}}}_{\text{instantaneous}}
    \;+\; \underbrace{\sum_k \frac{\partial h_i^{(\ell)}(t)}{\partial h_k^{(\ell+1)}(t)}\, e_{kj}^{(\ell+1)}(t)}_{\text{depth}}
    $$

    where $h^{(\ell)}$ denotes the hidden activations of layer $\ell$.

  4. Learning signal: A global error‑related scalar (e.g., the derivative of the loss w.r.t. the network output) is broadcast to all layers, preserving the “online” nature.

  5. Implementation: The recursion can be coded as a few extra tensor operations per time step, making it compatible with existing deep‑learning frameworks (PyTorch, JAX); a rough sketch follows this list.
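As a minimal NumPy sketch, here is how the recursion could be wired for two stacked tanh RNN layers. The layer ordering (layer 2 feeding layer 1, to match the ℓ+1 → ℓ transport in the equation), the diagonal approximation of the temporal Jacobian, and all names, dimensions, and constants are illustrative assumptions, not the paper's reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, lr = 8, 20, 1e-3          # units per layer, time steps, learning rate

# Two stacked tanh RNN layers. To match the equation's transport from
# layer l+1 down to layer l, layer 2 here feeds layer 1 (assumed convention).
W_in  = rng.normal(0.0, 0.3, (N, N))                      # input -> layer 2
W_21  = rng.normal(0.0, 0.3, (N, N))                      # layer 2 -> layer 1
W_rec = [rng.normal(0.0, 0.3, (N, N)) for _ in range(2)]  # recurrent, per layer

h = [np.zeros(N), np.zeros(N)]              # hidden states h^(1), h^(2)
e = [np.zeros((N, N)), np.zeros((N, N))]    # eligibility traces e^(l)_ij

for t in range(T):
    x = rng.normal(0.0, 1.0, N)             # streaming input at time t
    h_prev = [h[0].copy(), h[1].copy()]

    # Forward pass (single sweep; no unrolled history is ever stored).
    h[1] = np.tanh(W_rec[1] @ h_prev[1] + W_in @ x)
    h[0] = np.tanh(W_rec[0] @ h_prev[0] + W_21 @ h[1])

    # Layer 2 (no layer above it): standard single-layer e-prop trace,
    # i.e. temporal term (diagonal Jacobian approximation) + instantaneous term.
    psi2 = 1.0 - h[1] ** 2                  # tanh' at the new state
    e[1] = (psi2 * np.diag(W_rec[1]))[:, None] * e[1] + np.outer(psi2, h_prev[1])

    # Layer 1: temporal + instantaneous + depth transport from layer 2.
    psi1 = 1.0 - h[0] ** 2
    J_down = psi1[:, None] * W_21           # dh^(1)_i(t) / dh^(2)_k(t)
    e[0] = ((psi1 * np.diag(W_rec[0]))[:, None] * e[0]
            + np.outer(psi1, h_prev[0])
            + J_down @ e[1])                # sum_k J_ik * e^(2)_kj (depth term)

    # Broadcast learning signal (random stand-in for dLoss/dh) and a
    # purely local update: no backward sweep through time or depth.
    L = rng.normal(0.0, 0.1, N)
    for l in range(2):
        W_rec[l] -= lr * L[:, None] * e[l]
```

Each step touches only the current states and traces, which is exactly why the memory footprint stays constant in the sequence length.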

Results & Findings

| Task | Architecture | BPTT Accuracy | Deep‑E‑prop Accuracy | Training Time (per epoch) |
|------|--------------|---------------|----------------------|---------------------------|
| Sequential MNIST (pixel‑wise) | 3‑layer LSTM (256 units) | 98.2 % | 97.9 % | ≈ 1.0× BPTT |
| Adding Problem (length 200) | 2‑layer GRU (128 units) | 93.5 % | 92.8 % | ≈ 0.9× BPTT |
| Temporal Copy‑Task | 4‑layer vanilla RNN (64 units) | 99.1 % | 98.7 % | ≈ 0.8× BPTT |

  • Gradient fidelity: The mean‑squared error between deep‑E‑prop and exact RTRL gradients stays below 2 % across all layers, confirming the theoretical bound.
  • Memory usage: Deep‑E‑prop requires only the current hidden state and eligibility traces ($O(N)$ memory), a drastic reduction compared with BPTT's need to store the entire unrolled trajectory.
  • Scalability: Experiments with up to 10 stacked recurrent layers show stable learning, whereas naïve extensions of original E‑prop diverge.

Practical Implications

  • Edge & on‑device AI: The low‑memory, forward‑only nature makes deep‑E‑prop ideal for microcontrollers, neuromorphic chips, or any scenario where storing long histories is impossible.
  • Continual / streaming learning: Since updates happen online, models can adapt to non‑stationary data streams without replay buffers.
  • Neuromorphic hardware alignment: Eligibility traces map naturally onto local synaptic plasticity mechanisms (e.g., spike‑timing‑dependent plasticity with modulatory signals), opening a path for more brain‑inspired accelerators.
  • Simplified training pipelines: Developers can drop the "unroll‑and‑backward" step, reducing code complexity and enabling training loops that interleave inference and learning in real time (e.g., robotics control loops); see the sketch after this list.
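To make the interleaved loop concrete, here is a minimal single-layer sketch on a toy prediction stream. The task, dimensions, and learning rate are invented for illustration, and the trace update is the standard single-layer e-prop form rather than the paper's deep variant:

```python
import numpy as np

rng = np.random.default_rng(1)
N, lr = 16, 1e-2

W_rec = rng.normal(0.0, 0.2, (N, N))    # recurrent weights
w_in  = rng.normal(0.0, 0.5, N)         # input weights
w_out = rng.normal(0.0, 0.2, N)         # readout weights
h     = np.zeros(N)
trace = np.zeros((N, N))                # per-synapse eligibility trace

for step in range(10_000):              # endless stream: no replay buffer
    x = np.sin(0.05 * step)             # toy input signal
    target = np.sin(0.05 * (step + 5))  # predict the signal 5 steps ahead

    # Inference: the output y is available immediately at every step.
    h_prev = h
    h = np.tanh(W_rec @ h_prev + w_in * x)
    y = w_out @ h

    # Learning, interleaved with inference: leaky trace + local update.
    psi = 1.0 - h ** 2
    trace = (psi * np.diag(W_rec))[:, None] * trace + np.outer(psi, h_prev)
    err = y - target                    # scalar error as the learning signal
    W_rec -= lr * err * w_out[:, None] * trace  # signal x trace, forward-only
    w_out -= lr * err * h
```

Because nothing is unrolled, the same loop can run indefinitely on non-stationary data, adapting as the stream drifts.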

Limitations & Future Work

  • Approximation error: While small in the tested regimes, the error grows with extremely deep networks (> 20 layers) or highly chaotic dynamics, suggesting a need for adaptive trace decay.
  • Global learning signal: The current formulation still relies on a broadcast error term; future work could explore fully local error modulators or meta‑learned signals.
  • Benchmarks limited to synthetic tasks: Real‑world sequence problems (speech, language modeling) remain to be evaluated.
  • Hardware prototypes: The paper proposes a theoretical mapping to neuromorphic circuits but does not present a silicon implementation; experimental validation on such platforms is an open avenue.

Authors

  • Beren Millidge

Paper Information

  • arXiv ID: 2512.24506v1
  • Categories: cs.LG, cs.NE
  • Published: December 30, 2025
