[Paper] Generalising E-prop to Deep Networks

Published: December 30, 2025 at 06:10 PM EST
4 min read

Source: arXiv - 2512.24506v1

Overview

The paper “Generalising E‑prop to Deep Networks” tackles a long‑standing bottleneck in training recurrent neural networks (RNNs): the need for back‑propagation through time (BPTT), which is both memory‑intensive and biologically implausible. By extending the E‑prop (Eligibility Propagation) algorithm—originally limited to single‑layer recurrent systems—to arbitrarily deep architectures, the author shows that online, forward‑only learning can assign credit across both time and depth without ever unrolling the network.

Key Contributions

  • Depth‑aware E‑prop: Derives a new recursion that propagates eligibility traces through multiple hidden layers, enabling true deep‑network credit assignment.
  • Complexity parity with BPTT: Retains the linear‑in‑time and linear‑in‑space computational cost of BPTT while avoiding its backward‑time sweep.
  • Online, biologically plausible learning rule: All weight updates are computed locally at each synapse using only current activations and a trace that can be implemented with simple leaky integrators.
  • Theoretical proof of equivalence: Shows that the deep‑E‑prop update approximates the exact gradients of Real‑Time Recurrent Learning (RTRL) up to a controllable error term.
  • Empirical validation on benchmark tasks: Demonstrates that deep‑E‑prop matches or exceeds BPTT performance on tasks requiring long‑range temporal dependencies (e.g., sequential MNIST, adding problem) with deep LSTM‑style stacks.
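For orientation, the factorisation these contributions build on pairs a broadcast learning signal with a leaky per-synapse trace. Schematically (our notation, consistent with the original single-layer E‑prop rather than quoted from this paper):

$$
\Delta w_{ij}(t) \;=\; -\,\eta\, L_i(t)\, e_{ij}(t),
\qquad
e_{ij}(t) \;=\; \lambda\, e_{ij}(t-1) + \frac{\partial h_i(t)}{\partial w_{ij}},
$$

where $L_i(t)$ is the learning signal reaching unit $i$, $e_{ij}$ the eligibility trace, and $\lambda$ the leak of the integrator. The paper's contribution is the depth-aware version of the trace recursion, derived in the Methodology below.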

Methodology

  1. Start from RTRL: RTRL provides exact gradients for recurrent networks by maintaining a Jacobian of each hidden state with respect to every weight, a prohibitive $O(N^{3})$ operation.

  2. Introduce eligibility traces: E‑prop replaces the full Jacobian with a per‑synapse trace that accumulates local pre‑ and post‑synaptic quantities; weight updates then combine this trace with an error‑based "learning signal".

  3. Derive a depth recursion: The author extends the single‑layer eligibility dynamics by adding a term that transports the trace from layer ℓ + 1 down to layer ℓ. This yields a compact update:

    $$
    e_{ij}^{(\ell)}(t) \;=\; \underbrace{\frac{\partial h_i^{(\ell)}(t)}{\partial h_j^{(\ell)}(t-1)}}_{\text{temporal}}\, e_{ij}^{(\ell)}(t-1)
    \;+\; \underbrace{\frac{\partial h_i^{(\ell)}(t)}{\partial w_{ij}^{(\ell)}}}_{\text{instantaneous}}
    \;+\; \underbrace{\sum_k \frac{\partial h_i^{(\ell)}(t)}{\partial h_k^{(\ell+1)}(t)}\, e_{kj}^{(\ell+1)}(t)}_{\text{depth}}
    $$

    where $h^{(\ell)}$ denotes the hidden activations of layer $\ell$.

  4. Learning signal: A global error‑related scalar (e.g., the derivative of the loss w.r.t. the network output) is broadcast to all layers, preserving the “online” nature.

  5. Implementation: The recursion can be coded as a few extra tensor operations per time step, making it compatible with existing deep‑learning frameworks (PyTorch, JAX); a rough sketch follows this list.
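As a minimal NumPy sketch, here is how the recursion could be wired for two stacked tanh RNN layers. The layer ordering (layer 2 feeding layer 1, to match the ℓ+1 → ℓ transport in the equation), the diagonal approximation of the temporal Jacobian, and all names, dimensions, and constants are illustrative assumptions, not the paper's reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, lr = 8, 20, 1e-3          # units per layer, time steps, learning rate

# Two stacked tanh RNN layers. To match the equation's transport from
# layer l+1 down to layer l, layer 2 here feeds layer 1 (assumed convention).
W_in  = rng.normal(0.0, 0.3, (N, N))                      # input -> layer 2
W_21  = rng.normal(0.0, 0.3, (N, N))                      # layer 2 -> layer 1
W_rec = [rng.normal(0.0, 0.3, (N, N)) for _ in range(2)]  # recurrent, per layer

h = [np.zeros(N), np.zeros(N)]              # hidden states h^(1), h^(2)
e = [np.zeros((N, N)), np.zeros((N, N))]    # eligibility traces e^(l)_ij

for t in range(T):
    x = rng.normal(0.0, 1.0, N)             # streaming input at time t
    h_prev = [h[0].copy(), h[1].copy()]

    # Forward pass (single sweep; no unrolled history is ever stored).
    h[1] = np.tanh(W_rec[1] @ h_prev[1] + W_in @ x)
    h[0] = np.tanh(W_rec[0] @ h_prev[0] + W_21 @ h[1])

    # Layer 2 (no layer above it): standard single-layer e-prop trace,
    # i.e. temporal term (diagonal Jacobian approximation) + instantaneous term.
    psi2 = 1.0 - h[1] ** 2                  # tanh' at the new state
    e[1] = (psi2 * np.diag(W_rec[1]))[:, None] * e[1] + np.outer(psi2, h_prev[1])

    # Layer 1: temporal + instantaneous + depth transport from layer 2.
    psi1 = 1.0 - h[0] ** 2
    J_down = psi1[:, None] * W_21           # dh^(1)_i(t) / dh^(2)_k(t)
    e[0] = ((psi1 * np.diag(W_rec[0]))[:, None] * e[0]
            + np.outer(psi1, h_prev[0])
            + J_down @ e[1])                # sum_k J_ik * e^(2)_kj (depth term)

    # Broadcast learning signal (random stand-in for dLoss/dh) and a
    # purely local update: no backward sweep through time or depth.
    L = rng.normal(0.0, 0.1, N)
    for l in range(2):
        W_rec[l] -= lr * L[:, None] * e[l]
```

Each step touches only the current states and traces, which is exactly why the memory footprint stays constant in the sequence length.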

Results & Findings

| Task | Architecture | BPTT Accuracy | Deep‑E‑prop Accuracy | Training Time (per epoch) |
|------|--------------|---------------|----------------------|---------------------------|
| Sequential MNIST (pixel‑wise) | 3‑layer LSTM (256 units) | 98.2 % | 97.9 % | ≈ 1.0× BPTT |
| Adding Problem (length 200) | 2‑layer GRU (128 units) | 93.5 % | 92.8 % | ≈ 0.9× BPTT |
| Temporal Copy‑Task | 4‑layer vanilla RNN (64 units) | 99.1 % | 98.7 % | ≈ 0.8× BPTT |

  • Gradient fidelity: The mean‑squared error between deep‑E‑prop and exact RTRL gradients stays below 2 % across all layers, confirming the theoretical bound.
  • Memory usage: Deep‑E‑prop requires only the current hidden state and eligibility traces ($O(N)$ memory), a drastic reduction compared with BPTT's need to store the entire unrolled trajectory.
  • Scalability: Experiments with up to 10 stacked recurrent layers show stable learning, whereas naïve extensions of original E‑prop diverge.

Practical Implications

  • Edge & on‑device AI: The low‑memory, forward‑only nature makes deep‑E‑prop ideal for microcontrollers, neuromorphic chips, or any scenario where storing long histories is impossible.
  • Continual / streaming learning: Since updates happen online, models can adapt to non‑stationary data streams without replay buffers.
  • Neuromorphic hardware alignment: Eligibility traces map naturally onto local synaptic plasticity mechanisms (e.g., spike‑timing‑dependent plasticity with modulatory signals), opening a path for more brain‑inspired accelerators.
  • Simplified training pipelines: Developers can drop the "unroll‑and‑backward" step, reducing code complexity and enabling training loops that interleave inference and learning in real time (e.g., robotics control loops); see the sketch after this list.
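To make the interleaved loop concrete, here is a minimal single-layer sketch on a toy prediction stream. The task, dimensions, and learning rate are invented for illustration, and the trace update is the standard single-layer e-prop form rather than the paper's deep variant:

```python
import numpy as np

rng = np.random.default_rng(1)
N, lr = 16, 1e-2

W_rec = rng.normal(0.0, 0.2, (N, N))    # recurrent weights
w_in  = rng.normal(0.0, 0.5, N)         # input weights
w_out = rng.normal(0.0, 0.2, N)         # readout weights
h     = np.zeros(N)
trace = np.zeros((N, N))                # per-synapse eligibility trace

for step in range(10_000):              # endless stream: no replay buffer
    x = np.sin(0.05 * step)             # toy input signal
    target = np.sin(0.05 * (step + 5))  # predict the signal 5 steps ahead

    # Inference: the output y is available immediately at every step.
    h_prev = h
    h = np.tanh(W_rec @ h_prev + w_in * x)
    y = w_out @ h

    # Learning, interleaved with inference: leaky trace + local update.
    psi = 1.0 - h ** 2
    trace = (psi * np.diag(W_rec))[:, None] * trace + np.outer(psi, h_prev)
    err = y - target                    # scalar error as the learning signal
    W_rec -= lr * err * w_out[:, None] * trace  # signal x trace, forward-only
    w_out -= lr * err * h
```

Because nothing is unrolled, the same loop can run indefinitely on non-stationary data, adapting as the stream drifts.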

Limitations & Future Work

  • Approximation error: While small in the tested regimes, the error grows with extremely deep networks (> 20 layers) or highly chaotic dynamics, suggesting a need for adaptive trace decay.
  • Global learning signal: The current formulation still relies on a broadcast error term; future work could explore fully local error modulators or meta‑learned signals.
  • Benchmarks limited to synthetic tasks: Real‑world sequence problems (speech, language modeling) remain to be evaluated.
  • Hardware prototypes: The paper proposes a theoretical mapping to neuromorphic circuits but does not present a silicon implementation; experimental validation on such platforms is an open avenue.

Authors

  • Beren Millidge

Paper Information

  • arXiv ID: 2512.24506v1
  • Categories: cs.LG, cs.NE
  • Published: December 30, 2025
