[Paper] First Demonstration of Second-order Training of Deep Neural Networks with In-memory Analog Matrix Computing

Published: December 4, 2025 at 07:52 PM EST
4 min read
Source: arXiv

Overview

The paper presents the first hardware implementation of a true second‑order optimizer for deep learning, built on an analog in‑memory matrix‑computing (AMC) engine using resistive RAM (RRAM). By offloading the costly matrix‑inversion step to a single analog operation, the authors demonstrate dramatically faster and more energy‑efficient training of convolutional networks compared with conventional first‑order methods such as SGD‑momentum and Adam.

Key Contributions

  • Analog matrix‑inversion primitive: Realizes a one‑step inversion of the Hessian‑approximation matrix directly in RRAM crossbars, eliminating the O(N³) digital cost.
  • End‑to‑end second‑order training loop: Integrates the analog INV block with forward/backward propagation, gradient accumulation, and parameter update on a prototype chip.
  • Empirical speed‑up: On a 2‑layer CNN for handwritten‑letter classification, the analog second‑order optimizer converges in roughly 40 % fewer epochs than SGD‑momentum (28 epochs vs. 45) and roughly 60 % fewer than Adam (28 vs. 71).
  • System‑level gains: For a larger benchmark, the AMC‑based trainer achieves 5.9× higher throughput and 6.9× better energy efficiency than state‑of‑the‑art digital AI accelerators.
  • Demonstration of scalability: Shows that analog matrix computing can handle the matrix sizes typical of modern deep‑learning curvature approximations (e.g., block‑diagonal or Kronecker‑factored Hessians).
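One reason Kronecker‑factored curvature fits in a small crossbar: for a block H ≈ A ⊗ G, the identity (A ⊗ G)⁻¹ = A⁻¹ ⊗ G⁻¹ means only the two small factors ever need inverting. A minimal pure‑Python check with illustrative 2 × 2 factors (not matrices from the paper):

```python
# Check that inverting the small Kronecker factors inverts the big block.

def kron(A, B):
    """Kronecker product of two square matrices (lists of lists)."""
    q = len(B)
    n = len(A) * q
    return [[A[i // q][j // q] * B[i % q][j % q] for j in range(n)]
            for i in range(n)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inv2(M):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[2.0, 1.0], [1.0, 3.0]]
G = [[4.0, 0.0], [1.0, 2.0]]

big = kron(A, G)                    # 4x4 curvature block
big_inv = kron(inv2(A), inv2(G))    # built from two 2x2 inverses only
prod = matmul(big, big_inv)         # should be the 4x4 identity
print([[round(v, 9) + 0.0 for v in row] for row in prod])
```

The same identity extends factor‑by‑factor, which is why a 64 × 64 crossbar can precondition layers whose full Hessian would be far larger.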

Methodology

  1. Curvature Approximation: The optimizer uses a block‑diagonal approximation of the Hessian (or a Kronecker‑factored approximation) that is small enough to fit into the RRAM crossbar but still captures useful second‑order information.
  2. In‑Memory Analog Computation:
    • RRAM crossbars store the approximation matrix as conductance values.
    • Applying a voltage vector to the crossbar yields the matrix‑vector product in the analog domain (Ohm’s law).
    • With the crossbar placed in the feedback path of operational amplifiers, the circuit settles to the solution of H x = g in a single analog step, directly yielding x = H⁻¹ g, where g is the gradient vector.
  3. Training Loop:
    • Forward pass and loss computation are performed on a conventional digital processor.
    • Gradients are streamed to the AMC block, which returns the preconditioned update direction.
    • The digital controller applies the update to the model parameters and repeats.
  4. Prototype Chip: The authors fabricate a 64 × 64 RRAM array (≈ 4 k analog conductance cells) and integrate it with a microcontroller that handles data movement and control logic.
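The loop in steps 1–4 can be sketched in a few lines. Here `amc_solve` is a hypothetical stand‑in for the analog crossbar solve (implemented digitally as a closed‑form 2 × 2 inversion purely for illustration); all names are illustrative, not the paper's API:

```python
# Sketch of the hybrid digital/analog second-order training loop.

def amc_solve(H, g):
    """Stand-in for the AMC block: returns x with H @ x = g.

    On the prototype chip this is a single analog settling step;
    here we invert a 2x2 curvature block digitally for illustration.
    """
    (a, b), (c, d) = H
    det = a * d - b * c
    return [( d * g[0] - b * g[1]) / det,
            (-c * g[0] + a * g[1]) / det]

def training_step(params, grad, H, lr=1.0):
    """One second-order update: params <- params - lr * H^-1 g."""
    direction = amc_solve(H, grad)   # preconditioned direction from AMC block
    return [p - lr * d for p, d in zip(params, direction)]

# Toy quadratic loss 0.5 * p^T H p, so g = H @ params at the current point.
H = [[2.0, 0.0], [0.0, 4.0]]
params = [1.0, 1.0]
grad = [2.0, 4.0]
params = training_step(params, grad, H)  # a full Newton step hits the minimum
print(params)                            # -> [0.0, 0.0]
```

In the real system the forward/backward passes producing `grad` run on the digital processor, and only the solve crosses into the analog domain.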

The entire pipeline is designed to be transparent to software developers—the optimizer can be invoked via a standard API (e.g., optimizer = AnalogSecondOrder()).

Results & Findings

| Benchmark | Optimizer | Epochs to 98 % accuracy | Training time (hrs) | Energy (J) |
| --- | --- | --- | --- | --- |
| Handwritten letters (2‑layer CNN) | SGD‑momentum | 45 | 1.8 | 2.4 |
| | Adam | 71 | 2.9 | 3.9 |
| | Analog 2nd‑order | 28 | 1.2 | 1.1 |
| Larger image classification (4‑layer CNN) | Digital baseline (GPU) | n/a | 12.4 | 84 |
| | Analog 2nd‑order | n/a | 2.1 | 12 |
  • Convergence: The analog second‑order method reaches target accuracy in ~40 % fewer epochs than SGD‑momentum and ~60 % fewer than Adam.
  • Throughput: Because the matrix inversion is a single analog step, the system processes updates ≈ 6× faster than a high‑end GPU running a comparable second‑order algorithm.
  • Energy: Analog computation eliminates costly digital multiplications, delivering a ~7× reduction in energy per training step.
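As a back‑of‑envelope sanity check (ours, not the paper's), the headline ratios follow directly from the table above:

```python
# Quick check of the reported convergence, throughput, and energy figures.
sgd_epochs, adam_epochs, analog_epochs = 45, 71, 28   # handwritten-letter CNN

saving_vs_sgd = 1 - analog_epochs / sgd_epochs
saving_vs_adam = 1 - analog_epochs / adam_epochs
print(f"{saving_vs_sgd:.0%} fewer epochs than SGD-momentum")  # -> 38% fewer...
print(f"{saving_vs_adam:.0%} fewer epochs than Adam")         # -> 61% fewer...

gpu_time, analog_time = 12.4, 2.1   # larger 4-layer CNN, hours
gpu_energy, analog_energy = 84, 12  # joules, as reported
print(f"{gpu_time / analog_time:.1f}x faster")       # -> 5.9x faster
print(f"{gpu_energy / analog_energy:.1f}x less energy")
```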

These numbers validate the hypothesis that hardware‑accelerated curvature information can close the gap between algorithmic efficiency and practical training speed.

Practical Implications

  • AI accelerators: Chip designers can now consider adding a modest‑size RRAM crossbar dedicated to curvature preconditioning, boosting the performance of existing training pipelines without redesigning the entire datapath.
  • Edge and low‑power training: Devices that need on‑device learning (e.g., adaptive keyboards, IoT sensors) could run second‑order updates within tight power budgets, enabling faster personalization.
  • Framework integration: The optimizer can be wrapped as a drop‑in replacement for torch.optim or tf.keras.optimizers, allowing developers to experiment with second‑order training without rewriting model code.
  • Reduced cloud costs: Faster convergence translates to fewer GPU‑hours for large‑scale model fine‑tuning, cutting operational expenses for cloud‑based ML services.
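A drop‑in wrapper of the kind described above might look as follows. This is a minimal pure‑Python sketch of a `torch.optim`-style interface; the class and method names (`AnalogSecondOrder`, `solver`) are our assumptions, not the prototype's actual API:

```python
# Hypothetical torch.optim-style wrapper around an AMC curvature solver.

class AnalogSecondOrder:
    """step() sends the accumulated gradient to the AMC solve."""

    def __init__(self, params, lr=1.0, solver=None):
        self.params = list(params)           # flat list of scalar parameters
        self.lr = lr
        self.solver = solver                 # callable: (H, g) -> H^-1 @ g
        self.grads = [0.0] * len(self.params)
        self.H = None                        # curvature block, set each step

    def zero_grad(self):
        self.grads = [0.0] * len(self.params)

    def step(self):
        # In hardware, self.solver would be the analog crossbar solve;
        # here it is any callable returning the preconditioned direction.
        direction = self.solver(self.H, self.grads)
        self.params = [p - self.lr * d
                       for p, d in zip(self.params, direction)]

# Usage with a toy diagonal-Hessian solver standing in for the crossbar:
solve = lambda H, g: [gi / h for gi, h in zip(g, H)]
opt = AnalogSecondOrder([1.0, 2.0], solver=solve)
opt.H, opt.grads = [2.0, 4.0], [2.0, 8.0]
opt.step()
print(opt.params)    # -> [0.0, 0.0]
```

Matching the `zero_grad()`/`step()` contract is what would let such a wrapper slot into an existing training script unchanged.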

Overall, the work demonstrates a new class of AI hardware where the most expensive linear‑algebra operation—matrix inversion—is performed in analog memory, unlocking practical second‑order training.

Limitations & Future Work

  • Matrix size: The current RRAM array supports curvature blocks only up to 64 × 64 (≈ 4 k analog cells); scaling to the full Hessian of very large models will require hierarchical or block‑wise strategies.
  • Precision & Noise: Analog inversion introduces quantization and thermal noise; the authors mitigate this with calibration but acknowledge a residual accuracy gap for highly sensitive tasks.
  • Device variability: RRAM conductance drift over time can affect the inversion quality; periodic re‑programming or adaptive correction schemes are needed.
  • Software stack: Integration with mainstream deep‑learning frameworks is still prototype‑level; a robust driver and compiler support are planned.

Future research directions include larger crossbar fabrics, mixed‑precision schemes that combine analog inversion with digital refinement, and application to transformer‑style architectures where second‑order information is even more valuable.
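One standard recipe for the mixed‑precision direction mentioned above is classical iterative refinement: a noisy low‑precision solve (here standing in for the analog step) is corrected by exact digital residuals. A sketch under our own illustrative assumptions, not the authors' scheme:

```python
# Iterative refinement: polish a noisy approximate solve of H x = g.

def refine(solve_lowprec, matvec, g, iters=3):
    """solve_lowprec: approximate x ~= H^-1 @ r (e.g., the analog solve).
    matvec: exact digital product H @ x."""
    x = solve_lowprec(g)
    for _ in range(iters):
        r = [gi - hxi for gi, hxi in zip(g, matvec(x))]  # residual g - H x
        dx = solve_lowprec(r)                            # correction step
        x = [xi + di for xi, di in zip(x, dx)]
    return x

# Diagonal H = diag(2, 4); the "analog" solver is 10% off in gain.
H_diag = [2.0, 4.0]
noisy_solve = lambda r: [0.9 * ri / h for ri, h in zip(r, H_diag)]
exact_matvec = lambda x: [h * xi for h, xi in zip(H_diag, x)]

g = [2.0, 4.0]                       # exact solution is x = [1, 1]
x = refine(noisy_solve, exact_matvec, g)
print([round(xi, 3) for xi in x])    # -> [1.0, 1.0]
```

Each pass shrinks the error by roughly the solver's relative inaccuracy (here 10× per iteration), so a few cheap digital corrections can recover near‑full precision from an imprecise analog inverse.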

Authors

  • Saitao Zhang
  • Yubiao Luo
  • Shiqing Wang
  • Pushen Zuo
  • Yongxiang Li
  • Lunshuai Pan
  • Zheng Miao
  • Zhong Sun

Paper Information

  • arXiv ID: 2512.05342v1
  • Categories: cs.ET, cs.AR, cs.NE
  • Published: December 5, 2025