[Paper] Joint Hardware-Workload Co-Optimization for In-Memory Computing Accelerators
Source: arXiv - 2603.03880v1
Overview
The paper introduces a joint hardware‑workload co‑optimization framework for in‑memory computing (IMC) accelerators that can efficiently run multiple neural‑network models on the same chip. By moving beyond single‑workload design, the authors demonstrate that a single, more general‑purpose IMC platform can achieve near‑optimal energy, speed, and area performance across a diverse set of AI workloads.
Key Contributions
- Cross‑workload co‑design methodology that simultaneously optimizes hardware parameters and neural‑network mapping strategies.
- Evolutionary‑algorithm‑based optimizer tailored to the trade‑offs among energy, latency, and silicon area, captured by the energy‑delay‑area product (EDAP).
- Unified framework applicable to both RRAM‑ and SRAM‑based IMC fabrics, showing hardware‑agnostic flexibility.
- Empirical validation on 4‑workload (small) and 9‑workload (large) benchmark suites, achieving up to 76 % (small set) and 95 % (large set) EDAP reduction versus workload‑specific baselines.
- Open‑source release of the entire optimization stack, enabling reproducibility and community extensions.
Methodology
- Design Space Definition – The authors enumerate configurable hardware knobs (e.g., crossbar array size, bit precision, ADC/DAC resolution, and other peripheral circuitry) and workload mapping choices (layer tiling, data quantization, sparsity exploitation).
- Multi‑objective Evolutionary Search – An adapted NSGA‑II algorithm evaluates candidate hardware‑workload pairs against three objectives: energy, latency, and area. The fitness function aggregates them into the Energy‑Delay‑Area Product (EDAP).
- Cross‑Workload Fitness Aggregation – Instead of optimizing for a single model, the algorithm computes a weighted EDAP across all target workloads, forcing the search to favor designs that perform well on average while still respecting worst‑case constraints.
- Hardware‑Aware Simulation Loop – Each candidate is fed into a fast, cycle‑accurate IMC simulator (supporting both RRAM and SRAM crossbars) that estimates power, timing, and layout area, feeding back into the evolutionary loop.
- Pareto Extraction & Selection – The final Pareto front is examined, and the design with the best trade‑off (lowest aggregated EDAP) is selected as the “general‑purpose” accelerator.
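The cross‑workload fitness aggregation described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual code: the `Metrics` container, function names, and the hard worst‑case cap are our assumptions about how a weighted EDAP with worst‑case constraints might be wired together.

```python
# Hypothetical sketch of cross-workload EDAP fitness aggregation.
# Names and structure are assumptions, not the paper's released API.
from dataclasses import dataclass

@dataclass
class Metrics:
    energy_j: float   # total inference energy (J)
    latency_s: float  # end-to-end latency (s)
    area_mm2: float   # silicon area (mm^2)

def edap(m: Metrics) -> float:
    """Energy-Delay-Area Product: lower is better."""
    return m.energy_j * m.latency_s * m.area_mm2

def cross_workload_fitness(per_workload: list[Metrics],
                           weights: list[float],
                           worst_case_cap: float) -> float:
    """Weighted-average EDAP across all target workloads.

    Returns float('inf') if any single workload exceeds the cap, so the
    search discards designs that sacrifice one model's viability for
    better average performance.
    """
    edaps = [edap(m) for m in per_workload]
    if max(edaps) > worst_case_cap:
        return float('inf')
    return sum(w * e for w, e in zip(weights, edaps)) / sum(weights)
```

In an NSGA‑II loop, this scalar would typically rank candidates whose per‑objective metrics come back from the simulator; the cap models the "worst‑case constraints" the paper mentions.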
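The final Pareto extraction and selection step can likewise be illustrated with plain non‑dominated filtering. This is a minimal sketch under the assumption that each candidate is scored on the three minimized objectives (energy, latency, area); it omits the full NSGA‑II machinery (crowding distance, tournament selection).

```python
# Minimal Pareto-front extraction over (energy, latency, area) tuples.
# All objectives are minimized; this is illustrative, not the paper's code.

def dominates(a: tuple, b: tuple) -> bool:
    """True if a is no worse than b on every objective and strictly
    better on at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(candidates: list[tuple]) -> list[tuple]:
    """Keep only candidates not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o != c)]

def select_design(candidates: list[tuple]) -> tuple:
    """From the front, pick the lowest-EDAP design (product of the
    three objectives), mirroring the selection rule described above."""
    front = pareto_front(candidates)
    return min(front, key=lambda c: c[0] * c[1] * c[2])
```

For example, among candidates `(1, 2, 3)`, `(2, 2, 3)`, and `(3, 1, 1)`, the second is dominated by the first, and the third wins selection with an EDAP of 3.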
Results & Findings
| Benchmark Set | Baseline EDAP (single‑workload designs) | Joint Co‑opt EDAP (relative to baseline) | EDAP Reduction |
|---|---|---|---|
| 4 workloads | – (varies per model) | 24 % of baseline | ≈ 76 % |
| 9 workloads | – (varies per model) | 4.5 % of baseline | ≈ 95 % |
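The "Reduction" column is simply the complement of the relative EDAP; a one‑function sanity check (the function name is ours, not the paper's):

```python
# Relative EDAP (fraction of the single-workload baseline) -> % reduction.
def edap_reduction(relative_edap: float) -> float:
    """Percent EDAP saved when the joint design's EDAP is the given
    fraction of the per-workload baseline."""
    return (1.0 - relative_edap) * 100.0

small_set = edap_reduction(0.24)   # 4-workload suite (≈ 76 % in the table)
large_set = edap_reduction(0.045)  # 9-workload suite (≈ 95 % in the table)
```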
- Robustness Across Technologies – Both RRAM and SRAM implementations showed similar relative gains, confirming that the approach is not tied to a specific memory technology.
- Area Savings – Optimized designs often required fewer peripheral ADCs/DACs because the algorithm learned to balance precision needs across workloads.
- Latency Trade‑offs – While some workloads experienced modest latency increases (≈ 5‑10 %), the overall EDAP improvement outweighed these penalties.
- Scalability – Adding more workloads to the optimization set continued to improve the generalized design's efficiency, with diminishing returns appearing only beyond a certain diversity threshold.
Practical Implications
- One‑Chip Multi‑Model Deployments – Device manufacturers can ship a single IMC accelerator that serves edge AI devices (e.g., smart cameras, IoT sensors) running a portfolio of models without needing custom silicon per application.
- Reduced NRE Costs – By avoiding per‑model ASIC designs, companies can lower non‑recurring engineering expenses and accelerate time‑to‑market.
- Energy‑Constrained Edge – The dramatic EDAP reductions translate directly into longer battery life for wearables and remote sensors that rely on in‑memory AI inference.
- Design Automation Integration – The open‑source framework can be plugged into existing EDA flows, allowing hardware teams to co‑optimize with software engineers early in the product development cycle.
- Technology‑agnostic Portability – Since the method works for both emerging RRAM and mature SRAM crossbars, it future‑proofs designs against shifts in memory technology roadmaps.
Limitations & Future Work
- Simulation Fidelity – The study relies on analytical power/area models; real silicon measurements could reveal additional parasitics not captured.
- Workload Diversity – Benchmarks focus on convolutional neural networks; extending to transformers, graph neural networks, or spiking models may require new hardware knobs.
- Dynamic Reconfiguration – The current framework yields a static hardware configuration; exploring runtime‑adaptive crossbar sizing or precision scaling could further close the gap to workload‑specific designs.
- Manufacturing Variability – Process variations in emerging RRAM devices can affect yield; incorporating statistical robustness into the optimizer is a promising next step.
Overall, this research provides a concrete pathway for building versatile, high‑performance IMC accelerators that meet the real‑world demands of multi‑model AI deployment.
Authors
- Olga Krestinskaya
- Mohammed E. Fouda
- Ahmed Eltawil
- Khaled N. Salama
Paper Information
- arXiv ID: 2603.03880v1
- Categories: cs.AR, cs.AI, cs.ET, cs.NE, eess.SY
- Published: March 4, 2026