[Paper] Joint Hardware-Workload Co-Optimization for In-Memory Computing Accelerators
Source: arXiv - 2603.03880v1
Overview
The paper introduces a joint hardware‑workload co‑optimization framework for in‑memory computing (IMC) accelerators that can efficiently run multiple neural‑network models on the same chip. By moving beyond single‑workload design, the authors demonstrate that a single, more general‑purpose IMC platform can achieve near‑optimal energy, speed, and area performance across a diverse set of AI workloads.
Key Contributions
- Cross‑workload co‑design methodology that simultaneously optimizes hardware parameters and neural‑network mapping strategies.
- Evolutionary‑algorithm‑based optimizer tailored to the trade‑offs among energy, latency, and silicon area, captured by the energy‑delay‑area product (EDAP).
- Unified framework applicable to both RRAM‑ and SRAM‑based IMC fabrics, showing hardware‑agnostic flexibility.
- Empirical validation on 4‑workload (small) and 9‑workload (large) benchmark suites, achieving up to 76 % (small set) and 95 % (large set) EDAP reduction versus workload‑specific baselines.
- Open‑source release of the entire optimization stack, enabling reproducibility and community extensions.
Methodology
- Design Space Definition – The authors enumerate configurable hardware knobs (e.g., crossbar array size, bit precision, ADC/DAC resolution, and other peripheral circuitry) and workload mapping choices (layer tiling, data quantization, sparsity exploitation).
- Multi‑objective Evolutionary Search – An adapted NSGA‑II algorithm evaluates candidate hardware‑workload pairs against three objectives: energy, latency, and area. The fitness function aggregates them into the Energy‑Delay‑Area Product (EDAP).
- Cross‑Workload Fitness Aggregation – Instead of optimizing for a single model, the algorithm computes a weighted EDAP across all target workloads, forcing the search to favor designs that perform well on average while still respecting worst‑case constraints.
- Hardware‑Aware Simulation Loop – Each candidate is fed into a fast, cycle‑accurate IMC simulator (supporting both RRAM and SRAM crossbars) that estimates power, timing, and layout area, feeding back into the evolutionary loop.
- Pareto Extraction & Selection – The final Pareto front is examined, and the design with the best trade‑off (lowest aggregated EDAP) is selected as the “general‑purpose” accelerator.
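The cross‑workload fitness aggregation described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual code: the `Metrics` container, function names, and the hard worst‑case cap are our assumptions about how a weighted EDAP with worst‑case constraints might be wired together.

```python
# Hypothetical sketch of cross-workload EDAP fitness aggregation.
# Names and structure are assumptions, not the paper's released API.
from dataclasses import dataclass

@dataclass
class Metrics:
    energy_j: float   # total inference energy (J)
    latency_s: float  # end-to-end latency (s)
    area_mm2: float   # silicon area (mm^2)

def edap(m: Metrics) -> float:
    """Energy-Delay-Area Product: lower is better."""
    return m.energy_j * m.latency_s * m.area_mm2

def cross_workload_fitness(per_workload: list[Metrics],
                           weights: list[float],
                           worst_case_cap: float) -> float:
    """Weighted-average EDAP across all target workloads.

    Returns float('inf') if any single workload exceeds the cap, so the
    search discards designs that sacrifice one model's viability for
    better average performance.
    """
    edaps = [edap(m) for m in per_workload]
    if max(edaps) > worst_case_cap:
        return float('inf')
    return sum(w * e for w, e in zip(weights, edaps)) / sum(weights)
```

In an NSGA‑II loop, this scalar would typically rank candidates whose per‑objective metrics come back from the simulator; the cap models the "worst‑case constraints" the paper mentions.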
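The final Pareto extraction and selection step can likewise be illustrated with plain non‑dominated filtering. This is a minimal sketch under the assumption that each candidate is scored on the three minimized objectives (energy, latency, area); it omits the full NSGA‑II machinery (crowding distance, tournament selection).

```python
# Minimal Pareto-front extraction over (energy, latency, area) tuples.
# All objectives are minimized; this is illustrative, not the paper's code.

def dominates(a: tuple, b: tuple) -> bool:
    """True if a is no worse than b on every objective and strictly
    better on at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(candidates: list[tuple]) -> list[tuple]:
    """Keep only candidates not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o != c)]

def select_design(candidates: list[tuple]) -> tuple:
    """From the front, pick the lowest-EDAP design (product of the
    three objectives), mirroring the selection rule described above."""
    front = pareto_front(candidates)
    return min(front, key=lambda c: c[0] * c[1] * c[2])
```

For example, among candidates `(1, 2, 3)`, `(2, 2, 3)`, and `(3, 1, 1)`, the second is dominated by the first, and the third wins selection with an EDAP of 3.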
Results & Findings
| Benchmark Set | Baseline EDAP (single‑workload designs) | Joint Co‑opt EDAP (relative to baseline) | EDAP Reduction |
|---|---|---|---|
| 4 workloads | – (varies per model) | 24 % of baseline | ≈ 76 % |
| 9 workloads | – (varies per model) | 4.5 % of baseline | ≈ 95 % |
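The "Reduction" column is simply the complement of the relative EDAP; a one‑function sanity check (the function name is ours, not the paper's):

```python
# Relative EDAP (fraction of the single-workload baseline) -> % reduction.
def edap_reduction(relative_edap: float) -> float:
    """Percent EDAP saved when the joint design's EDAP is the given
    fraction of the per-workload baseline."""
    return (1.0 - relative_edap) * 100.0

small_set = edap_reduction(0.24)   # 4-workload suite (≈ 76 % in the table)
large_set = edap_reduction(0.045)  # 9-workload suite (≈ 95 % in the table)
```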
- Robustness Across Technologies – Both RRAM and SRAM implementations showed similar relative gains, confirming that the approach is not tied to a specific memory technology.
- Area Savings – Optimized designs often required fewer peripheral ADCs/DACs because the algorithm learned to balance precision needs across workloads.
- Latency Trade‑offs – While some workloads experienced modest latency increases (≈ 5‑10 %), the overall EDAP improvement outweighed these penalties.
- Scalability – Adding more workloads to the optimization set continued to improve the generalized design's efficiency, with diminishing returns appearing only beyond a certain diversity threshold.
Practical Implications
- One‑Chip Multi‑Model Deployments – Device manufacturers can ship a single IMC accelerator that serves edge AI devices (e.g., smart cameras, IoT sensors) running a portfolio of models without needing custom silicon per application.
- Reduced NRE Costs – By avoiding per‑model ASIC designs, companies can lower non‑recurring engineering expenses and accelerate time‑to‑market.
- Energy‑Constrained Edge – The dramatic EDAP reductions translate directly into longer battery life for wearables and remote sensors that rely on in‑memory AI inference.
- Design Automation Integration – The open‑source framework can be plugged into existing EDA flows, allowing hardware teams to co‑optimize with software engineers early in the product development cycle.
- Technology‑agnostic Portability – Since the method works for both emerging RRAM and mature SRAM crossbars, it future‑proofs designs against shifts in memory technology roadmaps.
Limitations & Future Work
- Simulation Fidelity – The study relies on analytical power/area models; real silicon measurements could reveal additional parasitics not captured.
- Workload Diversity – Benchmarks focus on convolutional neural networks; extending to transformers, graph neural networks, or spiking models may require new hardware knobs.
- Dynamic Reconfiguration – The current framework yields a static hardware configuration; exploring runtime‑adaptive crossbar sizing or precision scaling could further close the gap to workload‑specific designs.
- Manufacturing Variability – Process variations in emerging RRAM devices can affect yield; incorporating statistical robustness into the optimizer is a promising next step.
Overall, this research provides a concrete pathway for building versatile, high‑performance IMC accelerators that meet the real‑world demands of multi‑model AI deployment.
Authors
- Olga Krestinskaya
- Mohammed E. Fouda
- Ahmed Eltawil
- Khaled N. Salama
Paper Information
- arXiv ID: 2603.03880v1
- Categories: cs.AR, cs.AI, cs.ET, cs.NE, eess.SY
- Published: March 4, 2026