[Paper] Bitwise Systolic Array Architecture for Runtime-Reconfigurable Multi-precision Quantized Multiplication on Hardware Accelerators
Source: arXiv - 2602.23334v1
Overview
The paper introduces a runtime‑reconfigurable bitwise systolic array that can execute mixed‑precision quantized neural networks (QNNs) on FPGA‑based accelerators. By allowing each layer's precision to be changed on the fly, the design bridges the gap between the resource savings of low‑bit quantization and the accuracy of higher‑bit representations, an increasingly important trade‑off for edge AI devices.
Key Contributions
- Bitwise Systolic Array Architecture that natively supports multiple precisions (e.g., 2‑, 4‑, 8‑bit) without redesigning the datapath.
- Runtime Reconfiguration Logic enabling per‑layer precision switches during inference, eliminating the need for separate hardware builds for each QNN variant.
- Multi‑Channel Parallelism that processes several activation‑weight bit‑streams simultaneously, preserving throughput despite the added flexibility.
- Prototype on Ultra96 FPGA demonstrating 1.32×–3.57× speed‑up over fixed‑precision baselines and a higher achievable clock (250 MHz) thanks to reduced critical‑path delay.
- Comprehensive Evaluation on mixed‑precision models, showing that the architecture scales with both model size and precision diversity.
Methodology
- Bitwise Decomposition – Multiplications are expressed as a series of bit‑wise AND and popcount operations. This representation works uniformly for any integer bit‑width, so the same hardware can compute 2‑bit, 4‑bit, or 8‑bit products.
- Systolic Dataflow – The array is organized as a classic systolic grid where partial results flow diagonally across processing elements (PEs). Each PE contains a small bit‑wise multiplier, an accumulator, and a control unit that interprets the current precision setting.
- Precision Control Registers – A lightweight configuration register per PE (or per column) tells the PE how many bit‑planes to consume for the current layer. The control logic masks irrelevant bits, effectively “turning off” unused precision lanes.
- Runtime Scheduler – A software‑driven scheduler (running on the host CPU or a lightweight on‑chip controller) programs the precision registers before each layer’s execution, allowing the same hardware to serve heterogeneous QNNs in a single inference pass.
- Implementation on Ultra96 – The design is described in Verilog/SystemVerilog, synthesized for the Xilinx Zynq‑MPSoC on the Ultra96 board, and integrated with a simple DMA‑based data mover to feed activations and weights into the systolic array.
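The bitwise decomposition above can be sketched in software. The following is a minimal, hypothetical Python model of the arithmetic only (not the paper's Verilog): each bit plane packs one bit of every vector element into a single machine word, and a dot product accumulates shifted popcounts of ANDed planes. Because the loop bounds are just the bit‑widths, the same routine handles 2‑, 4‑, or 8‑bit operands, which is the property the PEs exploit.

```python
def bit_planes(vec, bits):
    """Decompose a vector of unsigned ints into per-bit-plane bitmasks.
    Plane j packs bit j of every element (element i lands in bit i of the mask)."""
    planes = []
    for j in range(bits):
        mask = 0
        for i, v in enumerate(vec):
            mask |= ((v >> j) & 1) << i
        planes.append(mask)
    return planes

def bitwise_dot(acts, wts, a_bits, w_bits):
    """Dot product via AND + popcount over bit planes; works for any bit-widths."""
    a_planes = bit_planes(acts, a_bits)
    w_planes = bit_planes(wts, w_bits)
    total = 0
    for j, ap in enumerate(a_planes):
        for k, wp in enumerate(w_planes):
            # popcount of the plane intersection, weighted by 2^(j+k)
            total += bin(ap & wp).count("1") << (j + k)
    return total

# Example: 2-bit activations x 2-bit weights
print(bitwise_dot([3, 1, 2], [2, 3, 1], 2, 2))  # 3*2 + 1*3 + 2*1 = 11
```

Changing `a_bits`/`w_bits` per call mirrors what the precision control registers do per layer: unused bit planes are simply never consumed.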
Results & Findings
| Metric | Fixed‑Precision Baseline | Proposed Reconfigurable Array |
|---|---|---|
| Inference Latency (mixed‑precision ResNet‑18) | 12.4 ms | 3.5 ms (≈3.57× speed‑up) |
| Throughput (images/s) | 80 | 285 |
| Critical Path Delay | 4.2 ns | 3.9 ns (≈7 % reduction) |
| Clock Frequency | 200 MHz | 250 MHz |
| Resource Utilization (LUTs/BRAM) | 45 % / 30 % | 48 % / 32 % (modest increase for flexibility) |
The results confirm that the bitwise systolic array can maintain or improve performance while supporting per‑layer precision changes. Accuracy loss typical of aggressive low‑bit quantization is mitigated because higher‑precision layers can be kept where they matter most.
Practical Implications
- Edge Device Deployments – Manufacturers can ship a single FPGA accelerator that adapts to different QNN models (e.g., a low‑power camera vs. a more demanding object detector) without hardware redesign.
- Dynamic Power Management – Lower‑precision layers consume fewer toggles and less dynamic power; the runtime reconfiguration enables fine‑grained power‑accuracy trade‑offs based on battery level or thermal constraints.
- Rapid Model Iteration – Data‑science teams can experiment with mixed‑precision schedules (e.g., via TensorFlow Lite’s quantization‑aware training) and immediately benchmark on the same silicon, shortening the hardware‑software co‑design loop.
- Scalable Cloud‑Edge Continuum – The same architecture can be instantiated on larger FPGAs for data‑center inference or scaled down for micro‑controllers, providing a unified programming model across the stack.
- Software Tooling – The design encourages the development of compiler passes that automatically emit per‑layer precision metadata, feeding directly into the runtime scheduler.
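As a sketch of how per‑layer precision metadata might drive the runtime scheduler, the fragment below models the flow in Python. The schedule format, register map, and layer names are illustrative assumptions, not the paper's actual software interface.

```python
# Hypothetical per-layer precision schedule, as a compiler pass might emit it.
PRECISION_SCHEDULE = [
    {"layer": "conv1", "a_bits": 8, "w_bits": 8},  # keep the first layer high-precision
    {"layer": "conv2", "a_bits": 4, "w_bits": 4},
    {"layer": "fc",    "a_bits": 2, "w_bits": 4},  # aggressive quantization where tolerable
]

class PrecisionRegisters:
    """Stand-in for the per-PE/per-column precision control registers."""
    def __init__(self):
        self.a_bits = None
        self.w_bits = None

    def program(self, a_bits, w_bits):
        # The prototype supports 2-, 4-, and 8-bit operands.
        assert a_bits in (2, 4, 8) and w_bits in (2, 4, 8)
        self.a_bits, self.w_bits = a_bits, w_bits

def run_inference(schedule, regs, execute_layer):
    """Reconfigure precision before each layer, then dispatch it to the array."""
    for cfg in schedule:
        regs.program(cfg["a_bits"], cfg["w_bits"])
        execute_layer(cfg["layer"])
```

In the real system the `program` call would be a register write performed by the host CPU or on‑chip controller before the layer's DMA transfers begin.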
Limitations & Future Work
- Resource Overhead – Supporting the full precision range costs roughly 3 percentage points more LUTs and 2 points more BRAM than a single‑precision design (45 %→48 % and 30 %→32 % on the Ultra96); ultra‑resource‑constrained devices may still need a dedicated low‑bit accelerator.
- Precision Granularity – The current implementation toggles precision at the layer level; finer granularity (e.g., per‑channel or per‑neuron) is not yet supported.
- Toolchain Integration – The authors note that integration with mainstream deep‑learning compilers (TVM, ONNX Runtime) is still manual; automating this pipeline is a key next step.
- Broader Benchmark Suite – Evaluation focused on a handful of CNNs; extending tests to transformers, graph neural networks, and non‑vision workloads would validate generality.
Overall, the paper demonstrates that bitwise systolic arrays with runtime reconfigurability are a practical path toward flexible, high‑performance mixed‑precision AI accelerators—an advancement that could simplify deployment pipelines and unlock smarter edge devices.
Authors
- Yuhao Liu
- Salim Ullah
- Akash Kumar
Paper Information
- arXiv ID: 2602.23334v1
- Categories: cs.AR, cs.AI
- Published: February 26, 2026