[Paper] CORVET: A CORDIC-Powered, Resource-Frugal Mixed-Precision Vector Processing Engine for High-Throughput AIoT applications

Published: February 22, 2026 at 11:51 AM EST
4 min read
Source: arXiv - 2602.19268v1

Overview

A new paper introduces CORVET, a mixed‑precision vector processing engine that uses a CORDIC‑based multiply‑accumulate (MAC) unit to deliver high‑throughput AI inference on ultra‑low‑resource edge devices. By dynamically switching between approximate and accurate computation modes, CORVET squeezes up to 4× more operations per second out of the same silicon area, making it a strong candidate for Artificial Intelligence of Things (AIoT) workloads such as object detection and classification.

Key Contributions

  • CORDIC‑powered MAC: An iterative, resource‑frugal MAC that cuts latency by up to 33 % and power by 21 % compared with conventional multipliers.
  • Runtime‑adaptive precision: Supports 4‑, 8‑, and 16‑bit data widths and can toggle between approximate (fast, low‑accuracy) and accurate (slow, high‑accuracy) modes on the fly.
  • Time‑multiplexed vector engine: A 256‑PE (processing element) array that reuses hardware across vector lanes, achieving 4.83 TOPS/mm² compute density and 11.67 TOPS/W energy efficiency.
  • Lightweight pooling & normalization block: Integrated post‑processing that avoids extra memory traffic and keeps the data path tight.
  • Hardware‑software co‑design flow: Demonstrated on a Pynq‑Z2 FPGA platform for real‑world object detection/classification pipelines, showing end‑to‑end scalability.
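
The time‑multiplexing idea in the vector‑engine bullet can be sketched as a behavioral model (this is not the paper's RTL; the `n_pe` width, the round‑robin lane schedule, and the function name are illustrative assumptions):

```python
def time_multiplexed_dot(lanes_a, lanes_b, n_pe=4):
    """One physical n_pe-wide MAC row is reused across vector lanes:
    each 'clock' it services the next chunk of one lane, so throughput
    scales with lanes without replicating MAC hardware per lane."""
    acc = [0] * len(lanes_a)   # one accumulator per logical lane
    pos = [0] * len(lanes_a)   # per-lane read position
    done = False
    while not done:
        done = True
        for lane in range(len(lanes_a)):        # round-robin over lanes
            p = pos[lane]
            if p < len(lanes_a[lane]):
                chunk_a = lanes_a[lane][p:p + n_pe]
                chunk_b = lanes_b[lane][p:p + n_pe]
                acc[lane] += sum(x * y for x, y in zip(chunk_a, chunk_b))
                pos[lane] = p + n_pe
                done = False
    return acc
```

For example, two lanes of different lengths share the same MAC row and still produce independent dot products: `time_multiplexed_dot([[1, 2, 3], [4, 5]], [[1, 1, 1], [2, 2]], n_pe=2)` yields `[6, 18]`.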

Methodology

The authors built a mixed‑precision vector engine around a CORDIC (Coordinate Rotation Digital Computer) unit, which computes multiplications through a series of shift‑add iterations rather than full‑width multipliers. This yields a smaller, power‑lean MAC cell.
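
The shift‑add idea can be illustrated with a linear‑mode CORDIC multiply (a floating‑point behavioral sketch for readability; the actual hardware would use fixed‑point shifts, and the default iteration count here is an assumption, not the paper's):

```python
def cordic_multiply(a: float, b: float, iterations: int = 16) -> float:
    """Approximate a*b with linear-mode CORDIC: only additions and
    multiplications by powers of two (shifts in hardware), no full-width
    multiplier. Converges for |b| <= 1; error shrinks as ~|a| * 2**-n,
    so fewer iterations trade accuracy for latency."""
    acc = 0.0   # accumulates the product
    z = b       # residual of the multiplier operand
    for i in range(1, iterations + 1):
        step = 2.0 ** -i             # a right-shift by i bits in hardware
        d = 1.0 if z >= 0 else -1.0  # iteration direction
        acc += d * a * step          # shift-add: (a >> i), added or subtracted
        z -= d * step                # drive the residual toward zero
    return acc
```

The iteration count is the knob behind the approximate/accurate trade‑off described next: truncating the loop gives a faster but coarser product.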

Key architectural tricks

  1. Dynamic Mode Switching – A control FSM selects either an approximate CORDIC configuration (fewer iterations, lower latency) or a full‑accuracy configuration (more iterations) based on the current layer’s tolerance to error.
  2. Vectorisation & Time‑Multiplexing – A single MAC array is shared across multiple vector lanes; the engine cycles through lanes each clock, effectively multiplying throughput without replicating hardware.
  3. Precision Scaling – Input operands are quantized to 4/8/16 bits on the fly; the CORDIC pipeline automatically adapts its shift‑add schedule to the chosen bit‑width, keeping latency proportional to precision.
  4. Co‑design with Software – The authors extended a compiler backend to emit control hints (precision, mode) for each layer of a neural network, allowing the hardware to reconfigure at runtime with negligible overhead.
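
Steps 1, 3, and 4 above can be combined into a toy control schedule (every name, the iteration formula, and the per‑layer hints below are hypothetical illustrations; the paper does not specify these exact mappings):

```python
def cordic_iterations(bits: int, mode: str) -> int:
    """Latency (CORDIC iteration count) scales with operand width;
    approximate mode truncates the shift-add schedule for a faster,
    coarser result. The halving rule is an illustrative assumption."""
    assert bits in (4, 8, 16) and mode in ("approx", "accurate")
    full = bits                        # one shift-add pass per operand bit
    return full // 2 if mode == "approx" else full

# Hypothetical compiler-emitted hints: error-tolerant early layers run
# fast and approximate, the final layer runs accurate.
schedule = [
    {"layer": "conv1",  "bits": 8,  "mode": "approx"},
    {"layer": "fc_out", "bits": 16, "mode": "accurate"},
]
latency = {s["layer"]: cordic_iterations(s["bits"], s["mode"]) for s in schedule}
```

Under these assumptions, `conv1` finishes each MAC in 4 iterations while `fc_out` takes the full 16, matching the claim that latency stays proportional to the chosen precision.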

The design was synthesized both as an ASIC macro and as an FPGA overlay (on a Xilinx Pynq‑Z2 board) to validate silicon‑level metrics and real‑world performance.

Results & Findings

| Metric | CORVET (ASIC) | Prior Art (e.g., [Reference]) |
| --- | --- | --- |
| Compute density | 4.83 TOPS/mm² | 3.2 TOPS/mm² |
| Energy efficiency | 11.67 TOPS/W | 7.9 TOPS/W |
| MAC latency reduction | 33 % | n/a |
| Power per MAC | 21 % lower | n/a |
| Throughput (same area) | up to 4× higher | baseline |
| Supported precision | 4/8/16‑bit, mixed‑mode | Fixed 8‑bit |

On the Pynq‑Z2 prototype, a YOLO‑tiny object detector ran at ~45 fps with a ≈0.8 W power envelope, while a ResNet‑18 classifier hit ~70 fps under the same budget—both well above the baseline FPGA implementations.
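
As a back‑of‑envelope check on those prototype figures, the per‑frame energy implied by the reported power envelope and frame rates works out to roughly 18 mJ and 11 mJ:

```python
# Back-of-envelope energy per frame from the reported Pynq-Z2 numbers.
power_w = 0.8                 # reported power envelope (~0.8 W)
yolo_fps, resnet_fps = 45, 70 # reported frame rates

energy_yolo_mj = power_w / yolo_fps * 1e3     # ~17.8 mJ per detection frame
energy_resnet_mj = power_w / resnet_fps * 1e3 # ~11.4 mJ per classified frame
```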

Practical Implications

  • Edge AI Deployments – Devices like smart cameras, wearables, or industrial sensors can now host more sophisticated models (e.g., detection + classification) without exceeding tight power or silicon budgets.
  • Dynamic Accuracy Trade‑offs – Applications that can tolerate occasional approximation (e.g., early‑stage filtering) can run in the fast mode, reserving accurate mode for critical decisions, effectively implementing quality‑of‑service at the hardware level.
  • Scalable Design – The time‑multiplexed PE array lets chip designers scale the engine up or down (e.g., 128‑PE for ultra‑low‑cost chips, 512‑PE for higher‑end edge SoCs) while preserving the same per‑PE efficiency.
  • Simplified Toolchain – By exposing precision/mode hints in the compiler, software teams can target CORVET without hand‑crafting low‑level RTL, accelerating time‑to‑market for AIoT products.
  • Reduced Memory Bandwidth – Integrated pooling/normalisation means fewer off‑chip memory accesses, a common bottleneck in edge accelerators, further cutting energy consumption.

Limitations & Future Work

  • Approximation Accuracy Bounds – The paper provides empirical error analyses for a few networks, but a formal framework for guaranteeing worst‑case error across arbitrary models is missing.
  • ASIC Production Validation – Results are based on post‑layout simulations; silicon tape‑out and real‑world silicon measurements would be needed to confirm the claimed gains.
  • Support for Larger Bit‑Widths – While 4/8/16‑bit covers many edge use cases, emerging quantisation schemes (e.g., 2‑bit or mixed‑int‑float) are not yet addressed.
  • Software Ecosystem – Integration with mainstream AI frameworks (TensorFlow Lite, ONNX Runtime) is only sketched; a full runtime library would ease adoption.

Future research directions include extending the CORDIC MAC to support ultra‑low‑precision (2‑bit) operations, developing a formal error‑propagation model for adaptive precision, and fabricating a silicon prototype to validate the ASIC‑level power/area claims.

Authors

  • Sonu Kumar
  • Mohd Faisal Khan
  • Mukul Lokhande
  • Santosh Kumar Vishvakarma

Paper Information

  • arXiv ID: 2602.19268v1
  • Categories: cs.AR, cs.AI, cs.CV, cs.NE, eess.IV
  • Published: February 22, 2026