[Paper] CORVET: A CORDIC-Powered, Resource-Frugal Mixed-Precision Vector Processing Engine for High-Throughput AIoT Applications
Source: arXiv - 2602.19268v1
Overview
A new paper introduces CORVET, a mixed‑precision vector processing engine that uses a CORDIC‑based multiply‑accumulate (MAC) unit to deliver high‑throughput AI inference on ultra‑low‑resource edge devices. By dynamically switching between approximate and accurate computation modes, CORVET squeezes up to 4× more operations per second out of the same silicon area, making it a strong candidate for Artificial Intelligence of Things (AIoT) workloads such as object detection and classification.
Key Contributions
- CORDIC‑powered MAC: An iterative, resource‑frugal MAC that cuts latency by up to 33 % and power by 21 % compared with conventional multipliers.
- Runtime‑adaptive precision: Supports 4‑, 8‑, and 16‑bit data widths and can toggle between approximate (fast, low‑accuracy) and accurate (slow, high‑accuracy) modes on the fly.
- Time‑multiplexed vector engine: A 256‑PE (processing element) array that reuses hardware across vector lanes, achieving 4.83 TOPS/mm² compute density and 11.67 TOPS/W energy efficiency.
- Lightweight pooling & normalization block: Integrated post‑processing that avoids extra memory traffic and keeps the data path tight.
- Hardware‑software co‑design flow: Demonstrated on a Pynq‑Z2 FPGA platform for real‑world object detection/classification pipelines, showing end‑to‑end scalability.
Methodology
The authors built a mixed‑precision vector engine around a CORDIC (COordinate Rotation DIgital Computer) unit, which computes multiplications through a series of shift‑add iterations rather than full‑width multipliers. This yields a smaller, more power‑efficient MAC cell.
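To make the shift‑add idea concrete, here is a minimal Python sketch of linear‑mode CORDIC multiplication (our illustration, not the authors' RTL). It assumes one operand is normalized to |b| ≤ 1; each iteration uses only a shift (the `2**-i` scaling) and an add, and the error bound shrinks by a factor of two per iteration:

```python
def cordic_mul(a, b, iterations=16):
    """Approximate a * b with linear-mode CORDIC: only shifts and adds.

    Requires |b| <= 1. After n iterations the error is bounded by
    |a| * 2**-(n - 1), so fewer iterations trade accuracy for latency --
    exactly the knob an approximate mode can turn.
    """
    acc, z = 0.0, b
    for i in range(iterations):
        step = 2.0 ** -i          # in hardware: an arithmetic shift by i bits
        if z >= 0:                # steer the residual z toward zero
            acc += a * step       # in hardware: add a shifted copy of a
            z -= step
        else:
            acc -= a * step
            z += step
    return acc

print(cordic_mul(3.0, 0.75))      # close to 3 * 0.75 = 2.25
print(cordic_mul(3.0, 0.75, 4))   # coarser: within 3 * 2**-3 of 2.25
```

The second call shows why truncating the iteration schedule is a natural approximation mode: latency drops linearly with the iteration count while the worst‑case error grows by a known power of two.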
Key architectural tricks
- Dynamic Mode Switching – A control FSM selects either an approximate CORDIC configuration (fewer iterations, lower latency) or a full‑accuracy configuration (more iterations) based on the current layer’s tolerance to error.
- Vectorisation & Time‑Multiplexing – A single MAC array is shared across multiple vector lanes; the engine cycles through lanes each clock, effectively multiplying throughput without replicating hardware.
- Precision Scaling – Input operands are quantized to 4/8/16 bits on the fly; the CORDIC pipeline automatically adapts its shift‑add schedule to the chosen bit‑width, keeping latency proportional to precision.
- Co‑design with Software – The authors extended a compiler backend to emit control hints (precision, mode) for each layer of a neural network, allowing the hardware to reconfigure at runtime with negligible overhead.
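The mode‑switching and precision‑scaling tricks above can be sketched as a simple per‑layer decision. The thresholds and the half‑length approximate schedule below are our assumptions for illustration, not values or logic from the paper:

```python
def select_config(error_tolerance, bits):
    """Pick a CORDIC configuration for one network layer.

    Accurate mode runs one shift-add iteration per operand bit, so
    latency is proportional to precision; approximate mode truncates
    the schedule to roughly half the iterations. The 2**-(bits // 2)
    threshold is an illustrative assumption, not the paper's FSM rule.
    """
    if error_tolerance > 2.0 ** -(bits // 2):
        return {"mode": "approximate", "iterations": bits // 2}
    return {"mode": "accurate", "iterations": bits}

# Early layers often tolerate more error than the final classifier layer,
# so a compiler could emit hints like these per layer:
print(select_config(0.1, 8))    # tolerant layer -> approximate, 4 iterations
print(select_config(1e-4, 8))   # strict layer   -> accurate, 8 iterations
```

In the real design these hints would be folded into the control stream the compiler backend emits, letting the FSM reconfigure the pipeline between layers.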
The design was synthesized both as an ASIC macro and as an FPGA overlay (on a Xilinx Pynq‑Z2 board) to validate silicon‑level metrics and real‑world performance.
Results & Findings
| Metric | CORVET (ASIC) | Prior Art (e.g., [Reference]) |
|---|---|---|
| Compute density | 4.83 TOPS/mm² | 3.2 TOPS/mm² |
| Energy efficiency | 11.67 TOPS/W | 7.9 TOPS/W |
| MAC latency reduction | 33 % | – |
| Power per MAC | 21 % lower | – |
| Throughput (same area) | 4× higher | – |
| Supported precision | 4/8/16 bit, mixed‑mode | Fixed 8‑bit |
On the Pynq‑Z2 prototype, a YOLO‑tiny object detector ran at ~45 fps within a ~0.8 W power envelope, while a ResNet‑18 classifier hit ~70 fps under the same budget—both well above the baseline FPGA implementations.
Practical Implications
- Edge AI Deployments – Devices like smart cameras, wearables, or industrial sensors can now host more sophisticated models (e.g., detection + classification) without exceeding tight power or silicon budgets.
- Dynamic Accuracy Trade‑offs – Applications that can tolerate occasional approximation (e.g., early‑stage filtering) can run in the fast mode, reserving accurate mode for critical decisions, effectively implementing quality‑of‑service at the hardware level.
- Scalable Design – The time‑multiplexed PE array lets chip designers scale the engine up or down (e.g., 128‑PE for ultra‑low‑cost chips, 512‑PE for higher‑end edge SoCs) while preserving the same per‑PE efficiency.
- Simplified Toolchain – By exposing precision/mode hints in the compiler, software teams can target CORVET without hand‑crafting low‑level RTL, accelerating time‑to‑market for AIoT products.
- Reduced Memory Bandwidth – Integrated pooling/normalisation means fewer off‑chip memory accesses, a common bottleneck in edge accelerators, further cutting energy consumption.
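To make the bandwidth point concrete, here is a toy sketch (our illustration, not the paper's block design) of fusing 2×2 max‑pooling into the datapath: pooled rows are produced as MAC outputs stream by, so the full‑resolution feature map is never written to memory:

```python
def fused_conv_maxpool(rows):
    """Consume feature-map rows as they stream out of the MAC array and
    yield 2x2 max-pooled rows, without materialising the full map.

    rows: iterable of equal-length lists (one feature-map row each).
    """
    it = iter(rows)
    for top in it:
        bottom = next(it, None)   # pair up consecutive rows
        if bottom is None:
            break                 # odd trailing row: nothing to pool
        yield [max(top[i], top[i + 1], bottom[i], bottom[i + 1])
               for i in range(0, len(top) - 1, 2)]

pooled = list(fused_conv_maxpool([[1, 3, 2, 0],
                                  [4, 1, 0, 5]]))
print(pooled)   # [[4, 5]]
```

The streaming structure is the point: only two rows of buffering are needed, versus a round trip to off‑chip memory for the whole pre‑pooling feature map.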
Limitations & Future Work
- Approximation Accuracy Bounds – The paper provides empirical error analyses for a few networks, but a formal framework for guaranteeing worst‑case error across arbitrary models is missing.
- ASIC Production Validation – Results are based on post‑layout simulations; silicon tape‑out and real‑world silicon measurements would be needed to confirm the claimed gains.
- Support for Larger Bit‑Widths – While 4/8/16‑bit covers many edge use cases, emerging quantisation schemes (e.g., 2‑bit or mixed‑int‑float) are not yet addressed.
- Software Ecosystem – Integration with mainstream AI frameworks (TensorFlow Lite, ONNX Runtime) is only sketched; a full runtime library would ease adoption.
Future research directions include extending the CORDIC MAC to support ultra‑low‑precision (2‑bit) operations, developing a formal error‑propagation model for adaptive precision, and fabricating a silicon prototype to validate the ASIC‑level power/area claims.
Authors
- Sonu Kumar
- Mohd Faisal Khan
- Mukul Lokhande
- Santosh Kumar Vishvakarma
Paper Information
- arXiv ID: 2602.19268v1
- Categories: cs.AR, cs.AI, cs.CV, cs.NE, eess.IV
- Published: February 22, 2026