[Paper] CORVET: A CORDIC-Powered, Resource-Frugal Mixed-Precision Vector Processing Engine for High-Throughput AIoT Applications
Source: arXiv - 2602.19268v1
Overview
A new paper introduces CORVET, a mixed‑precision vector processing engine that uses a CORDIC‑based multiply‑accumulate (MAC) unit to deliver high‑throughput AI inference on ultra‑low‑resource edge devices. By dynamically switching between approximate and accurate computation modes, CORVET squeezes up to 4× more operations per second out of the same silicon area, making it a strong candidate for Artificial Intelligence of Things (AIoT) workloads such as object detection and classification.
Key Contributions
- CORDIC‑powered MAC: An iterative, resource‑frugal MAC that cuts latency by up to 33 % and power by 21 % compared with conventional multipliers.
- Runtime‑adaptive precision: Supports 4‑, 8‑, and 16‑bit data widths and can toggle between approximate (fast, low‑accuracy) and accurate (slow, high‑accuracy) modes on the fly.
- Time‑multiplexed vector engine: A 256‑PE (processing element) array that reuses hardware across vector lanes, achieving 4.83 TOPS/mm² compute density and 11.67 TOPS/W energy efficiency.
- Lightweight pooling & normalization block: Integrated post‑processing that avoids extra memory traffic and keeps the data path tight.
- Hardware‑software co‑design flow: Demonstrated on a Pynq‑Z2 FPGA platform for real‑world object detection/classification pipelines, showing end‑to‑end scalability.
Methodology
The authors built a mixed‑precision vector engine around a CORDIC (COordinate Rotation DIgital Computer) unit, which computes multiplications through a series of shift‑add iterations rather than full‑width multipliers. This yields a smaller, more power‑efficient MAC cell.
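To make the shift‑add idea concrete, here is a minimal Python sketch of linear‑mode CORDIC multiplication (our illustration, not the authors' RTL). It assumes one operand is normalized to |b| ≤ 1; each iteration uses only a shift (the `2**-i` scaling) and an add, and the error bound shrinks by a factor of two per iteration:

```python
def cordic_mul(a, b, iterations=16):
    """Approximate a * b with linear-mode CORDIC: only shifts and adds.

    Requires |b| <= 1. After n iterations the error is bounded by
    |a| * 2**-(n - 1), so fewer iterations trade accuracy for latency --
    exactly the knob an approximate mode can turn.
    """
    acc, z = 0.0, b
    for i in range(iterations):
        step = 2.0 ** -i          # in hardware: an arithmetic shift by i bits
        if z >= 0:                # steer the residual z toward zero
            acc += a * step       # in hardware: add a shifted copy of a
            z -= step
        else:
            acc -= a * step
            z += step
    return acc

print(cordic_mul(3.0, 0.75))      # close to 3 * 0.75 = 2.25
print(cordic_mul(3.0, 0.75, 4))   # coarser: within 3 * 2**-3 of 2.25
```

The second call shows why truncating the iteration schedule is a natural approximation mode: latency drops linearly with the iteration count while the worst‑case error grows by a known power of two.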
Key architectural tricks
- Dynamic Mode Switching – A control FSM selects either an approximate CORDIC configuration (fewer iterations, lower latency) or a full‑accuracy configuration (more iterations) based on the current layer’s tolerance to error.
- Vectorisation & Time‑Multiplexing – A single MAC array is shared across multiple vector lanes; the engine cycles through lanes each clock, effectively multiplying throughput without replicating hardware.
- Precision Scaling – Input operands are quantized to 4/8/16 bits on the fly; the CORDIC pipeline automatically adapts its shift‑add schedule to the chosen bit‑width, keeping latency proportional to precision.
- Co‑design with Software – The authors extended a compiler backend to emit control hints (precision, mode) for each layer of a neural network, allowing the hardware to reconfigure at runtime with negligible overhead.
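The mode‑switching and precision‑scaling tricks above can be sketched as a simple per‑layer decision. The thresholds and the half‑length approximate schedule below are our assumptions for illustration, not values or logic from the paper:

```python
def select_config(error_tolerance, bits):
    """Pick a CORDIC configuration for one network layer.

    Accurate mode runs one shift-add iteration per operand bit, so
    latency is proportional to precision; approximate mode truncates
    the schedule to roughly half the iterations. The 2**-(bits // 2)
    threshold is an illustrative assumption, not the paper's FSM rule.
    """
    if error_tolerance > 2.0 ** -(bits // 2):
        return {"mode": "approximate", "iterations": bits // 2}
    return {"mode": "accurate", "iterations": bits}

# Early layers often tolerate more error than the final classifier layer,
# so a compiler could emit hints like these per layer:
print(select_config(0.1, 8))    # tolerant layer -> approximate, 4 iterations
print(select_config(1e-4, 8))   # strict layer   -> accurate, 8 iterations
```

In the real design these hints would be folded into the control stream the compiler backend emits, letting the FSM reconfigure the pipeline between layers.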
The design was synthesized both as an ASIC macro and as an FPGA overlay (on a Xilinx Pynq‑Z2 board) to validate silicon‑level metrics and real‑world performance.
Results & Findings
| Metric | CORVET (ASIC) | Prior Art (e.g., [Reference]) |
|---|---|---|
| Compute density | 4.83 TOPS/mm² | 3.2 TOPS/mm² |
| Energy efficiency | 11.67 TOPS/W | 7.9 TOPS/W |
| MAC latency reduction | 33 % | – |
| Power per MAC | 21 % lower | – |
| Throughput (same area) | 4× higher | – |
| Supported precision | 4/8/16 bit, mixed‑mode | Fixed 8‑bit |
On the Pynq‑Z2 prototype, a YOLO‑tiny object detector ran at ~45 fps within a ~0.8 W power envelope, while a ResNet‑18 classifier hit ~70 fps under the same budget—both well above the baseline FPGA implementations.
Practical Implications
- Edge AI Deployments – Devices like smart cameras, wearables, or industrial sensors can now host more sophisticated models (e.g., detection + classification) without exceeding tight power or silicon budgets.
- Dynamic Accuracy Trade‑offs – Applications that can tolerate occasional approximation (e.g., early‑stage filtering) can run in the fast mode, reserving accurate mode for critical decisions, effectively implementing quality‑of‑service at the hardware level.
- Scalable Design – The time‑multiplexed PE array lets chip designers scale the engine up or down (e.g., 128‑PE for ultra‑low‑cost chips, 512‑PE for higher‑end edge SoCs) while preserving the same per‑PE efficiency.
- Simplified Toolchain – By exposing precision/mode hints in the compiler, software teams can target CORVET without hand‑crafting low‑level RTL, accelerating time‑to‑market for AIoT products.
- Reduced Memory Bandwidth – Integrated pooling/normalisation means fewer off‑chip memory accesses, a common bottleneck in edge accelerators, further cutting energy consumption.
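To make the bandwidth point concrete, here is a toy sketch (our illustration, not the paper's block design) of fusing 2×2 max‑pooling into the datapath: pooled rows are produced as MAC outputs stream by, so the full‑resolution feature map is never written to memory:

```python
def fused_conv_maxpool(rows):
    """Consume feature-map rows as they stream out of the MAC array and
    yield 2x2 max-pooled rows, without materialising the full map.

    rows: iterable of equal-length lists (one feature-map row each).
    """
    it = iter(rows)
    for top in it:
        bottom = next(it, None)   # pair up consecutive rows
        if bottom is None:
            break                 # odd trailing row: nothing to pool
        yield [max(top[i], top[i + 1], bottom[i], bottom[i + 1])
               for i in range(0, len(top) - 1, 2)]

pooled = list(fused_conv_maxpool([[1, 3, 2, 0],
                                  [4, 1, 0, 5]]))
print(pooled)   # [[4, 5]]
```

The streaming structure is the point: only two rows of buffering are needed, versus a round trip to off‑chip memory for the whole pre‑pooling feature map.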
Limitations & Future Work
- Approximation Accuracy Bounds – The paper provides empirical error analyses for a few networks, but a formal framework for guaranteeing worst‑case error across arbitrary models is missing.
- ASIC Production Validation – Results are based on post‑layout simulations; silicon tape‑out and real‑world silicon measurements would be needed to confirm the claimed gains.
- Support for Larger Bit‑Widths – While 4/8/16‑bit covers many edge use cases, emerging quantisation schemes (e.g., 2‑bit or mixed‑int‑float) are not yet addressed.
- Software Ecosystem – Integration with mainstream AI frameworks (TensorFlow Lite, ONNX Runtime) is only sketched; a full runtime library would ease adoption.
Future research directions include extending the CORDIC MAC to support ultra‑low‑precision (2‑bit) operations, developing a formal error‑propagation model for adaptive precision, and fabricating a silicon prototype to validate the ASIC‑level power/area claims.
Authors
- Sonu Kumar
- Mohd Faisal Khan
- Mukul Lokhande
- Santosh Kumar Vishvakarma
Paper Information
- arXiv ID: 2602.19268v1
- Categories: cs.AR, cs.AI, cs.CV, cs.NE, eess.IV
- Published: February 22, 2026