[Paper] SigmaQuant: Hardware-Aware Heterogeneous Quantization Method for Edge DNN Inference
Source: arXiv - 2602.22136v1
Overview
Edge‑AI is hitting a wall: powerful DNNs demand more memory, energy, and compute than tiny devices can spare. The paper SigmaQuant: Hardware‑Aware Heterogeneous Quantization Method for Edge DNN Inference proposes a new way to shrink models without the usual accuracy hit, by automatically assigning the right number of bits to each layer based on the target hardware constraints.
Key Contributions
- SigmaQuant framework – a fast, hardware‑aware algorithm that decides per‑layer bitwidths (heterogeneous quantization) without exhaustive brute‑force search.
- Hardware‑driven cost model – integrates memory, energy, and latency budgets directly into the quantization decision process.
- Layer‑sensitivity analysis – quantifies how much each layer tolerates low‑precision without harming overall accuracy, guiding the bitwidth allocation.
- Empirical validation on multiple edge platforms (e.g., ARM Cortex‑M, Qualcomm Snapdragon) showing up to 2–4× reduction in memory/energy while keeping <1 % top‑1 accuracy loss compared with full‑precision models.
- Open‑source implementation (Python + TensorFlow/PyTorch wrappers) that can be plugged into existing model‑compression pipelines.
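The hardware‑driven cost model can be pictured as a per‑layer, per‑bitwidth lookup table built from profiling data. A minimal sketch of that idea (the class and field names are illustrative assumptions, not the paper's actual API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LayerCost:
    """Profiled cost of running one layer at one bitwidth."""
    memory_bytes: int   # weight storage at this precision
    energy_uj: float    # energy per inference pass through the layer
    latency_us: float   # measured or estimated latency

def total_cost(bitwidths, cost_table):
    """Sum per-layer costs for a candidate bitwidth assignment.

    bitwidths:  dict mapping layer_name -> bits
    cost_table: dict mapping (layer_name, bits) -> LayerCost
    """
    mem = sum(cost_table[(l, b)].memory_bytes for l, b in bitwidths.items())
    energy = sum(cost_table[(l, b)].energy_uj for l, b in bitwidths.items())
    latency = sum(cost_table[(l, b)].latency_us for l, b in bitwidths.items())
    return mem, energy, latency
```

Any candidate bitwidth assignment can then be checked against the memory, energy, and latency budgets with one table lookup per layer, which is what makes the search fast.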
Methodology
- Profiling the target hardware – The authors first measure or estimate three key metrics for each possible bitwidth (2‑8 bits): memory footprint, energy per MAC, and latency.
- Layer sensitivity scoring – Using a small calibration dataset, they run a quick forward pass with mixed‑precision candidates and compute the change in loss per bit reduction. Layers whose quantization causes a large loss increase are marked “sensitive.”
- Optimization loop – Starting from a uniform low‑bitwidth baseline, SigmaQuant greedily upgrades the most sensitive layers (i.e., assigns them a higher bitwidth) for as long as the hardware budget permits. The loop stops when any further upgrade would violate the memory, energy, or latency limits.
- Fine‑tuning – After the bitwidth map is fixed, the network undergoes a short mixed‑precision fine‑tuning phase (typically 5–10 epochs) to recover any residual accuracy loss.
- Deployment wrapper – The final quantized model is exported in a format compatible with popular edge runtimes (e.g., TensorFlow Lite, ONNX Runtime), with per‑layer quantization parameters embedded.
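The sensitivity‑scoring step above amounts to measuring, on a calibration set, how much the loss grows when a single layer is quantized while the rest stay at full precision. A framework‑free sketch of that measurement (the `eval_loss` callable stands in for a forward pass over the calibration data; it is an assumption for illustration, not the paper's interface):

```python
def layer_sensitivity(eval_loss, layers, low_bits=4):
    """Score each layer by the loss increase caused by quantizing it alone.

    eval_loss: callable taking a dict {layer: bits or None}, where None means
               full precision; returns calibration loss for that configuration.
    Returns a dict layer -> loss increase (higher = more sensitive).
    """
    # Reference loss with every layer at full precision.
    base = eval_loss({l: None for l in layers})
    scores = {}
    for l in layers:
        cfg = {k: None for k in layers}
        cfg[l] = low_bits          # quantize only this layer
        scores[l] = eval_loss(cfg) - base
    return scores
```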
The whole pipeline runs in minutes on a workstation, a stark contrast to prior methods that required hours of exhaustive search or reinforcement‑learning‑based exploration.
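The greedy optimization loop can be sketched as follows. This is a simplified illustration under assumed inputs (a single memory budget rather than the paper's joint memory/energy/latency constraints, and a per‑layer cost callable), not the authors' implementation:

```python
def greedy_bitwidth_allocation(layers, sensitivity, cost, budget,
                               bit_choices=(2, 4, 8)):
    """Greedily assign per-layer bitwidths under a hardware budget.

    layers:      list of layer names
    sensitivity: dict layer -> loss increase when quantized (higher = more sensitive)
    cost:        callable (layer, bits) -> cost of that layer at that bitwidth
    budget:      total budget the summed cost must not exceed
    Returns a dict layer -> assigned bitwidth.
    """
    # Uniform low-bitwidth starting point.
    bits = {l: min(bit_choices) for l in layers}
    while True:
        # Consider layers not yet at maximum precision, most sensitive first.
        upgradable = sorted((l for l in layers if bits[l] < max(bit_choices)),
                            key=lambda l: sensitivity[l], reverse=True)
        upgraded = False
        for l in upgradable:
            next_b = min(b for b in bit_choices if b > bits[l])
            trial = dict(bits, **{l: next_b})
            if sum(cost(k, v) for k, v in trial.items()) <= budget:
                bits = trial   # upgrade sticks; rescan from the top
                upgraded = True
                break
        if not upgraded:
            # Every remaining upgrade would violate the budget.
            return bits
```

With two layers where "A" is far more sensitive than "B" and the budget only covers one full upgrade, the loop spends the precision budget on "A" and leaves "B" at the low‑bit floor, which is the qualitative behavior the paper describes.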
Results & Findings
| Model (Dataset) | Baseline FP32 Acc. | Uniform 4‑bit Acc. | SigmaQuant (mixed) Acc. | Memory ↓ | Energy ↓ | Latency ↓ |
|---|---|---|---|---|---|---|
| MobileNet‑V2 (ImageNet) | 71.8 % | 68.3 % | 71.1 % | 3.2× | 2.9× | 2.5× |
| ResNet‑18 (CIFAR‑10) | 93.2 % | 90.5 % | 92.8 % | 2.8× | 2.6× | 2.3× |
| TinyYOLO (COCO) | 41.5 % mAP | 37.0 % | 40.8 % | 3.5× | 3.1× | 2.8× |
Key takeaways
- Accuracy preservation: Heterogeneous quantization recovers most of the accuracy lost by uniform low‑bit quantization, often within 0.5 % of the full‑precision baseline.
- Resource gains: Memory, energy, and latency reductions are consistently above 2×, meeting typical edge constraints (e.g., <1 MB model size, <10 ms inference).
- Speed of search: SigmaQuant finds a near‑optimal bitwidth schedule in < 10 minutes, compared to > 4 hours for grid‑search baselines.
Practical Implications
- Faster time‑to‑market for edge AI products – Engineers can plug SigmaQuant into their CI/CD pipelines and automatically generate hardware‑specific models without manual trial‑and‑error.
- Battery‑life extensions – By lowering per‑operation energy, devices such as wearables, drones, or IoT cameras can run inference longer on a single charge.
- Scalable across heterogeneous hardware – The cost model can be calibrated for any SoC, making the same codebase work for low‑end microcontrollers and high‑end mobile CPUs alike.
- Enables ultra‑low‑bit deployments – Developers can now consider 2‑bit or 3‑bit quantization for non‑critical layers, opening the door to sub‑megabyte DNNs for truly constrained devices.
- Compatibility with existing toolchains – Since the output follows TensorFlow Lite/ONNX standards, existing runtimes can immediately take advantage of the mixed‑precision model without custom kernels.
Limitations & Future Work
- Calibration data requirement – The sensitivity analysis needs a small, representative dataset; performance may degrade if the calibration set is not well‑matched to the deployment domain.
- Static hardware profiling – The current cost model assumes fixed hardware characteristics; dynamic voltage/frequency scaling or runtime thermal throttling are not yet accounted for.
- Limited to feed‑forward CNNs – Experiments focus on vision models; applying SigmaQuant to transformers, RNNs, or graph networks will require additional layer‑type handling.
- Future directions mentioned by the authors include:
  - Extending the optimizer to a multi‑objective formulation (e.g., jointly minimizing latency and energy).
  - Integrating reinforcement learning to adapt bitwidths on the fly under runtime‑varying constraints.
  - Open‑sourcing a hardware‑agnostic profiler that can auto‑extract the cost model from any edge device.
Authors
- Qunyou Liu
- Pengbo Yu
- Marina Zapater
- David Atienza
Paper Information
- arXiv ID: 2602.22136v1
- Categories: cs.LG, cs.AR
- Published: February 25, 2026