[Paper] SigmaQuant: Hardware-Aware Heterogeneous Quantization Method for Edge DNN Inference
Source: arXiv - 2602.22136v1
Overview
Edge‑AI is hitting a wall: powerful DNNs demand more memory, energy, and compute than tiny devices can spare. The paper SigmaQuant: Hardware‑Aware Heterogeneous Quantization Method for Edge DNN Inference proposes a new way to shrink models without the usual accuracy hit, by automatically assigning the right number of bits to each layer based on the target hardware constraints.
Key Contributions
- SigmaQuant framework – a fast, hardware‑aware algorithm that decides per‑layer bitwidths (heterogeneous quantization) without exhaustive brute‑force search.
- Hardware‑driven cost model – integrates memory, energy, and latency budgets directly into the quantization decision process.
- Layer‑sensitivity analysis – quantifies how much each layer tolerates low‑precision without harming overall accuracy, guiding the bitwidth allocation.
- Empirical validation on multiple edge platforms (e.g., ARM Cortex‑M, Qualcomm Snapdragon) showing up to 2–4× reduction in memory/energy while keeping <1 % top‑1 accuracy loss compared with full‑precision models.
- Open‑source implementation (Python + TensorFlow/PyTorch wrappers) that can be plugged into existing model‑compression pipelines.
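The hardware‑driven cost model can be pictured as a per‑layer, per‑bitwidth lookup table built from profiling data. A minimal sketch of that idea (the class and field names are illustrative assumptions, not the paper's actual API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LayerCost:
    """Profiled cost of running one layer at one bitwidth."""
    memory_bytes: int   # weight storage at this precision
    energy_uj: float    # energy per inference pass through the layer
    latency_us: float   # measured or estimated latency

def total_cost(bitwidths, cost_table):
    """Sum per-layer costs for a candidate bitwidth assignment.

    bitwidths:  dict mapping layer_name -> bits
    cost_table: dict mapping (layer_name, bits) -> LayerCost
    """
    mem = sum(cost_table[(l, b)].memory_bytes for l, b in bitwidths.items())
    energy = sum(cost_table[(l, b)].energy_uj for l, b in bitwidths.items())
    latency = sum(cost_table[(l, b)].latency_us for l, b in bitwidths.items())
    return mem, energy, latency
```

Any candidate bitwidth assignment can then be checked against the memory, energy, and latency budgets with one table lookup per layer, which is what makes the search fast.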
Methodology
- Profiling the target hardware – The authors first measure or estimate three key metrics for each possible bitwidth (2‑8 bits): memory footprint, energy per MAC, and latency.
- Layer sensitivity scoring – Using a small calibration dataset, they run a quick forward pass with mixed‑precision candidates and compute the change in loss per bit reduction. Layers whose quantization causes a large loss increase are marked “sensitive.”
- Optimization loop – Starting from a uniform low‑bitwidth baseline, SigmaQuant greedily upgrades the most sensitive layers (i.e., assigns them a higher bitwidth) for as long as the hardware budget permits. The loop stops when any further upgrade would violate the memory, energy, or latency limits.
- Fine‑tuning – After the bitwidth map is fixed, the network undergoes a short mixed‑precision fine‑tuning phase (typically 5–10 epochs) to recover any residual accuracy loss.
- Deployment wrapper – The final quantized model is exported in a format compatible with popular edge runtimes (e.g., TensorFlow Lite, ONNX Runtime), with per‑layer quantization parameters embedded.
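The sensitivity‑scoring step above amounts to measuring, on a calibration set, how much the loss grows when a single layer is quantized while the rest stay at full precision. A framework‑free sketch of that measurement (the `eval_loss` callable stands in for a forward pass over the calibration data; it is an assumption for illustration, not the paper's interface):

```python
def layer_sensitivity(eval_loss, layers, low_bits=4):
    """Score each layer by the loss increase caused by quantizing it alone.

    eval_loss: callable taking a dict {layer: bits or None}, where None means
               full precision; returns calibration loss for that configuration.
    Returns a dict layer -> loss increase (higher = more sensitive).
    """
    # Reference loss with every layer at full precision.
    base = eval_loss({l: None for l in layers})
    scores = {}
    for l in layers:
        cfg = {k: None for k in layers}
        cfg[l] = low_bits          # quantize only this layer
        scores[l] = eval_loss(cfg) - base
    return scores
```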
The whole pipeline runs in minutes on a workstation, a stark contrast to prior methods that required hours of exhaustive search or reinforcement‑learning‑based exploration.
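The greedy optimization loop can be sketched as follows. This is a simplified illustration under assumed inputs (a single memory budget rather than the paper's joint memory/energy/latency constraints, and a per‑layer cost callable), not the authors' implementation:

```python
def greedy_bitwidth_allocation(layers, sensitivity, cost, budget,
                               bit_choices=(2, 4, 8)):
    """Greedily assign per-layer bitwidths under a hardware budget.

    layers:      list of layer names
    sensitivity: dict layer -> loss increase when quantized (higher = more sensitive)
    cost:        callable (layer, bits) -> cost of that layer at that bitwidth
    budget:      total budget the summed cost must not exceed
    Returns a dict layer -> assigned bitwidth.
    """
    # Uniform low-bitwidth starting point.
    bits = {l: min(bit_choices) for l in layers}
    while True:
        # Consider layers not yet at maximum precision, most sensitive first.
        upgradable = sorted((l for l in layers if bits[l] < max(bit_choices)),
                            key=lambda l: sensitivity[l], reverse=True)
        upgraded = False
        for l in upgradable:
            next_b = min(b for b in bit_choices if b > bits[l])
            trial = dict(bits, **{l: next_b})
            if sum(cost(k, v) for k, v in trial.items()) <= budget:
                bits = trial   # upgrade sticks; rescan from the top
                upgraded = True
                break
        if not upgraded:
            # Every remaining upgrade would violate the budget.
            return bits
```

With two layers where "A" is far more sensitive than "B" and the budget only covers one full upgrade, the loop spends the precision budget on "A" and leaves "B" at the low‑bit floor, which is the qualitative behavior the paper describes.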
Results & Findings
| Model (Dataset) | Baseline FP32 Acc. | Uniform 4‑bit Acc. | SigmaQuant (mixed) Acc. | Memory ↓ | Energy ↓ | Latency ↓ |
|---|---|---|---|---|---|---|
| MobileNet‑V2 (ImageNet) | 71.8 % | 68.3 % | 71.1 % | 3.2× | 2.9× | 2.5× |
| ResNet‑18 (CIFAR‑10) | 93.2 % | 90.5 % | 92.8 % | 2.8× | 2.6× | 2.3× |
| TinyYOLO (COCO) | 41.5 % mAP | 37.0 % | 40.8 % | 3.5× | 3.1× | 2.8× |
Key takeaways
- Accuracy preservation: Heterogeneous quantization recovers most of the accuracy lost by uniform low‑bit quantization, often within 0.5 % of the full‑precision baseline.
- Resource gains: Memory, energy, and latency reductions are consistently above 2×, meeting typical edge constraints (e.g., <1 MB model size, <10 ms inference).
- Speed of search: SigmaQuant finds a near‑optimal bitwidth schedule in < 10 minutes, compared to > 4 hours for grid‑search baselines.
Practical Implications
- Faster time‑to‑market for edge AI products – Engineers can plug SigmaQuant into their CI/CD pipelines and automatically generate hardware‑specific models without manual trial‑and‑error.
- Battery‑life extensions – By lowering per‑operation energy, devices such as wearables, drones, or IoT cameras can run inference longer on a single charge.
- Scalable across heterogeneous hardware – The cost model can be calibrated for any SoC, making the same codebase work for low‑end microcontrollers and high‑end mobile CPUs alike.
- Enables ultra‑low‑bit deployments – Developers can now consider 2‑bit or 3‑bit quantization for non‑critical layers, opening the door to sub‑megabyte DNNs for truly constrained devices.
- Compatibility with existing toolchains – Since the output follows TensorFlow Lite/ONNX standards, existing runtimes can immediately take advantage of the mixed‑precision model without custom kernels.
Limitations & Future Work
- Calibration data requirement – The sensitivity analysis needs a small, representative dataset; performance may degrade if the calibration set is not well‑matched to the deployment domain.
- Static hardware profiling – The current cost model assumes fixed hardware characteristics; dynamic voltage/frequency scaling or runtime thermal throttling are not yet accounted for.
- Limited to feed‑forward CNNs – Experiments focus on vision models; applying SigmaQuant to transformers, RNNs, or graph networks will require additional layer‑type handling.
- Future directions mentioned by the authors include:
  - Extending the optimizer to a multi‑objective formulation (e.g., jointly minimizing latency and energy).
  - Integrating reinforcement learning to adapt bitwidths on the fly under runtime‑varying constraints.
  - Open‑sourcing a hardware‑agnostic profiler that can auto‑extract the cost model from any edge device.
Authors
- Qunyou Liu
- Pengbo Yu
- Marina Zapater
- David Atienza
Paper Information
- arXiv ID: 2602.22136v1
- Categories: cs.LG, cs.AR
- Published: February 25, 2026