[Paper] SigmaQuant: Hardware-Aware Heterogeneous Quantization Method for Edge DNN Inference

Published: February 25, 2026 at 12:34 PM EST
4 min read
Source: arXiv

Overview

Edge‑AI is hitting a wall: powerful DNNs demand more memory, energy, and compute than tiny devices can spare. The paper SigmaQuant: Hardware‑Aware Heterogeneous Quantization Method for Edge DNN Inference proposes a way to shrink models without the usual accuracy hit, by automatically assigning the right number of bits to each layer based on the constraints of the target hardware.

Key Contributions

  • SigmaQuant framework – a fast, hardware‑aware algorithm that decides per‑layer bitwidths (heterogeneous quantization) without exhaustive brute‑force search.
  • Hardware‑driven cost model – integrates memory, energy, and latency budgets directly into the quantization decision process.
  • Layer‑sensitivity analysis – quantifies how much each layer tolerates low‑precision without harming overall accuracy, guiding the bitwidth allocation.
  • Empirical validation on multiple edge platforms (e.g., ARM Cortex‑M, Qualcomm Snapdragon) showing up to 2–4× reduction in memory/energy while keeping <1 % top‑1 accuracy loss compared with full‑precision models.
  • Open‑source implementation (Python + TensorFlow/PyTorch wrappers) that can be plugged into existing model‑compression pipelines.
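The hardware‑driven cost model can be pictured as a simple lookup from (layer, bitwidth) to resource estimates. Below is a minimal sketch of that idea; the function name, linear bit‑scaling assumption, and all profiling numbers are illustrative placeholders, not values from the paper.

```python
# Sketch of a hardware-driven cost model: estimate a layer's memory,
# energy, and latency footprint at a candidate bitwidth (2-8 bits).
from dataclasses import dataclass

@dataclass
class LayerCost:
    memory_bytes: int    # parameter storage at this bitwidth
    energy_nj: float     # estimated energy for one inference pass
    latency_us: float    # estimated latency contribution

def layer_cost(num_params: int, num_macs: int, bits: int,
               energy_per_mac_nj: float, us_per_mac: float) -> LayerCost:
    """Estimate per-layer cost; assumes energy/latency scale linearly in bits."""
    return LayerCost(
        memory_bytes=(num_params * bits + 7) // 8,          # round up to bytes
        energy_nj=num_macs * energy_per_mac_nj * (bits / 8),
        latency_us=num_macs * us_per_mac * (bits / 8),
    )

# Illustrative layer: 1k parameters, 50k MACs, quantized to 4 bits.
cost = layer_cost(num_params=1_000, num_macs=50_000, bits=4,
                  energy_per_mac_nj=0.2, us_per_mac=0.001)
print(cost.memory_bytes)  # 500
```

In a real profile these per‑MAC numbers would come from measurements on the target SoC rather than a linear model.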

Methodology

  1. Profiling the target hardware – The authors first measure or estimate three key metrics for each possible bitwidth (2‑8 bits): memory footprint, energy per MAC, and latency.
  2. Layer sensitivity scoring – Using a small calibration dataset, they run quick forward passes with mixed‑precision candidates and compute the change in loss per bit of precision removed. Layers whose quantization causes a large loss increase are marked “sensitive.”
  3. Optimization loop – Starting from a uniform low‑bitwidth baseline, SigmaQuant greedily upgrades the most sensitive layers (i.e., assigns them higher bitwidths) for as long as the hardware budget allows. The loop stops when any further upgrade would violate the memory, energy, or latency limits.
  4. Fine‑tuning – After the bitwidth map is fixed, the network undergoes a short mixed‑precision fine‑tuning phase (typically 5–10 epochs) to recover any residual accuracy loss.
  5. Deployment wrapper – The final quantized model is exported in a format compatible with popular edge runtimes (e.g., TensorFlow Lite, ONNX Runtime), with per‑layer quantization parameters embedded.
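Steps 2 and 3 above can be sketched as a small greedy allocator. This is one simple variant of the loop the paper describes, assuming a memory‑only budget; the layer names, sensitivity scores, and budget are invented for illustration, and the real method also accounts for energy and latency.

```python
# Greedy bitwidth allocation: start every layer at a uniform low bitwidth,
# then upgrade layers from most to least sensitive while the budget holds.

def greedy_bitwidths(sensitivity, params, memory_budget_bits, low=2, high=8):
    """sensitivity: layer -> loss increase when quantized (higher = more sensitive)
    params: layer -> parameter count. Returns a layer -> bitwidth map."""
    bits = {name: low for name in sensitivity}

    def memory(b):
        # Total model size in bits under candidate assignment b.
        return sum(params[n] * b[n] for n in b)

    # Visit layers in order of decreasing sensitivity.
    for name in sorted(sensitivity, key=sensitivity.get, reverse=True):
        while bits[name] < high:
            trial = dict(bits)
            trial[name] += 1
            if memory(trial) > memory_budget_bits:
                break  # any further upgrade would violate the budget
            bits = trial
    return bits

# Illustrative three-layer model with a 30 kbit weight budget.
sens = {"conv1": 0.9, "conv2": 0.3, "fc": 0.6}
params = {"conv1": 1_000, "conv2": 4_000, "fc": 2_000}
alloc = greedy_bitwidths(sens, params, memory_budget_bits=30_000)
print(alloc)  # {'conv1': 8, 'conv2': 2, 'fc': 7}
```

Because each candidate upgrade is a constant‑time budget check, the search is fast, which matches the minutes‑scale runtime reported for the full pipeline.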

The whole pipeline runs in minutes on a workstation, in stark contrast to prior methods that required hours of exhaustive search or reinforcement‑learning‑based exploration.

Results & Findings

| Model (Dataset) | Baseline FP32 Acc. | Uniform 4‑bit Acc. | SigmaQuant (mixed) Acc. | Memory ↓ | Energy ↓ | Latency ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| MobileNet‑V2 (ImageNet) | 71.8 % | 68.3 % | 71.1 % | 3.2× | 2.9× | 2.5× |
| ResNet‑18 (CIFAR‑10) | 93.2 % | 90.5 % | 92.8 % | 2.8× | 2.6× | 2.3× |
| TinyYOLO (COCO) | 41.5 % mAP | 37.0 % mAP | 40.8 % mAP | 3.5× | 3.1× | 2.8× |

Key takeaways

  • Accuracy preservation: Heterogeneous quantization recovers most of the accuracy lost by uniform low‑bit quantization, often within 0.5 % of the full‑precision baseline.
  • Resource gains: Memory, energy, and latency reductions are consistently above 2×, meeting typical edge constraints (e.g., <1 MB model size, <10 ms inference).
  • Speed of search: SigmaQuant finds a near‑optimal bitwidth schedule in < 10 minutes, compared to > 4 hours for grid‑search baselines.

Practical Implications

  • Faster time‑to‑market for edge AI products – Engineers can plug SigmaQuant into their CI/CD pipelines and automatically generate hardware‑specific models without manual trial‑and‑error.
  • Battery‑life extensions – By lowering per‑operation energy, devices such as wearables, drones, or IoT cameras can run inference longer on a single charge.
  • Scalable across heterogeneous hardware – The cost model can be calibrated for any SoC, making the same codebase work for low‑end microcontrollers and high‑end mobile CPUs alike.
  • Enables ultra‑low‑bit deployments – Developers can now consider 2‑bit or 3‑bit quantization for non‑critical layers, opening the door to sub‑megabyte DNNs for truly constrained devices.
  • Compatibility with existing toolchains – Since the output follows TensorFlow Lite/ONNX standards, existing runtimes can immediately take advantage of the mixed‑precision model without custom kernels.
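The "per‑layer quantization parameters" that the exported model embeds are, in TFLite/ONNX‑style runtimes, a scale and zero‑point per tensor. The following sketch shows the standard asymmetric affine quantization formula at an arbitrary bitwidth; the value ranges used are illustrative, and this is generic quantization math rather than code from the paper.

```python
# Standard asymmetric affine quantization: map a real range [xmin, xmax]
# onto the integer grid [0, 2^bits - 1] via a scale and zero-point.

def quant_params(xmin: float, xmax: float, bits: int):
    """Compute (scale, zero_point) for the range [xmin, xmax]."""
    qmax = (1 << bits) - 1
    scale = (xmax - xmin) / qmax
    zero_point = round(-xmin / scale)
    return scale, zero_point

def quantize(x: float, scale: float, zero_point: int, bits: int) -> int:
    qmax = (1 << bits) - 1
    q = round(x / scale) + zero_point
    return max(0, min(qmax, q))  # clamp to the representable grid

def dequantize(q: int, scale: float, zero_point: int) -> float:
    return (q - zero_point) * scale

# Illustrative 4-bit layer whose activations span [-1, 1].
scale, zp = quant_params(-1.0, 1.0, bits=4)
q = quantize(0.5, scale, zp, bits=4)
approx = dequantize(q, scale, zp)  # recovers 0.5 to within one step (= scale)
```

At 2-3 bits the grid is very coarse (scale grows quickly), which is why only the insensitive layers identified by the sensitivity analysis can afford such low precision.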

Limitations & Future Work

  • Calibration data requirement – The sensitivity analysis needs a small, representative dataset; performance may degrade if the calibration set is not well‑matched to the deployment domain.
  • Static hardware profiling – The current cost model assumes fixed hardware characteristics; dynamic voltage/frequency scaling or runtime thermal throttling are not yet accounted for.
  • Limited to feed‑forward CNNs – Experiments focus on vision models; applying SigmaQuant to transformers, RNNs, or graph networks will require additional layer‑type handling.
  • Future directions mentioned by the authors include:
    1. Extending the optimizer to a multi‑objective formulation (e.g., jointly minimizing latency and energy).
    2. Integrating reinforcement‑learning to adapt bitwidths on‑the‑fly for runtime‑varying constraints.
    3. Open‑sourcing a hardware‑agnostic profiler that can auto‑extract the cost model from any edge device.

Authors

  • Qunyou Liu
  • Pengbo Yu
  • Marina Zapater
  • David Atienza

Paper Information

  • arXiv ID: 2602.22136v1
  • Categories: cs.LG, cs.AR
  • Published: February 25, 2026