[Paper] Deployment-Aligned Low-Precision Neural Architecture Search for Spaceborne Edge AI
Source: arXiv - 2604.24492v1
Overview
The paper tackles a hidden mismatch in current hardware‑aware Neural Architecture Search (NAS): most pipelines design networks assuming full‑precision (FP32) training, then later quantize them to low‑precision (e.g., FP16) for edge deployment. This “post‑hoc” step can cause a noticeable drop in accuracy, especially on ultra‑constrained devices like the Intel Movidius Myriad X VPU used in space‑borne edge AI. By weaving low‑precision constraints directly into the NAS loop, the authors close the gap between the architecture’s search‑time behavior and its real‑world, on‑device performance.
Key Contributions
- Deployment‑aligned low‑precision NAS: Introduces a simple yet effective modification that forces every candidate architecture to be fine‑tuned and evaluated under FP16 constraints during the search.
- Hardware‑aware evaluation without extra search‑space engineering: Keeps the original evolutionary NAS algorithm untouched; the only change is the precision‑aware training pipeline.
- Real‑world maritime monitoring use‑case: Demonstrates the method on a vessel‑segmentation model destined for the Myriad X VPU, a processor commonly used in satellites and other space‑borne platforms.
- Quantitative gains: Naïve post‑training quantization drops mIoU from 0.85 to 0.78; the deployment‑aligned search delivers 0.826 on‑device, recovering roughly two‑thirds of the loss without increasing parameter count (≈96 k).
- Generalizable recipe: The approach can be applied to any hardware‑aware NAS framework that already supports a target device metric (latency, energy, etc.).
Methodology
- Search Space & Evolutionary Strategy – The authors reuse a standard NAS search space for segmentation (varying encoder depth, kernel sizes, etc.) and an evolutionary algorithm that selects, mutates, and recombines architectures based on a composite fitness score (accuracy + latency on the target VPU).
- Low‑Precision Fine‑Tuning – For each sampled architecture, after a brief warm‑up in FP32, the model is fine‑tuned for a few epochs using FP16 arithmetic (simulated on the host GPU). This forces the network to learn weights that are robust to the reduced mantissa and dynamic range of FP16.
- On‑Device Evaluation Proxy – Deploying every candidate to the VPU would be impractical at search scale, so the authors use a calibrated latency model plus an FP16‑simulated validation pass to estimate on‑device mIoU. The fitness function therefore already reflects low‑precision behavior.
- Selection & Evolution – Architectures that achieve high simulated FP16 accuracy while meeting the latency budget are carried into the next generation, iterating until convergence (a sketch of this loop follows the list).
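To make the loop concrete, below is a minimal Python sketch of the search described above. The paper does not publish code, so the operator vocabulary, per‑op latency table, penalty weight, and population parameters are illustrative assumptions, not the authors' implementation; `eval_fp16_miou` is the FP16 fine‑tuning step sketched further below.

```python
import random

LATENCY_BUDGET_MS = 12.0

# Hypothetical calibrated latency model: per-op costs measured once on
# the target VPU, then summed as a cheap search-time proxy.
LATENCY_LUT_MS = {"conv3x3": 0.9, "conv5x5": 1.6, "skip": 0.1}

def estimate_latency_ms(arch):
    """Estimate on-device latency from the calibrated lookup table."""
    return sum(LATENCY_LUT_MS[op] for op in arch)

def fitness(arch, fp16_miou):
    """Composite score: simulated-FP16 accuracy, with candidates over
    the latency budget heavily penalized."""
    over_budget = max(0.0, estimate_latency_ms(arch) - LATENCY_BUDGET_MS)
    return fp16_miou - 10.0 * over_budget

def mutate(arch):
    """Replace one operator with a random alternative."""
    child = list(arch)
    child[random.randrange(len(child))] = random.choice(list(LATENCY_LUT_MS))
    return child

def evolve(population, eval_fp16_miou, generations=50, n_parents=8):
    """eval_fp16_miou(arch) -> mIoU after brief FP16 fine-tuning."""
    for _ in range(generations):
        ranked = sorted(population,
                        key=lambda a: fitness(a, eval_fp16_miou(a)),
                        reverse=True)
        parents = ranked[:n_parents]
        offspring = [mutate(random.choice(parents))
                     for _ in range(len(population) - n_parents)]
        population = parents + offspring
    return max(population, key=lambda a: fitness(a, eval_fp16_miou(a)))
```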
The key insight is that the only extra cost is the FP16 fine‑tuning step, which is negligible compared to the overall NAS budget; a sketch of that step follows.
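Here is what such an FP16 fine‑tune‑and‑validate routine could look like in PyTorch. This is a sketch under assumptions: the paper only states that FP16 is simulated on the host GPU, so the pure‑FP16 casting via `.half()`, the Adam settings, and the `evaluate_miou` helper are illustrative choices (production code would usually add loss scaling):

```python
import torch
import torch.nn.functional as F

def fine_tune_fp16(model, train_loader, val_loader,
                   epochs=3, lr=1e-4, device="cuda"):
    """Briefly fine-tune a (previously FP32 warmed-up) candidate under
    FP16 arithmetic, then return its FP16 validation mIoU."""
    model = model.to(device).half()  # weights + activations in FP16
    # Larger eps keeps Adam's denominator representable in FP16.
    opt = torch.optim.Adam(model.parameters(), lr=lr, eps=1e-4)
    model.train()
    for _ in range(epochs):
        for x, y in train_loader:
            logits = model(x.to(device).half())
            loss = F.cross_entropy(logits, y.to(device))
            opt.zero_grad()
            loss.backward()  # loss scaling omitted for brevity
            opt.step()
    return evaluate_miou(model, val_loader, device=device)

@torch.no_grad()
def evaluate_miou(model, loader, num_classes=2, device="cuda"):
    """Mean IoU computed under the same FP16 arithmetic."""
    inter = torch.zeros(num_classes, device=device)
    union = torch.zeros(num_classes, device=device)
    model.eval()
    for x, y in loader:
        pred = model(x.to(device).half()).argmax(dim=1)
        target = y.to(device)
        for c in range(num_classes):
            p, t = pred == c, target == c
            inter[c] += (p & t).sum()
            union[c] += (p | t).sum()
    return (inter / union.clamp(min=1)).mean().item()
```

Wiring the two sketches together, `evolve` would receive something like `lambda arch: fine_tune_fp16(build_model(arch), train_dl, val_dl)`, where `build_model` is a hypothetical constructor that instantiates a candidate architecture.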
Results & Findings
| Metric | Full‑Precision NAS (post‑hoc quant) | Deployment‑aligned Low‑Precision NAS |
|---|---|---|
| Parameters | 95,791 | 95,791 (unchanged) |
| On‑Device Latency (Myriad X) | 12 ms (meets budget) | 12 ms (identical) |
| mIoU (FP32 validation) | 0.85 | 0.85 |
| mIoU (FP16 on‑device) | 0.78 | 0.826 |
- Accuracy Gap Reduction: The low‑precision‑aware search recovers ~66 % of the drop caused by naïve quantization (the arithmetic is spelled out below).
- No Extra Complexity: Parameter count and latency are identical across the two searches, indicating the gain comes purely from better numerical robustness rather than from a larger or slower model.
- Robustness Across Seeds: Repeating the NAS with different random seeds yields consistent gains, so the improvement is systematic rather than an artifact of a lucky run.
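The ~66 % figure follows directly from the three mIoU values in the table; a quick check using only the numbers reported above:

```python
fp32_miou = 0.85   # FP32 validation accuracy
ptq_miou  = 0.78   # after naive post-training quantization
ours_miou = 0.826  # deployment-aligned low-precision NAS, FP16 on-device

gap       = fp32_miou - ptq_miou          # 0.07 mIoU lost to PTQ
recovered = (ours_miou - ptq_miou) / gap  # fraction of that gap closed
print(f"{recovered:.0%} of the quantization drop recovered")  # -> 66%
```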
Practical Implications
- Space‑borne Edge AI: Satellites and high‑altitude platforms often rely on low‑power VPUs; this technique directly translates to more reliable on‑board perception (e.g., maritime traffic monitoring, disaster mapping).
- Edge Device Manufacturers: Chip designers can provide a “low‑precision simulation layer” that NAS tools can hook into, enabling co‑design of models that are born for the hardware rather than retro‑fitted.
- Developer Workflow: Teams can keep using familiar AutoML/NAS tooling (e.g., NNI) and simply swap in an FP16 fine‑tuning hook (see the sketch after this list); there is no need to redesign search spaces or write custom quantization‑aware layers.
- Cost Savings: Skipping the post‑search quantization step, which often demands manual re‑training or accuracy‑loss mitigation, shortens product cycles and lowers the risk of field failures.
- Generalization: While the paper focuses on FP16 and the Myriad X, the same principle applies to INT8, bfloat16, or any custom numeric format supported by the target accelerator.
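To illustrate how small that hook can be, here is one generic way to wrap an existing per‑candidate training callable in PyTorch so it runs under a configurable numeric format. This is not the paper's code; `seg_forward` stands in for whatever forward‑plus‑loss routine a NAS framework already invokes:

```python
import torch
import torch.nn.functional as F

def precision_aware(forward_fn, dtype=torch.float16):
    """Wrap an existing forward-plus-loss callable so it executes under
    the target numeric format; backward stays outside the autocast
    region, per the usual AMP convention."""
    def step(model, batch):
        with torch.autocast(device_type="cuda", dtype=dtype):
            loss = forward_fn(model, batch)
        loss.backward()
        return loss
    return step

# Hypothetical existing per-candidate trainer that a NAS framework calls:
def seg_forward(model, batch):
    x, y = batch
    return F.cross_entropy(model(x), y)

# Retargeting the search is a one-argument change:
fp16_step = precision_aware(seg_forward)                        # FP16
bf16_step = precision_aware(seg_forward, dtype=torch.bfloat16)  # bfloat16
```

Swapping `dtype` is the entire change needed to aim the search at bfloat16 or another format the target accelerator supports.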
Limitations & Future Work
- Precision Scope: The study only explores FP16; ultra‑low‑bit formats (INT8, 4‑bit) may exhibit different training dynamics and could need more sophisticated loss scaling or regularization.
- Search Overhead: Adding FP16 fine‑tuning modestly increases NAS runtime; scaling to larger search spaces (e.g., ImageNet‑scale models) may demand more efficient proxy metrics.
- Hardware Diversity: Results are tied to the Myriad X VPU; cross‑device validation (e.g., Edge TPU, NVIDIA Jetson) would strengthen the claim of universal applicability.
- Theoretical Guarantees: The paper provides empirical evidence but lacks a formal analysis of why certain architectures become more numerically robust under low‑precision training.
Future research could extend the framework to mixed‑precision NAS, incorporate hardware‑specific quantization error models, or combine the approach with differentiable NAS for faster convergence.
Authors
- Parampuneet Kaur Thind
- Vaibhav Katturu
- Giacomo Zema
- Roberto Del Prete
Paper Information
- arXiv ID: 2604.24492v1
- Categories: cs.CV, cs.AI, cs.ET, cs.LG, cs.NE
- Published: April 27, 2026