[Paper] AutoNeural: Co-Designing Vision-Language Models for NPU Inference

Published: December 2, 2025 at 11:45 AM EST
3 min read

Source: arXiv - 2512.02924v1

Overview

The paper introduces AutoNeural, a vision‑language model (VLM) that is built from the ground up for inference on Neural Processing Units (NPUs). By redesigning both the visual and language backbones to match the integer‑only, high‑throughput nature of NPUs, the authors achieve dramatic speed‑ups and lower quantization error, making real‑time multimodal AI feasible on edge devices such as automotive cockpits.

Key Contributions

  • Co‑designed NPU‑native architecture: Replaces the standard Vision Transformer (ViT) encoder with a MobileNetV5‑style depthwise‑separable CNN that quantizes cleanly to INT4/8/16 (see the sketch after this list).
  • Hybrid language backbone: Merges State‑Space Model (SSM) concepts with Transformer layers, using gated convolutions for linear‑time attention and eliminating costly KV‑cache I/O.
  • Integer‑only inference pipeline: End‑to‑end model runs without floating‑point operations, preserving accuracy while exploiting NPU arithmetic units.
  • Substantial efficiency gains: Up to 7× lower quantization error for the vision encoder, 14× lower end‑to‑end latency, 3× faster decoding, and a 4× longer context window versus GPU‑centric baselines.
  • Real‑world validation: Demonstrated on Qualcomm SA8295P SoC in an automotive cockpit scenario, achieving real‑time performance for vision‑language tasks.
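
The NPU‑native vision encoder is easier to picture with code. Below is a minimal sketch, assuming PyTorch, of a MobileNet‑style depthwise‑separable block; the class name and layer sizes are illustrative, not taken from the paper. The bounded ReLU6 activation is the kind of property that keeps value ranges tight and makes low‑bit integer quantization stable.

```python
# Minimal sketch (not the authors' code) of a MobileNet-style
# depthwise-separable block, assuming PyTorch.
import torch
import torch.nn as nn

class DepthwiseSeparableBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # Depthwise 3x3: one filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        # Pointwise 1x1: mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # ReLU6 clamps activations to [0, 6], keeping quantization ranges bounded.
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

if __name__ == "__main__":
    block = DepthwiseSeparableBlock(32, 64, stride=2)
    print(block(torch.randn(1, 32, 224, 224)).shape)  # torch.Size([1, 64, 112, 112])
```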

Methodology

  1. Vision Encoder Redesign

    • Swapped the ViT for a MobileNetV5‑style CNN that relies on depthwise separable convolutions.
    • This architecture naturally keeps activation ranges bounded, which is crucial for stable INT4/8/16 quantization on NPUs.
  2. Language Decoder Redesign

    • Integrated State‑Space Model (SSM) blocks with conventional Transformer layers.
    • Used gated convolutions as a linear‑time, O(L), replacement for the usual O(L²) attention, removing the need for large key‑value caches that would otherwise flood the NPU’s memory bandwidth (a minimal sketch of this idea follows the list).
  3. Co‑Design Loop

    • Conducted a hardware‑aware search where model hyper‑parameters (e.g., channel width, SSM state size) were tuned to match the NPU’s compute‑to‑memory ratio.
    • Quantization‑aware training ensured that the final integer‑only model retained accuracy comparable to its floating‑point counterpart (a generic fake‑quantization sketch also follows the list).
  4. Evaluation Setup

    • Benchmarked against a standard ViT‑Transformer VLM on the same hardware.
    • Measured quantization error, latency, decoding speed, and context length on the Qualcomm SA8295P NPU.
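
To make step 2 concrete, here is a hedged sketch of a gated causal‑convolution token mixer, again assuming PyTorch. It illustrates the general idea of linear‑time, cache‑free sequence mixing; it is not the paper's exact layer, and the module name and sizes are made up for illustration.

```python
# Illustrative sketch (an assumption, not the paper's exact layer) of a gated
# causal-convolution token mixer: cost grows linearly with sequence length L,
# and no key-value cache is needed, unlike O(L^2) self-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvMixer(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        # Depthwise 1D conv over the sequence axis: O(L * dim * kernel_size).
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim)
        self.gate = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        g = torch.sigmoid(self.gate(x))            # gating path
        h = x.transpose(1, 2)                      # (batch, dim, seq_len)
        h = F.pad(h, (self.kernel_size - 1, 0))    # left padding -> causal conv
        h = self.conv(h).transpose(1, 2)           # back to (batch, seq_len, dim)
        return self.proj(g * h)                    # gated, linear-time mixing

if __name__ == "__main__":
    mixer = GatedConvMixer(dim=256)
    print(mixer(torch.randn(2, 128, 256)).shape)   # torch.Size([2, 128, 256])
```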
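
Step 3 relies on quantization‑aware training. The snippet below is a generic sketch of symmetric fake quantization with a straight‑through estimator, which is the basic mechanism behind QAT; the bit width and per‑tensor scale here are assumptions for illustration, not the authors' recipe.

```python
# Generic QAT building block (an assumption, not the authors' recipe):
# symmetric per-tensor fake quantization with a straight-through estimator.
import torch

def fake_quantize(x: torch.Tensor, bit_width: int = 8) -> torch.Tensor:
    qmax = 2 ** (bit_width - 1) - 1                       # e.g. 127 for INT8
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    # Forward pass sees the quantized values; backward pass treats the
    # rounding as identity so gradients still flow (straight-through).
    return x + (q - x).detach()

if __name__ == "__main__":
    w = torch.randn(4, 4, requires_grad=True)
    fake_quantize(w, bit_width=4).sum().backward()
    print(w.grad)  # gradients flow despite rounding in the forward pass
```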

Results & Findings

| Metric | Baseline (GPU‑oriented VLM) | AutoNeural (NPU‑native) |
| --- | --- | --- |
| Vision encoder quantization error | – (high) | 7× lower |
| End‑to‑end inference latency | 140 ms | 10 ms (≈ 14× faster) |
| Decoding throughput (tokens/s) | 30 | 90 (≈ 3×) |
| Maximum context window | 256 tokens | 1024 tokens (≈ 4×) |
| Real‑time automotive cockpit demo | Not feasible | Achieved (≤ 30 ms per frame) |

The results show that the co‑designed architecture not only runs faster but also scales to longer sequences without hitting memory bottlenecks, all while preserving the task accuracy needed for vision‑language applications.

Practical Implications

  • Edge AI Deployment: Developers can now run sophisticated multimodal models on low‑power devices (e.g., in‑car infotainment systems, drones, wearables) without offloading to the cloud.
  • Reduced Power Consumption: Integer‑only inference on NPUs consumes far less energy than mixed‑precision GPU inference, extending battery life for portable products.
  • Simplified Software Stack: Eliminating KV‑cache management and heavy floating‑point ops means fewer dependencies and easier integration into existing NPU SDKs.
  • Longer Context for Conversational UI: The 4× larger context window enables richer, more coherent interactions in voice‑assistant or AR/VR scenarios on the edge.
  • Accelerated Prototyping: The hardware‑aware design flow demonstrated in the paper can be adapted to other modalities (audio, sensor fusion), giving product teams a template for NPU‑first model development.

Limitations & Future Work

  • Model Capacity Trade‑off: Swapping ViT for a lightweight CNN reduces the raw representational power; while accuracy is retained for the evaluated tasks, more complex vision problems may suffer.
  • Hardware Specificity: The architecture and quantization settings are tuned for Qualcomm’s SA8295P NPU; portability to other NPU families may require additional calibration.
  • SSM Maturity: State‑Space Models are still an emerging research area; stability and training dynamics can be more finicky than standard Transformers.
  • Future Directions: The authors suggest exploring automated neural architecture search (NAS) that jointly optimizes for multiple NPU platforms, extending the co‑design to include on‑device training, and investigating hybrid quantization schemes (e.g., mixed INT4/INT8) for even finer performance‑accuracy balances.

Authors

  • Wei Chen
  • Liangmin Wu
  • Yunhai Hu
  • Zhiyuan Li
  • Zhiyuan Cheng
  • Yicheng Qian
  • Lingyue Zhu
  • Zhipeng Hu
  • Luoyi Liang
  • Qiang Tang
  • Zhen Liu
  • Han Yang

Paper Information

  • arXiv ID: 2512.02924v1
  • Categories: cs.CL
  • Published: December 2, 2025