[Paper] AutoNeural: Co-Designing Vision-Language Models for NPU Inference

Published: December 2, 2025 at 11:45 AM EST
3 min read

Source: arXiv - 2512.02924v1

Overview

The paper introduces AutoNeural, a vision‑language model (VLM) that is built from the ground up for inference on Neural Processing Units (NPUs). By redesigning both the visual and language backbones to match the integer‑only, high‑throughput nature of NPUs, the authors achieve dramatic speed‑ups and lower quantization error, making real‑time multimodal AI feasible on edge devices such as automotive cockpits.

Key Contributions

  • Co‑designed NPU‑native architecture: Replaces the standard Vision Transformer (ViT) encoder with a MobileNetV5‑style depthwise‑separable CNN that quantizes cleanly to INT4/8/16 (see the sketch after this list).
  • Hybrid language backbone: Merges State‑Space Model (SSM) concepts with Transformer layers, using gated convolutions for linear‑time attention and eliminating costly KV‑cache I/O.
  • Integer‑only inference pipeline: End‑to‑end model runs without floating‑point operations, preserving accuracy while exploiting NPU arithmetic units.
  • Substantial efficiency gains: Up to 7× lower quantization error for the vision encoder, 14× lower end‑to‑end latency, 3× faster decoding, and a 4× longer context window versus GPU‑centric baselines.
  • Real‑world validation: Demonstrated on Qualcomm SA8295P SoC in an automotive cockpit scenario, achieving real‑time performance for vision‑language tasks.
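
The NPU‑native vision encoder is easier to picture with code. Below is a minimal sketch, assuming PyTorch, of a MobileNet‑style depthwise‑separable block; the class name and layer sizes are illustrative, not taken from the paper. The bounded ReLU6 activation is the kind of property that keeps value ranges tight and makes low‑bit integer quantization stable.

```python
# Minimal sketch (not the authors' code) of a MobileNet-style
# depthwise-separable block, assuming PyTorch.
import torch
import torch.nn as nn

class DepthwiseSeparableBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # Depthwise 3x3: one filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        # Pointwise 1x1: mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # ReLU6 clamps activations to [0, 6], keeping quantization ranges bounded.
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

if __name__ == "__main__":
    block = DepthwiseSeparableBlock(32, 64, stride=2)
    print(block(torch.randn(1, 32, 224, 224)).shape)  # torch.Size([1, 64, 112, 112])
```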

Methodology

  1. Vision Encoder Redesign

    • Swapped the ViT for a MobileNetV5‑style CNN that relies on depthwise separable convolutions.
    • This architecture naturally keeps activation ranges bounded, which is crucial for stable INT4/8/16 quantization on NPUs.
  2. Language Decoder Redesign

    • Integrated State‑Space Model (SSM) blocks with conventional Transformer layers.
    • Used gated convolutions as a linear‑time, O(L), replacement for the usual O(L²) attention, removing the need for large key‑value caches that would otherwise flood the NPU’s memory bandwidth (a minimal sketch of this idea follows the list).
  3. Co‑Design Loop

    • Conducted a hardware‑aware search where model hyper‑parameters (e.g., channel width, SSM state size) were tuned to match the NPU’s compute‑to‑memory ratio.
    • Quantization‑aware training ensured that the final integer‑only model retained accuracy comparable to its floating‑point counterpart (a generic fake‑quantization sketch also follows the list).
  4. Evaluation Setup

    • Benchmarked against a standard ViT‑Transformer VLM on the same hardware.
    • Measured quantization error, latency, decoding speed, and context length on the Qualcomm SA8295P NPU.
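
To make step 2 concrete, here is a hedged sketch of a gated causal‑convolution token mixer, again assuming PyTorch. It illustrates the general idea of linear‑time, cache‑free sequence mixing; it is not the paper's exact layer, and the module name and sizes are made up for illustration.

```python
# Illustrative sketch (an assumption, not the paper's exact layer) of a gated
# causal-convolution token mixer: cost grows linearly with sequence length L,
# and no key-value cache is needed, unlike O(L^2) self-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvMixer(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        # Depthwise 1D conv over the sequence axis: O(L * dim * kernel_size).
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim)
        self.gate = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        g = torch.sigmoid(self.gate(x))            # gating path
        h = x.transpose(1, 2)                      # (batch, dim, seq_len)
        h = F.pad(h, (self.kernel_size - 1, 0))    # left padding -> causal conv
        h = self.conv(h).transpose(1, 2)           # back to (batch, seq_len, dim)
        return self.proj(g * h)                    # gated, linear-time mixing

if __name__ == "__main__":
    mixer = GatedConvMixer(dim=256)
    print(mixer(torch.randn(2, 128, 256)).shape)   # torch.Size([2, 128, 256])
```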
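
Step 3 relies on quantization‑aware training. The snippet below is a generic sketch of symmetric fake quantization with a straight‑through estimator, which is the basic mechanism behind QAT; the bit width and per‑tensor scale here are assumptions for illustration, not the authors' recipe.

```python
# Generic QAT building block (an assumption, not the authors' recipe):
# symmetric per-tensor fake quantization with a straight-through estimator.
import torch

def fake_quantize(x: torch.Tensor, bit_width: int = 8) -> torch.Tensor:
    qmax = 2 ** (bit_width - 1) - 1                       # e.g. 127 for INT8
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    # Forward pass sees the quantized values; backward pass treats the
    # rounding as identity so gradients still flow (straight-through).
    return x + (q - x).detach()

if __name__ == "__main__":
    w = torch.randn(4, 4, requires_grad=True)
    fake_quantize(w, bit_width=4).sum().backward()
    print(w.grad)  # gradients flow despite rounding in the forward pass
```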

Results & Findings

| Metric | Baseline (GPU‑oriented VLM) | AutoNeural (NPU‑native) |
| --- | --- | --- |
| Vision encoder quantization error | – (high) | 7× lower |
| End‑to‑end inference latency | 140 ms | 10 ms (≈ 14× faster) |
| Decoding throughput (tokens/s) | 30 | 90 (≈ 3×) |
| Maximum context window | 256 tokens | 1024 tokens (≈ 4×) |
| Real‑time automotive cockpit demo | Not feasible | Achieved (≤ 30 ms per frame) |

The results show that the co‑designed architecture not only runs faster but also scales to longer sequences without hitting memory bottlenecks, all while preserving the task accuracy needed for vision‑language applications.

Practical Implications

  • Edge AI Deployment: Developers can now run sophisticated multimodal models on low‑power devices (e.g., in‑car infotainment systems, drones, wearables) without offloading to the cloud.
  • Reduced Power Consumption: Integer‑only inference on NPUs consumes far less energy than mixed‑precision GPU inference, extending battery life for portable products.
  • Simplified Software Stack: Eliminating KV‑cache management and heavy floating‑point ops means fewer dependencies and easier integration into existing NPU SDKs.
  • Longer Context for Conversational UI: The 4× larger context window enables richer, more coherent interactions in voice‑assistant or AR/VR scenarios on the edge.
  • Accelerated Prototyping: The hardware‑aware design flow demonstrated in the paper can be adapted to other modalities (audio, sensor fusion), giving product teams a template for NPU‑first model development.

Limitations & Future Work

  • Model Capacity Trade‑off: Swapping ViT for a lightweight CNN reduces the raw representational power; while accuracy is retained for the evaluated tasks, more complex vision problems may suffer.
  • Hardware Specificity: The architecture and quantization settings are tuned for Qualcomm’s SA8295P NPU; portability to other NPU families may require additional calibration.
  • SSM Maturity: State‑Space Models are still an emerging research area; stability and training dynamics can be more finicky than standard Transformers.
  • Future Directions: The authors suggest exploring automated neural architecture search (NAS) that jointly optimizes for multiple NPU platforms, extending the co‑design to include on‑device training, and investigating hybrid quantization schemes (e.g., mixed INT4/INT8) for even finer performance‑accuracy balances.

Authors

  • Wei Chen
  • Liangmin Wu
  • Yunhai Hu
  • Zhiyuan Li
  • Zhiyuan Cheng
  • Yicheng Qian
  • Lingyue Zhu
  • Zhipeng Hu
  • Luoyi Liang
  • Qiang Tang
  • Zhen Liu
  • Han Yang

Paper Information

  • arXiv ID: 2512.02924v1
  • Categories: cs.CL
  • Published: December 2, 2025