Unlocking Peak Performance on Qualcomm NPU with LiteRT
Source: Google Developers Blog
Nov 24, 2025 – by Lu Wang, Senior Staff Software Engineer
Modern smartphones feature sophisticated SoCs (systems on a chip) composed of a CPU, GPU, and NPU, which together enable compelling, on-device GenAI experiences that are significantly more interactive and real-time than their server-only counterparts. The GPU is the most ubiquitous accelerator for AI tasks, with GPU compute available on roughly 90% of all Android devices. However, relying solely on the GPU can create performance bottlenecks, especially when building complex, interactive GenAI experiences.
Consider the following scenario: running a compute-intensive text-to-image generation model on-device while simultaneously processing the live camera feed with an ML-based segmentation model. Even the most powerful mobile GPU will struggle under this combined load, resulting in jarring frame drops and a broken user experience.
Performance bottleneck with full GPU inference (left), and a smooth user experience with NPU/GPU parallel processing (right). Captured on a Samsung Galaxy S25 Ultra powered by the Qualcomm Snapdragon 8 Elite.
This is where the NPU (Neural Processing Unit) comes in. It’s a highly specialized processor that offers tens of TOPS (tera operations per second) of dedicated AI compute, far more than a modern mobile GPU can sustain. Crucially, it is significantly more power-efficient per TOP than both CPUs and GPUs, which is essential for battery-powered devices like mobile phones. The NPU is no longer a niche feature; it’s a standard component, with over 80% of recent Qualcomm SoCs now including one. The NPU runs in parallel with the GPU and CPU, handling heavy AI processing while freeing the GPU for rendering and the CPU for main-thread logic. This architecture unlocks the smooth, responsive, and fast performance that modern AI applications demand.

Introducing LiteRT Qualcomm AI Engine Direct Accelerator
To bring this NPU power to LiteRT, Google’s high-performance on-device ML framework, we are thrilled to announce a significant leap forward: the LiteRT Qualcomm AI Engine Direct (QNN) Accelerator. Developed in close collaboration with Qualcomm, it replaces the previous TFLite QNN delegate.
Key advantages for developers
Unified, simplified mobile deployment workflow
- No need to interact with low-level, vendor-specific SDKs; LiteRT integrates with SoC compilers and runtimes and exposes a streamlined API.
- No need to target individual SoC versions; LiteRT abstracts fragmentation across SoCs, allowing a single workflow to scale across multiple devices.
You can now deploy your model seamlessly across all supported devices, with either ahead-of-time (AOT) or on-device compilation. This makes integrating pre-trained .tflite models from sources like the Qualcomm AI Hub easier than ever.

State-of-the-art on-device performance
- Supports an extensive range of LiteRT ops, enabling maximum NPU usage and full model delegation.
- Includes specialized kernels and optimizations for sophisticated LLMs and GenAI models, achieving SOTA performance for models like Gemma and FastVLM.
Superior performance, real‑world results
We benchmarked the new LiteRT QNN accelerator across 72 canonical ML models (vision, audio, and NLP). Highlights:
- Up to 100× speedup over CPU and 10× speedup over GPU.
- Supports 90 LiteRT ops, allowing 64 of the 72 models to delegate fully to the NPU.
- On Qualcomm’s Snapdragon 8 Elite Gen 5, over 56 models run in under 5 ms on the NPU, versus only 13 on the CPU.
Representative benchmark

Figure: LiteRT inference latency measured on Snapdragon 8 Elite Gen 5 powering the Xiaomi 17 Pro Max. Values are normalized to the CPU baseline (100 %). GPU reduces latency to ~5–70 %, NPU to ~1–20 %.
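To make the normalization concrete, here is a small, purely illustrative Python sketch that converts a latency expressed as a percentage of the CPU baseline into a speedup factor. The percentages are the rounded ranges quoted in the caption above, not raw benchmark measurements.
# Illustrative only: convert "percent of CPU baseline" latency into a speedup factor.
def speedup_vs_cpu(normalized_latency_percent: float) -> float:
    # A latency of X% of the CPU baseline corresponds to a 100/X speedup over the CPU.
    return 100.0 / normalized_latency_percent

# Rounded ranges from the figure caption (not measured values).
for label, pct in [("GPU, best case", 5.0), ("GPU, worst case", 70.0),
                   ("NPU, best case", 1.0), ("NPU, worst case", 20.0)]:
    print(f"{label}: ~{speedup_vs_cpu(pct):.1f}x speedup over the CPU")
The best-case NPU figure (latency at roughly 1% of the CPU baseline) is where the headline "up to 100×" speedup comes from.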
Unlocking the full power of NPU for LLM inference
The LiteRT QNN Accelerator also delivers cutting-edge performance on sophisticated LLMs. We benchmarked the FastVLM-0.5B research model (a state-of-the-art vision-language model) using LiteRT for both AOT compilation and on-device NPU inference.

- The model is quantized to int8 weights and int16 activations, unlocking the NPU’s high-speed int16 kernels (a rough quantization sketch follows this list).
- We added specialized NPU kernels for transformer attention layers to the LiteRT QNN Accelerator.
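As a rough sketch only (not the exact recipe used for FastVLM-0.5B), an int8-weight / int16-activation model can be produced with the standard TensorFlow Lite "16x8" quantization mode; the saved-model path and the calibration data below are placeholders.
import tensorflow as tf

# Sketch of int8-weight / int16-activation ("16x8") quantization via the TFLite
# converter. The model path and calibration data are placeholders, not the
# actual FastVLM-0.5B pipeline.
def representative_dataset():
    for _ in range(100):
        # Placeholder calibration samples; use real input data in practice.
        yield [tf.random.uniform((1, 224, 224, 3), dtype=tf.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8
]

with open("model_w8a16.tflite", "wb") as f:
    f.write(converter.convert())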
Results on the Snapdragon 8 Elite Gen 5 NPU:
- Time‑to‑first‑token (TTFT): 0.12 s for 1024×1024 images.
- Over 11,000 tokens/s for prefill and over 100 tokens/s for decode.
These numbers enable smooth, real‑time interactive experiences, demonstrated in a live scene‑understanding demo that processes and describes the surrounding world.
Getting started in 3 steps
Step 1 (optional): AOT compilation for target SoCs
from ai_edge_litert.aot import aot_compile as aot_lib
from ai_edge_litert.aot.vendors.qualcomm import target as qnn_target

# Compile for all available SoCs
compiled_models = aot_lib.aot_compile("model.tflite")

# Or compile for a specific Qualcomm SoC (e.g., Snapdragon 8 Elite Gen 5)
sm8850_target = qnn_target.Target(qnn_target.SocModel.SM8850)
compiled_models = aot_lib.aot_compile(
    "model.tflite",
    target=[sm8850_target]
)
Export the compiled models into a Google Play AI Pack:
from ai_edge_litert.aot.ai_pack import export_lib as ai_pack_export

ai_pack_export.export(
    compiled_models,
    ai_pack_dir="path/to/pack",
    ai_pack_name="my_ai_pack",
    litert_model_name="my_model"
)
See a full example in the LiteRT AOT compilation notebook.
Step 2: Deploy via Google Play for On‑device AI
Add the model (or AI Pack) to your Android project:
- On-device compilation: place the original .tflite file in assets/.
- AOT compilation: copy the AI Pack into the project root and reference it in Gradle:
// settings.gradle.kts
include(":ai_pack:my_model")

// app/build.gradle.kts
android {
    assetPacks.add(":ai_pack:my_model")
}
Fetch the QNN libraries:
./litert_npu_runtime_libraries/fetch_qualcomm_library.sh
Add the runtime libraries as dynamic feature modules:
// settings.gradle.kts
include(":litert_npu_runtime_libraries:runtime_strings")
include(":litert_npu_runtime_libraries:qualcomm_runtime_v79")

// app/build.gradle.kts
android {
    dynamicFeatures.add(":litert_npu_runtime_libraries:qualcomm_runtime_v79")
}

dependencies {
    implementation(project(":litert_npu_runtime_libraries:runtime_strings"))
}
For a complete guide, see the Play for On‑device AI tutorial.
Step 3: Inference on NPU with LiteRT Runtime API
// Load the model and initialize the runtime (falls back to GPU if the NPU is unavailable)
val model = CompiledModel.create(
    context.assets,
    "model/mymodel.tflite",
    CompiledModel.Options(Accelerator.NPU, Accelerator.GPU)
)

// Pre-allocate input and output buffers
val inputBuffers = model.createInputBuffers()
val outputBuffers = model.createOutputBuffers()

// Fill the input buffer
inputBuffers[0].writeFloat(...)

// Run inference
model.run(inputBuffers, outputBuffers)

// Read the output
val result = outputBuffers[0].readFloat()
Check out the image segmentation sample app for a complete example.
What’s next
The LiteRT Qualcomm AI Engine Direct (QNN) Accelerator bridges the gap between raw hardware potential and real‑world application performance. We’re excited to see what you’ll build with this power.
Explore the LiteRT DevSite and the LiteRT GitHub repository. Happy building!
Acknowledgements
Special thanks to the Google ODML team and Qualcomm team for their significant contributions:
Google ODML team: Alice Zheng, Advait Jain, Andrew Zhang, Arian Arfaian, Chintan Parikh, Chunlei Niu, Cormac Brick, Gerardo Carranza, Gregory Karpiak, Jingjiang Li, Jing Jin, Julius Kammerl, Lu Wang, Luke Boyer, Marissa Ikonomidis, Maria Lyubimtseva, Matt Kreileder, Matthias Grundmann, Na Li, Ping Yu, Quentin Khan, Rishika Sinha, Sachin Kotwani, Sebastian Schmidt, Steven Toribio, Teng‑Hui Zhu, Terry (Woncheol) Heoi, Vitalii Dziuba, Weiyi Wang, Yu‑Hui Chen, Zichuan We
Qualcomm LiteRT team: Alen Huang, Bastiaan Aarts, Brett Taylor, Chun‑Hsueh Lee (Jack), Chun‑Po Chang (Jerry), Chun‑Ting Lin (Graham), Felix Baum, Jiun‑Kai Yang (Kelvin), Krishna Sridhar, Ming‑Che Lin (Vincent), William Lin