From PyTorch to Shipping Local AI on Android

Published: December 13, 2025 at 03:42 PM EST
4 min read
Source: Dev.to

Why On‑Device AI Is Hard

Modern Android apps rely on real‑time intelligence: pose detection in fitness apps, AR filters in social apps, on‑device audio processing, and live classification. Running these models locally gives speed, privacy, and offline capability, but it also introduces a major challenge: consistent performance across the enormous variety of Android devices.

A recent conversation with an Android developer highlighted the problem. He trained a MobileNet‑based model in PyTorch, converted it to TFLite, and verified it on three phones (Pixel 7, Galaxy S21, and a mid‑range Motorola). After release, users on other devices reported sluggish performance, unstable frame rates, and crashes before inference even began. This pattern—models that work on a few phones but break on many others—is one of the most common issues developers face.

Hardware Diversity and Performance Variability

Two devices released in the same year can behave completely differently when running the same model:

  • CPU – runs anything but rarely meets real‑time needs.
  • GPU – faster, but performance depends heavily on the runtime delegate (TFLite GPU, NNAPI, Vulkan).
  • NPU – fastest, but only for models correctly adapted and compiled for that chipset.

Accelerators and drivers vary widely in supported operations and precisions. Runtime delegates often choose different compute paths, so the same model may execute through completely different routes on two phones, leading to noticeable differences in stability and latency.
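
You can get a feel for this variability without any special tooling. The sketch below is only an illustration: it times the same converted LiteRT (TFLite) model with the standard tf.lite.Interpreter under different CPU thread counts (the model path is a placeholder); on real phones the spread grows further once GPU or NPU delegates enter the picture.

import time

import numpy as np
import tensorflow as tf


def measure_latency_ms(model_path, num_threads, runs=50):
    # Same model, different CPU configuration: only the compute path changes.
    interpreter = tf.lite.Interpreter(model_path=model_path, num_threads=num_threads)
    interpreter.allocate_tensors()

    inp = interpreter.get_input_details()[0]
    dummy = np.random.random_sample(tuple(inp["shape"])).astype(inp["dtype"])
    interpreter.set_tensor(inp["index"], dummy)

    interpreter.invoke()  # warm-up run
    start = time.perf_counter()
    for _ in range(runs):
        interpreter.invoke()
    return (time.perf_counter() - start) / runs * 1000.0


# "mobilenet_v2.tflite" is a placeholder path for your converted model.
for threads in (1, 2, 4):
    print(f"{threads} thread(s): {measure_latency_ms('mobilenet_v2.tflite', threads):.2f} ms")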

Toolchain Complexity

Exporting a model from PyTorch → ONNX → TFLite is just the first step. Many hardware vendors provide their own delegates, runtimes, and SDKs, each with subtle quirks:

  • Setting up TFLite GPU delegates, NNAPI, or vendor‑specific runtimes (e.g., Qualcomm, Google Tensor) requires experimentation.
  • Error messages are often vague, making it hard to determine whether an issue stems from an unsupported operator, a precision mismatch (e.g., FP32 on an INT8‑only accelerator), or missing hardware acceleration; one quick check for the first case is sketched below.
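
One way to catch the unsupported-operator case early, assuming your environment includes TensorFlow 2.9 or newer, is the LiteRT/TFLite model analyzer. It lists the operators in a converted model and flags those the GPU delegate cannot run (it does not cover NNAPI or vendor NPU delegates), so treat it as a quick first check rather than a full compatibility report.

import tensorflow as tf

# List the operators in the converted model and flag any the TFLite GPU
# delegate cannot handle. "model.tflite" is a placeholder path.
tf.lite.experimental.Analyzer.analyze(
    model_path="model.tflite",
    gpu_compatibility=True,
)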

Real‑World Constraints

Even when a model runs, phones impose strict limits:

  • Thermal budget – heavy models can overheat and trigger throttling.
  • Battery drain – power‑hungry inference leads to quick uninstalls.
  • Memory – older or low‑end phones have limited RAM and weaker accelerators, preventing some models from running at all.

Balancing accuracy, latency, and power consumption often requires careful optimization and device‑specific tuning.

Embedl Hub: A Solution

To address these challenges, we built Embedl Hub, a platform that helps you:

  • Compile models for the correct runtime and accelerators on target devices.
  • Optimize models for latency, memory usage, and energy consumption, including by enabling NPU acceleration.
  • Benchmark models on real edge hardware in the cloud, measuring device‑specific latency, memory, and execution paths.

All metrics, parameters, and benchmarks are logged and displayed in a web UI, allowing you to inspect layer‑level behavior, compare devices side‑by‑side, and reproduce every run. This makes it easy to choose the best model‑device combination before releasing your app.

Platform Overview

  1. Compilation – Convert your model to the appropriate format and target runtime.
  2. Quantization (optional but recommended) – Reduce precision (e.g., INT8) to lower latency and power usage.
  3. Benchmarking – Run the model on a fleet of Android devices in the cloud and collect detailed metrics.

Demo: Optimizing MobileNetV2 on a Samsung Galaxy S24

Suppose you want to run a MobileNetV2 model trained in PyTorch.

1. Export to ONNX and compile for LiteRT (TFLite)

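The ONNX export itself is standard PyTorch. Here is a minimal sketch that uses torchvision's pretrained MobileNetV2 as a stand-in for your own trained weights and assumes a 1×3×224×224 input; adjust the shape and checkpoint loading to match your model.

import torch
import torchvision

# Pretrained torchvision weights stand in for your own trained checkpoint here.
model = torchvision.models.mobilenet_v2(weights="DEFAULT")
model.eval()

# Dummy input matching the shape your app will feed at inference time.
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "mobilenet_v2.onnx",
    opset_version=13,
    input_names=["input"],
    output_names=["logits"],
)

With the ONNX file in hand, compile it for the target runtime:
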
embedl-hub compile \
    --model /path/to/mobilenet_v2.onnx

This step quickly reveals whether the model is compatible with the device’s chipset and execution paths, catching issues that would otherwise surface only after launch.

2. Quantize the model

Quantization lowers the numerical precision of weights and activations, dramatically reducing inference latency and memory usage. It is especially useful for NPU acceleration on modern Android devices.

embedl-hub quantize \
    --model /path/to/mobilenet_v2.tflite \
    --data /path/to/dataset \
    --num-samples 100

A small calibration dataset (a few hundred examples) helps minimize accuracy loss while ensuring the model runs efficiently on resource‑constrained hardware.
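
The calibration described above is the same idea as standard post-training INT8 quantization in the LiteRT/TensorFlow Lite converter, where a small representative dataset drives the choice of quantization parameters. The sketch below shows that mechanism in isolation, with a placeholder SavedModel path and random data standing in for a real export and dataset; it illustrates the concept rather than what the embedl-hub CLI runs internally.

import numpy as np
import tensorflow as tf


def representative_dataset():
    # Yield ~100 preprocessed samples; random data stands in for real
    # calibration images here.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]


# "saved_model_dir" is a placeholder for a SavedModel export of the network.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("mobilenet_v2_int8.tflite", "wb") as f:
    f.write(converter.convert())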

3. Benchmark in the cloud

After compilation and quantization, submit the model to Embedl Hub’s benchmarking service. The platform runs the model on real Samsung Galaxy S24 devices (and any other devices you select), reporting:

  • Per‑layer latency
  • Memory footprint
  • Power consumption estimates
  • Execution path (CPU, GPU, NPU)

You can then compare results across devices, iterate on optimizations, and lock in the configuration that meets your real‑time requirements.

Takeaway

Running AI on Android devices demands more than just converting a model. You must navigate diverse hardware, complex toolchains, and strict thermal and power budgets. Embedl Hub streamlines this process by providing automated compilation, quantization, and cloud‑based benchmarking, giving you confidence that your on‑device AI will perform reliably across the fragmented Android ecosystem.
