Speculative Decoding on Mobile GPUs
Source: Dev.to
title: “Speculative Decoding on Mobile GPUs: Draft-Verify LLM Pipelines with Vulkan Compute” published: true description: “Build a speculative decoding pipeline on Android using Vulkan compute shaders for draft models and NNAPI for verification, with adaptive batch scheduling.” tags: android, kotlin, architecture, performance canonical_url: https://blog.mvpfactory.co/speculative-decoding-mobile-gpus-vulkan-compute
What We Are Building
In this workshop, we are going to wire up a speculative decoding pipeline that runs entirely on-device on Android. A small ~150M parameter draft model will propose candidate tokens using Vulkan compute shaders, while a larger 3-7B verify model accepts or rejects them through NNAPI — all coordinated by a dynamic batch scheduler that adapts to thermal state and memory pressure.
The result: 2-3x lower per-token latency, pushing sub-200ms generation on flagship Android hardware. Let me show you how the pieces fit together.
Prerequisites
- Android device with Vulkan 1.1+ compute support (2019 SoCs or newer)
- Android 12+ for
PowerManager.getThermalHeadroom()API - Familiarity with Kotlin and basic Vulkan concepts
- A quantized draft model (int4) and verify model (int8)
Step 1: Understand the Split Architecture
Most teams get this wrong by running both models through the same accelerator. Split the pipeline instead.
| Component | Accelerator | Why |
|---|---|---|
| Draft model (~150M params) | Vulkan compute shaders | Direct GPU control, custom quantization kernels, no NNAPI overhead |
| Verify model (~3-7B params) | NNAPI (delegates to NPU/GPU) | Hardware-optimized int8/int4, vendor-tuned kernels |
| Batch scheduler | CPU | Lightweight coordinator, thermal/memory monitoring |
| KV-cache management | Shared GPU memory | Vulkan buffer exports via VK_KHR_external_memory |
A 7B model running autoregressively on a Snapdragon 8 Gen 3 generates roughly 8-12 tokens/second. With speculative decoding at K=5, server GPUs see 70-85% acceptance rates. The algorithm works. The engineering challenge is orchestrating two models across heterogeneous compute units without melting the phone.
Step 2: Build the Vulkan Draft Pipeline
Here is the minimal setup to get this working. Custom GLSL compute shaders handle quantized matrix multiplications — 4-bit weights with fp16 accumulation hits the sweet spot for mobile GPU ALUs.
kotlin fun proposeCandidates(inputTokenId: Int): IntArray { val candidates = IntArray(specDepth) var currentToken = inputTokenId
for (i in 0 until specDepth) {
bindDescriptorSets(currentToken, kvCache)
vkCmdDispatch(commandBuffer, workgroupsX, 1, 1)
candidates[i] = readArgmaxFromBuffer()
currentToken = candidates[i]
}
return candidates
}
}
Step 3: Wire Up the Adaptive Batch Scheduler
Here is a pattern I use in every project that involves on-device inference. You cannot run speculation depth K=8 when the device is thermal throttling at 45°C. The scheduler must adapt.
kotlin return when { thermalHeadroom 1 // near throttle: no speculation memoryAvailable 2 // memory-constrained thermalHeadroom 3 // warm but manageable else -> 6 // full speculation } }
}
The scheduler polls PowerManager.getThermalHeadroom() on Android 12+ and reads /sys/class/thermal/ zones as a fallback. GPU memory pressure comes from Vulkan’s vkGetPhysicalDeviceMemoryBudgetPropertiesEXT.
On a Pixel 8 Pro, I measured the following thermal-adaptive behavior:
| Thermal State | Spec Depth | Tokens/sec | Acceptance Rate |
|---|---|---|---|
| Cool (42°C) | 1 | 9-11 | N/A (no speculation) |
Step 4: Solve Zero-Copy KV-Cache Sharing
Both models need access to the key-value cache. The draft model builds speculative KV entries in Vulkan buffers. When the verify model accepts tokens, those entries become canonical. When it rejects, you roll back.
Use VK_KHR_external_memory_fd to export Vulkan buffers as file descriptors, then import them into NNAPI via ANeuralNetworksMemory_createFromFd. On a Snapdragon 8 Gen 3, a 512MB KV-cache copy costs ~8ms — that would erase most of your speculation benefit. In my benchmarks, this single zero-copy optimization was worth 15-20% throughput improvement.
Gotchas
Here is the gotcha that will save you hours:
- Pre-2019 SoCs lack Vulkan 1.1 compute support entirely. The draft pipeline simply will not run. Check capabilities at startup and fall back gracefully.
- NNAPI delegation is vendor-dependent. Some NPU delegates reject model topologies silently. The docs do not mention this, but you will need logging at every delegation step to catch silent failures.
- Memory budget is tighter than you think. Devices with 6GB RAM leave roughly 1.5-2GB for both models after Android’s runtime takes its share. You need aggressive quantization: int4 for the draft model, int8 for the verifier. There is no way around it.
- Static speculation depth is a trap. Build thermal-aware scheduling from day one. A fixed K will either waste thermals or leave performance on the table.
Wrapping Up
The split-compute architecture — Vulkan for drafting, NNAPI for verification — is the only way to get parallel model execution on mobile. If you are doing on-device inference and have not explored this pattern yet, start with the Vulkan draft pipeline. It has the steepest learning curve, and everything else builds on top of it.
Build the scheduler early, invest in zero-copy KV-cache sharing, and respect the thermal envelope. That is how you get to 22+ tokens/second on a phone.