Google’s latest trick gets Gemma 4 running 3x faster right on your phone

Published: May 6, 2026 at 05:10 AM EDT

Source: Android Authority

TL;DR

Google has introduced new assistant models, called “drafters,” that could significantly speed up Gemma 4.
Drafters work by predicting the next several words of a response, which the main model can then verify together in a single batch instead of generating them one at a time.
This allows the model to use memory and compute more efficiently.

Google’s recently launched Gemma 4 edge AI models are specifically designed to run locally on consumer hardware. While favorable from a privacy standpoint, local models can easily hog resources and return results slowly, which makes them impractical. So, Google is now offering a potential solution, which it claims can speed up Gemma 4 models by up to three times.

Google recently released Multi‑Token Prediction (MTP) drafters for Gemma 4. These drafters are essentially smaller, assistive models that help the primary model by “predicting” the next few words of its response. These smaller models also work alongside the main model to use the available compute more effectively.

How does MTP improve Gemma 4?

The process uses a technique called “speculative decoding,” in which the drafter model predicts the next several words of the response before the main Gemma model has generated them. The main model then verifies the whole predicted set in a single pass instead of producing one word at a time.

If the model accepts the drafted version, it moves on to verify the next set.
If it disagrees, it keeps the words up to the first mismatch, substitutes its own word there, and discards the rest of the draft.
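The accept-or-replace loop described above can be sketched in a few lines of Python. This is a toy illustration, not Google's implementation: `target_next` and `drafter_next` are hypothetical stand-ins for the main model and the drafter, tokens are plain integers, and the verification step is simulated with a loop where a real system would use one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # toy "model": context -> next token


def target_next(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the main model (slow but authoritative)."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def drafter_next(ctx: List[Token]) -> Token:
    """Hypothetical drafter: agrees with the main model except at every 4th position."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_decode(model: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: the model produces one token per pass."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: NextToken, drafter: NextToken, prompt: List[Token], n_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Return the generated sequence and the number of verification passes used."""
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1. The drafter cheaply proposes up to k tokens.
        draft: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            draft.append(drafter(out + draft))

        # 2. The main model checks the draft (one batched pass in a real system).
        passes += 1
        accepted: List[Token] = []
        for tok in draft:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)       # draft token accepted
            else:
                accepted.append(expected)  # replace the first mismatch...
                break                      # ...and discard the rest of the draft
        out.extend(accepted)
    return out, passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    spec, passes = speculative_decode(target_next, drafter_next, prompt, n_new=20)
    print("same output as plain decoding:", spec == greedy_decode(target_next, prompt, 20))
    print("main-model passes:", passes, "instead of 20")
```

Because every kept token is either confirmed or supplied by the main model, the output matches what the main model would have produced on its own; only the number of main-model passes changes.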

The extra work may sound counterproductive, but it pays off. An oversimplified explanation of why MTP works:

The speed of processing is determined not just by the processing hardware (typically GPU cores) but also by memory bandwidth, that is, how quickly the model’s weights can be read from memory (VRAM on a GPU).
The model’s weights must be read from memory for every word it generates, so by checking multiple words as a single chunk, the weights are read only once rather than multiple times, shifting the load from memory to the processing unit.
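A back-of-the-envelope calculation makes the memory argument concrete. The figures below (4 GB of weights, 50 GB/s of memory bandwidth, three words kept per verification pass) are hypothetical values chosen for illustration, not measurements from Google, and the estimate ignores the drafter's own cost and any compute limits.

```python
def tokens_per_second(
    model_bytes: float, bandwidth_bytes_per_s: float, tokens_per_pass: float = 1.0
) -> float:
    """Estimate throughput when each pass must read all weights from memory once."""
    seconds_per_pass = model_bytes / bandwidth_bytes_per_s
    return tokens_per_pass / seconds_per_pass


# Hypothetical numbers, for illustration only.
MODEL_BYTES = 4e9   # 4 GB of model weights
BANDWIDTH = 50e9    # 50 GB/s of memory bandwidth

baseline = tokens_per_second(MODEL_BYTES, BANDWIDTH)                          # one word per pass
speculative = tokens_per_second(MODEL_BYTES, BANDWIDTH, tokens_per_pass=3.0)  # three words per pass

if __name__ == "__main__":
    print(f"baseline:    {baseline:.1f} tokens/s")
    print(f"speculative: {speculative:.1f} tokens/s")
    print(f"speedup:     {speculative / baseline:.1f}x")
```

Under these assumptions the speedup equals the average number of words kept per pass, which is why the benefit depends on how often the main model agrees with the drafter.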

In addition to these changes, Google says it is also working to optimize Gemma 4 models of different sizes for specific hardware, such as Apple Silicon or the popular Nvidia A100.

The MTP drafters for Gemma 4, alongside the primary model, can be accessed via platforms such as Hugging Face or Kaggle, tools like Ollama, or through Google’s own AI Edge Gallery on Android or iOS.
