Apple working to cram massive Gemini model into iPhone to power new Siri
Source: Ars Technica
It’s impossible to avoid generative AI entirely when interacting with technology these days, but Apple’s products have a bit less of it than most. That’s not entirely by choice, though. The iPhone maker has delayed the AI‑enhanced Siri multiple times since first promising it in 2024, but a deal with Google will merge the iconic assistant with Gemini later this year. As we approach the Worldwide Developers Conference, Apple has been working to bring big AI smarts to the modest processing environment of a smartphone. Apple fans may not like the outcome, however.
Apple has long emphasized the privacy value of running AI locally, but a new report suggests that the iPhone’s Gemini makeover will rely heavily on Google and Nvidia in the cloud. The Information reports that Apple’s Gemini‑infused Siri will run both on‑device and in the cloud, an apparent reversal of its privacy‑focused preference for local AI.
With every new chip announcement, Apple highlights how its silicon is optimized for AI, especially through Neural Engine upgrades. While smartphones are getting more capable NPUs, they still lack the RAM needed to hold enormous models in memory. The GPUs in most phones can even process more AI tokens than the AI‑focused NPUs, but no amount of processing speed makes up for the memory that the massive parameter counts of state‑of‑the‑art models demand.
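To put the memory problem in rough numbers, the sketch below estimates how much RAM a model's weights alone occupy: parameter count × bits per weight ÷ 8 gives bytes. The parameter counts and the 8 GB phone RAM figure are illustrative assumptions, not specs for Gemini or any iPhone, and the estimate ignores activations and other runtime overhead, so real usage would be higher.

```python
# Back-of-the-envelope estimate of the RAM needed just to hold a model's weights.
# Ignores activations, KV cache, and runtime overhead, so real usage is higher.

PHONE_RAM_GB = 8  # assumption: total RAM of a recent high-end phone, shared with the OS and apps

# Illustrative parameter counts (assumptions, not published figures for any Gemini model).
MODELS = {
    "3B on-device-class model": 3e9,
    "1T frontier-class model": 1e12,
}


def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Bytes = params * bits / 8, reported in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for name, params in MODELS.items():
    for bits in (16, 4):
        gb = weight_memory_gb(params, bits)
        verdict = "under" if gb < PHONE_RAM_GB else "over"
        print(f"{name} at {bits}-bit: {gb:,.1f} GB ({verdict} {PHONE_RAM_GB} GB of RAM)")
```

Even squeezed to 4 bits per weight, the illustrative trillion‑parameter model would need about 500 GB for its weights alone, which is why that class of model stays in the data center.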
The largest AI models today have trillions of parameters, whereas on‑device models are limited to a few billion at most. To fit on a phone, these models are “quantized” to run at lower precision, which shrinks their memory footprint and speeds up inference but can reduce the accuracy of the tokens they generate. Consequently, on‑device AI models often feel less smart than their cloud counterparts, and even the biggest cloud models can produce subpar results.
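Quantization can be illustrated with one common scheme, symmetric per‑tensor quantization: every weight is divided by a single shared scale and rounded to a small signed integer. This is a minimal sketch of the general idea under that assumption; it is not the method Apple or Google is known to use, and the sample weights are made up. Dropping from 8 to 4 bits makes each rounding step coarser, so the reconstructed weights drift further from the originals, which is the accuracy trade‑off in miniature.

```python
# Minimal sketch of symmetric per-tensor quantization (one common scheme; an assumption,
# not the technique any specific vendor uses for Gemini or Siri).


def quantize(weights: list[float], bits: int = 8) -> tuple[list[int], float]:
    """Map floats to signed ints in [-qmax, qmax] using one shared scale factor."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero on all-zero input
    scale = max_abs / qmax
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats; the rounding error is what costs accuracy."""
    return [v * scale for v in q]


# Made-up sample weights for illustration.
weights = [0.12, -0.503, 0.0071, 0.9, -0.33]

for bits in (8, 4):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit ints={q} max reconstruction error={max_err:.4f}")
```

Each 4‑bit weight takes half the memory of an 8‑bit one, but the largest reconstruction error in this toy example grows noticeably, mirroring how aggressive quantization trades quality for fit.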
The amazing, shrinking Gemini
Google offers mobile‑optimized versions of Gemini, branded as Gemini Nano. These are intended for contextual features like Magic Cue and audio summarization. Siri, however, is meant to be a conversational assistant—users talk to it and expect it to perform actions. That requires a different kind of model. On Android, Google typically routes Gemini queries straight to the cloud rather than attempting full on‑device execution.