Apple researchers develop on-device AI agent that interacts with apps for you
Source: 9to5Mac

Despite having just **3 billion parameters**, *Ferret‑UI Lite* matches or surpasses the benchmark performance of models up to **24 times larger**. Here are the details:
A Bit of Background on Ferret
In December 2023, a team of nine researchers published a study titled “FERRET: Refer and Ground Anything Anywhere at Any Granularity”. The paper introduced a multimodal large language model (MLLM) capable of understanding natural‑language references to specific parts of an image.

Since then, Apple has released a series of follow‑up papers expanding the Ferret family, including:
Ferret‑UI
The Ferret‑UI variants build on the original FERRET capabilities and address a shortcoming identified for general‑domain MLLMs: limited comprehension of user‑interface (UI) screens.
From the original Ferret‑UI paper:
Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet these general‑domain MLLMs often fall short in their ability to comprehend and interact effectively with user‑interface (UI) screens. In this paper, we present Ferret‑UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, text) than natural images, we incorporate “any resolution” on top of Ferret to magnify details and leverage enhanced visual features.

The original Ferret‑UI study showcased an interactive application where a user could converse with the model to learn how to operate the interface.
Ferret‑UI Lite
A few days ago, Apple announced “Ferret‑UI Lite: Lessons from Building Small On‑Device GUI Agents.”
- Ferret‑UI was built on a 13 B‑parameter model focused on mobile UI understanding and fixed‑resolution screenshots.
- Ferret‑UI 2 expanded support to multiple platforms and higher‑resolution perception.
- Ferret‑UI Lite is a lightweight, on‑device model that remains competitive with much larger GUI agents.
Read the Ferret‑UI Lite paper here.
According to the researchers of the new paper, “the majority of existing methods of GUI agents … focus on large foundation models.” This is because “the strong reasoning and planning capabilities of large server‑side models allow these agentic systems to achieve impressive capabilities in diverse GUI navigation tasks.”
They note that, while there has been a lot of progress on both multi‑agent and end‑to‑end GUI systems—each taking different approaches to streamline the many tasks that involve agentic interaction with GUIs (low‑level GUI grounding, screen understanding, multi‑step planning, and self‑reflection)—these systems are too large and compute‑hungry to run well on‑device.
What is Ferret‑UI Lite?
Ferret‑UI Lite is a 3‑billion‑parameter variant of Ferret‑UI, built with several key components guided by insights on training small‑scale language models.
Key features
- Training data – Uses both real and synthetic data from multiple GUI domains.
- On‑the‑fly cropping & zooming – Inference‑time techniques that focus on specific GUI segments.
- Fine‑tuning – Supervised fine‑tuning combined with reinforcement‑learning methods.
The result is a model that matches or outperforms competing GUI‑agent models that are up to 24× larger.

On‑the‑Fly Cropping & Zooming
The architecture (detailed in the paper) includes a noteworthy cropping pipeline:
- The model makes an initial prediction.
- It crops around the predicted region.
- It re‑predicts on the cropped image.
This helps a small model compensate for limited capacity to process many image tokens.
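The two‑pass pipeline above can be sketched in a few lines. This is an illustrative sketch only: the `model.predict` interface, the `Box` format, and the `crop_normalized` helper are assumptions for the sake of the example, not Apple's actual API.

```python
# Hypothetical sketch of two-pass "crop and zoom" inference: predict coarsely,
# crop around the prediction, re-predict on the magnified crop, then map the
# refined box back to full-screenshot coordinates.
from dataclasses import dataclass

@dataclass
class Box:
    x: float  # top-left corner, normalized [0, 1] coordinates
    y: float
    w: float  # width and height, normalized
    h: float

def crop_and_zoom_predict(model, screenshot, query, zoom: float = 3.0) -> Box:
    # Pass 1: coarse prediction on the full screenshot.
    coarse = model.predict(screenshot, query)

    # Crop a region centered on the coarse prediction, enlarged by `zoom`,
    # clamped so it stays inside the screenshot.
    cx, cy = coarse.x + coarse.w / 2, coarse.y + coarse.h / 2
    cw, ch = min(coarse.w * zoom, 1.0), min(coarse.h * zoom, 1.0)
    left = min(max(cx - cw / 2, 0.0), 1.0 - cw)
    top = min(max(cy - ch / 2, 0.0), 1.0 - ch)
    crop = screenshot.crop_normalized(left, top, cw, ch)

    # Pass 2: refined prediction on the magnified crop, where small UI
    # elements occupy far more image tokens.
    fine = model.predict(crop, query)

    # Map the refined box from crop coordinates back to screenshot coordinates.
    return Box(left + fine.x * cw, top + fine.y * ch,
               fine.w * cw, fine.h * ch)
```

The point of the second pass is that a tiny icon that covered a handful of image tokens in the full screenshot occupies many more after cropping, which is how a small model recovers grounding precision it can't afford on the full image.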

Self‑Generated Training Data
Ferret‑UI Lite also generates its own training data via a multi‑agent system that interacts with live GUI platforms:
- Curriculum task generator – Proposes goals of increasing difficulty.
- Planning agent – Breaks goals into steps.
- Grounding agent – Executes steps on‑screen.
- Critic model – Evaluates the outcomes.

This pipeline captures the fuzziness of real‑world interaction (errors, unexpected states, recovery strategies) that is hard to obtain from clean, human‑annotated data.
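The four‑role loop can be sketched roughly as follows. Every class and method name here is illustrative, standing in for the curriculum generator, planner, grounder, and critic the paper describes, not Apple's implementation.

```python
# Hypothetical sketch of the multi-agent self-play data-generation loop:
# propose a goal, plan it, execute it on a live GUI, and keep only the
# trajectories the critic judges successful.
def generate_trajectories(env, task_gen, planner, grounder, critic, n_tasks=10):
    dataset = []
    for difficulty in range(n_tasks):
        goal = task_gen.propose(difficulty)        # curriculum: harder goals over time
        steps = planner.plan(goal)                 # break the goal into steps
        trajectory = []
        for step in steps:
            action = grounder.execute(env, step)   # act on the live GUI
            trajectory.append((env.observe(), step, action))
        if critic.succeeded(env, goal):            # filter on the critic's verdict
            dataset.append((goal, trajectory))
    return dataset
```

Because the grounder acts on a real environment, the recorded trajectories naturally include the mis-clicks, unexpected screens, and recovery steps mentioned above, which is exactly what clean human-annotated datasets tend to lack.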
Evaluation
While earlier Ferret‑UI versions were evaluated on iPhone screenshots, Ferret‑UI Lite was trained and evaluated on Android, web, and desktop GUIs, using benchmarks such as AndroidWorld and OSWorld.
- Performs well on short‑horizon, low‑level tasks.
- Shows weaker performance on complex, multi‑step interactions—a trade‑off expected for a small, on‑device model.
Despite this, Ferret‑UI Lite offers a local, private agent that can autonomously interact with app interfaces without sending data to the cloud.
To learn more about the study, including benchmark breakdowns and results, follow this link.
