Apple researchers develop on-device AI agent that interacts with apps for you
Source: 9to5Mac

Despite having just **3 billion parameters**, *Ferret‑UI Lite* matches or surpasses the benchmark performance of models up to **24 times larger**. Here are the details:
A Bit of Background on Ferret
In December 2023, a team of nine researchers published a study titled “FERRET: Refer and Ground Anything Anywhere at Any Granularity”. The paper introduced a multimodal large language model (MLLM) capable of understanding natural‑language references to specific parts of an image.

Since then, Apple has released a series of follow‑up papers expanding the Ferret family, including:
Ferret‑UI
The Ferret‑UI variants build on the original FERRET capabilities and address a shortcoming identified for general‑domain MLLMs: limited comprehension of user‑interface (UI) screens.
From the original Ferret‑UI paper:
Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet these general‑domain MLLMs often fall short in their ability to comprehend and interact effectively with user‑interface (UI) screens. In this paper, we present Ferret‑UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, text) than natural images, we incorporate “any resolution” on top of Ferret to magnify details and leverage enhanced visual features.

The original Ferret‑UI study showcased an interactive application where a user could converse with the model to learn how to operate the interface.
Ferret‑UI Lite
A few days ago, Apple announced “Ferret‑UI Lite: Lessons from Building Small On‑Device GUI Agents.”
- Ferret‑UI was built on a 13 B‑parameter model focused on mobile UI understanding and fixed‑resolution screenshots.
- Ferret‑UI 2 expanded support to multiple platforms and higher‑resolution perception.
- Ferret‑UI Lite is a lightweight, on‑device model that remains competitive with much larger GUI agents.
Read the Ferret‑UI Lite paper here.
According to the researchers of the new paper, “the majority of existing methods of GUI agents … focus on large foundation models.” This is because “the strong reasoning and planning capabilities of large server‑side models allow these agentic systems to achieve impressive capabilities in diverse GUI navigation tasks.”
They note that, while there has been a lot of progress on both multi‑agent and end‑to‑end GUI systems—each taking different approaches to streamline the many tasks that involve agentic interaction with GUIs (low‑level GUI grounding, screen understanding, multi‑step planning, and self‑reflection)—these systems are too large and compute‑hungry to run well on‑device.
What is Ferret‑UI Lite?
Ferret‑UI Lite is a 3‑billion‑parameter variant of Ferret‑UI, built with several key components guided by insights on training small‑scale language models.
Key features
- Training data – Uses both real and synthetic data from multiple GUI domains.
- On‑the‑fly cropping & zooming – Inference‑time techniques that focus on specific GUI segments.
- Fine‑tuning – Supervised fine‑tuning combined with reinforcement‑learning methods.
The result is a model that matches or outperforms competing GUI‑agent models that are up to 24× larger.

On‑the‑Fly Cropping & Zooming
The architecture (detailed in the paper) includes a noteworthy cropping pipeline:
- The model makes an initial prediction.
- It crops around the predicted region.
- It re‑predicts on the cropped image.
This helps a small model compensate for limited capacity to process many image tokens.
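The two‑pass pipeline above can be sketched in a few lines. This is an illustrative sketch only: the `model.predict` interface, the `Box` format, and the `crop_normalized` helper are assumptions for the sake of the example, not Apple's actual API.

```python
# Hypothetical sketch of two-pass "crop and zoom" inference: predict coarsely,
# crop around the prediction, re-predict on the magnified crop, then map the
# refined box back to full-screenshot coordinates.
from dataclasses import dataclass

@dataclass
class Box:
    x: float  # top-left corner, normalized [0, 1] coordinates
    y: float
    w: float  # width and height, normalized
    h: float

def crop_and_zoom_predict(model, screenshot, query, zoom: float = 3.0) -> Box:
    # Pass 1: coarse prediction on the full screenshot.
    coarse = model.predict(screenshot, query)

    # Crop a region centered on the coarse prediction, enlarged by `zoom`,
    # clamped so it stays inside the screenshot.
    cx, cy = coarse.x + coarse.w / 2, coarse.y + coarse.h / 2
    cw, ch = min(coarse.w * zoom, 1.0), min(coarse.h * zoom, 1.0)
    left = min(max(cx - cw / 2, 0.0), 1.0 - cw)
    top = min(max(cy - ch / 2, 0.0), 1.0 - ch)
    crop = screenshot.crop_normalized(left, top, cw, ch)

    # Pass 2: refined prediction on the magnified crop, where small UI
    # elements occupy far more image tokens.
    fine = model.predict(crop, query)

    # Map the refined box from crop coordinates back to screenshot coordinates.
    return Box(left + fine.x * cw, top + fine.y * ch,
               fine.w * cw, fine.h * ch)
```

The point of the second pass is that a tiny icon that covered a handful of image tokens in the full screenshot occupies many more after cropping, which is how a small model recovers grounding precision it can't afford on the full image.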

Self‑Generated Training Data
Ferret‑UI Lite also generates its own training data via a multi‑agent system that interacts with live GUI platforms:
- Curriculum task generator – Proposes goals of increasing difficulty.
- Planning agent – Breaks goals into steps.
- Grounding agent – Executes steps on‑screen.
- Critic model – Evaluates the outcomes.

This pipeline captures the fuzziness of real‑world interaction (errors, unexpected states, recovery strategies) that is hard to obtain from clean, human‑annotated data.
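The four‑role loop can be sketched roughly as follows. Every class and method name here is illustrative, standing in for the curriculum generator, planner, grounder, and critic the paper describes, not Apple's implementation.

```python
# Hypothetical sketch of the multi-agent self-play data-generation loop:
# propose a goal, plan it, execute it on a live GUI, and keep only the
# trajectories the critic judges successful.
def generate_trajectories(env, task_gen, planner, grounder, critic, n_tasks=10):
    dataset = []
    for difficulty in range(n_tasks):
        goal = task_gen.propose(difficulty)        # curriculum: harder goals over time
        steps = planner.plan(goal)                 # break the goal into steps
        trajectory = []
        for step in steps:
            action = grounder.execute(env, step)   # act on the live GUI
            trajectory.append((env.observe(), step, action))
        if critic.succeeded(env, goal):            # filter on the critic's verdict
            dataset.append((goal, trajectory))
    return dataset
```

Because the grounder acts on a real environment, the recorded trajectories naturally include the mis-clicks, unexpected screens, and recovery steps mentioned above, which is exactly what clean human-annotated datasets tend to lack.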
Evaluation
While earlier Ferret‑UI versions were evaluated on iPhone screenshots, Ferret‑UI Lite was trained and evaluated on Android, web, and desktop GUIs, using benchmarks such as AndroidWorld and OSWorld.
- Performs well on short‑horizon, low‑level tasks.
- Shows weaker performance on complex, multi‑step interactions—a trade‑off expected for a small, on‑device model.
Despite this, Ferret‑UI Lite offers a local, private agent that can autonomously interact with app interfaces without sending data to the cloud.
To learn more about the study, including benchmark breakdowns and results, follow this link.
