[Paper] Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Published: February 23, 2026 at 01:59 PM EST
5 min read

Source: arXiv - 2602.20161v1

Overview

Mobile‑O is a lightweight vision‑language‑diffusion model that brings both visual understanding and image generation to a smartphone‑class device. By redesigning the cross‑modal conditioning pipeline, the authors achieve real‑time performance (≈3 s per 512×512 image on an iPhone) while matching or surpassing heavyweight academic baselines on standard benchmarks.

Key Contributions

  • Mobile Conditioning Projector (MCP) – a novel cross‑modal fusion block that uses depthwise‑separable convolutions and layer‑wise alignment to inject vision‑language context into a diffusion generator with minimal FLOPs.
  • Compact unified architecture – the entire model fits comfortably on mobile hardware (≈30 M parameters) and runs without any server‑side assistance.
  • Quadruplet post‑training scheme – a single fine‑tuning pass on (prompt, image, question, answer) tuples simultaneously improves generation quality and visual‑question‑answering (VQA) performance.
  • Data‑efficient training – the system is trained on only a few million image‑text pairs (vs. tens of millions for typical foundation models) yet reaches competitive scores.
  • Open‑source ecosystem – code, pretrained weights, mobile demo app, and a curated multimodal dataset are released for reproducibility and community extensions.

Methodology

  1. Backbone encoder – a mobile‑friendly vision transformer (e.g., ViT‑Tiny) extracts a spatial feature map from the input image.

  2. Text encoder – a lightweight transformer (≈6 M parameters) processes the prompt or question, producing a sequence of token embeddings.

  3. Mobile Conditioning Projector (MCP)

    • Aligns each vision token with the corresponding text token using layer‑wise cosine similarity to create a shared representation.
    • Applies depthwise‑separable 3×3 convolutions to fuse the aligned features, drastically reducing multiply‑add operations compared with full convolutions.
    • Outputs a conditioned latent that feeds directly into the diffusion decoder.
  4. Diffusion generator – a UNet‑style denoising network (scaled down to mobile size) receives the MCP‑conditioned latent and iteratively refines a noise tensor into the final image.

  5. Quadruplet fine‑tuning – the model is exposed to four‑element tuples:

    • Generation prompt → image synthesis loss (L₂ + perceptual).
    • Image → VQA loss (question → answer) using the same encoder‑decoder pipeline.

    This joint objective forces the shared parameters to serve both tasks, eliminating the need for separate heads.
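The two ingredients of the MCP (step 3) can be sketched in NumPy. This is an illustrative reconstruction, not the authors' code: the softmax-weighted cosine alignment and the toy shapes are assumptions based on the description above.

```python
import numpy as np

def cosine_align(vision, text):
    """Inject text context into vision tokens via cosine-similarity weights.
    vision: (N, d) vision tokens; text: (M, d) text token embeddings."""
    v = vision / np.linalg.norm(vision, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    sim = v @ t.T                                             # (N, M) cosine similarities
    w = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)  # softmax over text tokens
    return vision + w @ text                                  # residual text injection

def depthwise_separable_conv(x, dw, pw):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution.
    x: (H, W, C); dw: (3, 3, C) per-channel kernels; pw: (C, C_out) channel mixer."""
    H, W, C = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))              # same-size output
    out = np.empty((H, W, C))
    for i in range(H):
        for j in range(W):
            # each channel is convolved with its own 3x3 kernel only
            out[i, j] = (padded[i:i+3, j:j+3, :] * dw).sum(axis=(0, 1))
    return out @ pw                                           # pointwise channel mixing
```

The FLOP saving is easy to see: a k×k separable convolution costs roughly (k² + C_out)·C multiply-adds per pixel versus k²·C·C_out for a full convolution, so with k=3 and C=C_out=64 the separable form is about 8× cheaper.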
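The iterative refinement in step 4 follows the standard denoising-diffusion recipe. A deterministic DDIM-style sampling loop looks roughly like the sketch below; the noise schedule, step count, and latent shape are placeholders, since the paper's exact sampler is not specified here.

```python
import numpy as np

def sample(unet, cond, steps=20, shape=(64, 64, 4), seed=0):
    """Iteratively refine Gaussian noise into a clean latent.
    unet(x, t, cond) is assumed to predict the noise component of x,
    conditioned on the MCP output `cond`."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                   # start from pure noise
    alphas = np.linspace(0.01, 0.999, steps)         # toy schedule: noisy -> clean
    for i, a in enumerate(alphas):
        eps = unet(x, i, cond)                       # predicted noise
        x0 = (x - np.sqrt(1.0 - a) * eps) / np.sqrt(a)   # implied clean latent
        a_next = alphas[i + 1] if i + 1 < len(alphas) else 1.0
        x = np.sqrt(a_next) * x0 + np.sqrt(1.0 - a_next) * eps  # re-noise less
    return x
```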
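The joint objective of step 5 can be written as a single loss over one quadruplet. This is a minimal sketch under stated simplifications: the perceptual term is omitted, the VQA branch is reduced to cross-entropy over a discrete answer vocabulary, and the weighting factor `lam` is a hypothetical hyperparameter.

```python
import numpy as np

def quadruplet_loss(gen_pred, gen_target, vqa_logits, answer_idx, lam=0.5):
    """Joint objective over one (prompt, image, question, answer) tuple."""
    # Generation branch: L2 reconstruction (perceptual term omitted in this sketch)
    l2 = np.mean((gen_pred - gen_target) ** 2)
    # Understanding branch: cross-entropy of the ground-truth answer token
    z = vqa_logits - vqa_logits.max()            # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())      # log-softmax over answers
    ce = -log_probs[answer_idx]
    return l2 + lam * ce                         # shared weights serve both tasks
```

Because both terms backpropagate through the same encoders, a single fine-tuning pass improves generation and VQA together, which is exactly why no separate task heads are needed.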

All operations are implemented with Apple’s Core ML and TensorFlow Lite kernels, ensuring optimal use of the device’s Neural Engine and GPU.

Results & Findings

| Metric | Mobile‑O | Show‑O | JanusFlow |
| --- | --- | --- | --- |
| GenEval (image generation) | 74 % | 69 % | 63 % |
| Avg. VQA accuracy (7 benchmarks) | +15.3 % over Show‑O, +5.1 % over JanusFlow | | |
| Inference time (512×512) | ~3 s on iPhone 14 Pro | 18 s (≈6× slower) | 33 s (≈11× slower) |
  • Despite using ≈10× fewer parameters and ≈5× less training data, Mobile‑O matches or exceeds the generation quality of larger models.
  • The MCP contributes the bulk of the speedup: ablating it slows inference by 4.2× while the quality difference remains negligible, confirming its efficiency‑first design.
  • The quadruplet fine‑tuning improves VQA scores by ~7 % without harming generation fidelity, demonstrating successful multi‑task sharing.

Practical Implications

| Domain | How Mobile‑O Helps |
| --- | --- |
| On‑device AI apps (photo editors, AR filters) | Real‑time text‑to‑image synthesis and instant visual Q&A without the latency or privacy concerns of cloud calls. |
| Edge robotics / drones | Generate contextual overlays (e.g., “show a map of the area”) while simultaneously interpreting visual cues on‑board. |
| Mobile gaming | Dynamically create assets or storyboards from player prompts, keeping the game lightweight and offline‑first. |
| Enterprise field tools | Workers can ask “What part is damaged?” and receive annotated images instantly, boosting inspection workflows. |
| Research prototyping | Developers can iterate on multimodal prompts locally, drastically shortening the feedback loop compared to server‑based pipelines. |

Because the model runs entirely on the device, user data never leaves it, which simplifies compliance with data‑privacy regulations such as GDPR and HIPAA and reduces bandwidth costs; this is critical for applications in remote or low‑connectivity environments.

Limitations & Future Work

  • Resolution ceiling – The current pipeline is tuned for 512×512 images; scaling to 1024×1024 would require either more memory or a multi‑stage upsampling strategy.
  • Domain coverage – Training data is limited to a few million general‑purpose image‑text pairs; niche domains (medical imaging, satellite imagery) may need additional fine‑tuning.
  • Hardware dependence – Performance numbers are reported on recent Apple silicon; older Android devices may see slower inference, suggesting a need for broader hardware benchmarking.
  • Prompt complexity – Very long or highly compositional prompts can degrade generation fidelity, indicating room for richer language modeling or hierarchical conditioning.

Future directions include integrating adapter‑style modules for domain‑specific extensions, exploring progressive diffusion to push resolution limits, and extending the MCP concept to audio‑visual multimodal tasks.

Mobile‑O demonstrates that unified multimodal intelligence is no longer the exclusive domain of cloud‑scale servers. By marrying efficient cross‑modal conditioning with a compact diffusion backbone, it opens the door for a new class of on‑device AI experiences.

Authors

  • Abdelrahman Shaker
  • Ahmed Heakl
  • Jaseel Muhammad
  • Ritesh Thawkar
  • Omkar Thawakar
  • Senmao Li
  • Hisham Cholakkal
  • Ian Reid
  • Eric P. Xing
  • Salman Khan
  • Fahad Shahbaz Khan

Paper Information

  • arXiv ID: 2602.20161v1
  • Categories: cs.CV
  • Published: February 23, 2026
  • PDF: Download PDF