[Paper] Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Published: February 23, 2026 at 01:59 PM EST
5 min read

Source: arXiv - 2602.20161v1

Overview

Mobile‑O is a lightweight vision‑language‑diffusion model that brings both visual understanding and image generation to a smartphone‑class device. By redesigning the cross‑modal conditioning pipeline, the authors achieve real‑time performance (≈3 s per 512×512 image on an iPhone) while matching or surpassing heavyweight academic baselines on standard benchmarks.

Key Contributions

  • Mobile Conditioning Projector (MCP) – a novel cross‑modal fusion block that uses depthwise‑separable convolutions and layer‑wise alignment to inject vision‑language context into a diffusion generator with minimal FLOPs.
  • Compact unified architecture – the entire model fits comfortably on mobile hardware (≈30 M parameters) and runs without any server‑side assistance.
  • Quadruplet post‑training scheme – a single fine‑tuning pass on (prompt, image, question, answer) tuples simultaneously improves generation quality and visual‑question‑answering (VQA) performance.
  • Data‑efficient training – the system is trained on only a few million image‑text pairs (vs. tens of millions for typical foundation models) yet reaches competitive scores.
  • Open‑source ecosystem – code, pretrained weights, mobile demo app, and a curated multimodal dataset are released for reproducibility and community extensions.

Methodology

  1. Backbone encoder – a mobile‑friendly vision transformer (e.g., ViT‑Tiny) extracts a spatial feature map from the input image.

  2. Text encoder – a lightweight transformer (≈6 M parameters) processes the prompt or question, producing a sequence of token embeddings.

  3. Mobile Conditioning Projector (MCP)

    • Aligns each vision token with the corresponding text token using layer‑wise cosine similarity to create a shared representation.
    • Applies depthwise‑separable 3×3 convolutions to fuse the aligned features, drastically reducing multiply‑add operations compared with full convolutions.
    • Outputs a conditioned latent that feeds directly into the diffusion decoder.
  4. Diffusion generator – a UNet‑style denoising network (scaled down to mobile size) receives the MCP‑conditioned latent and iteratively refines a noise tensor into the final image.

  5. Quadruplet fine‑tuning – the model is exposed to four‑element tuples:

    • Generation prompt → image synthesis loss (L₂ + perceptual).
    • Image → VQA loss (question → answer) using the same encoder‑decoder pipeline.

    This joint objective forces the shared parameters to serve both tasks, eliminating the need for separate heads.
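The two ingredients of the MCP (step 3) can be sketched in NumPy. This is an illustrative reconstruction, not the authors' code: the softmax-weighted cosine alignment and the toy shapes are assumptions based on the description above.

```python
import numpy as np

def cosine_align(vision, text):
    """Inject text context into vision tokens via cosine-similarity weights.
    vision: (N, d) vision tokens; text: (M, d) text token embeddings."""
    v = vision / np.linalg.norm(vision, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    sim = v @ t.T                                             # (N, M) cosine similarities
    w = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)  # softmax over text tokens
    return vision + w @ text                                  # residual text injection

def depthwise_separable_conv(x, dw, pw):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution.
    x: (H, W, C); dw: (3, 3, C) per-channel kernels; pw: (C, C_out) channel mixer."""
    H, W, C = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))              # same-size output
    out = np.empty((H, W, C))
    for i in range(H):
        for j in range(W):
            # each channel is convolved with its own 3x3 kernel only
            out[i, j] = (padded[i:i+3, j:j+3, :] * dw).sum(axis=(0, 1))
    return out @ pw                                           # pointwise channel mixing
```

The FLOP saving is easy to see: a k×k separable convolution costs roughly (k² + C_out)·C multiply-adds per pixel versus k²·C·C_out for a full convolution, so with k=3 and C=C_out=64 the separable form is about 8× cheaper.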
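The iterative refinement in step 4 follows the standard denoising-diffusion recipe. A deterministic DDIM-style sampling loop looks roughly like the sketch below; the noise schedule, step count, and latent shape are placeholders, since the paper's exact sampler is not specified here.

```python
import numpy as np

def sample(unet, cond, steps=20, shape=(64, 64, 4), seed=0):
    """Iteratively refine Gaussian noise into a clean latent.
    unet(x, t, cond) is assumed to predict the noise component of x,
    conditioned on the MCP output `cond`."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                   # start from pure noise
    alphas = np.linspace(0.01, 0.999, steps)         # toy schedule: noisy -> clean
    for i, a in enumerate(alphas):
        eps = unet(x, i, cond)                       # predicted noise
        x0 = (x - np.sqrt(1.0 - a) * eps) / np.sqrt(a)   # implied clean latent
        a_next = alphas[i + 1] if i + 1 < len(alphas) else 1.0
        x = np.sqrt(a_next) * x0 + np.sqrt(1.0 - a_next) * eps  # re-noise less
    return x
```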
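The joint objective of step 5 can be written as a single loss over one quadruplet. This is a minimal sketch under stated simplifications: the perceptual term is omitted, the VQA branch is reduced to cross-entropy over a discrete answer vocabulary, and the weighting factor `lam` is a hypothetical hyperparameter.

```python
import numpy as np

def quadruplet_loss(gen_pred, gen_target, vqa_logits, answer_idx, lam=0.5):
    """Joint objective over one (prompt, image, question, answer) tuple."""
    # Generation branch: L2 reconstruction (perceptual term omitted in this sketch)
    l2 = np.mean((gen_pred - gen_target) ** 2)
    # Understanding branch: cross-entropy of the ground-truth answer token
    z = vqa_logits - vqa_logits.max()            # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())      # log-softmax over answers
    ce = -log_probs[answer_idx]
    return l2 + lam * ce                         # shared weights serve both tasks
```

Because both terms backpropagate through the same encoders, a single fine-tuning pass improves generation and VQA together, which is exactly why no separate task heads are needed.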

All operations are implemented with Apple’s Core ML and TensorFlow Lite kernels, ensuring optimal use of the device’s Neural Engine and GPU.

Results & Findings

| Metric | Mobile‑O | Show‑O | JanusFlow |
| --- | --- | --- | --- |
| GenEval (image generation) | 74 % | 69 % | 63 % |
| Avg. VQA accuracy (7 benchmarks) | +15.3 % over Show‑O, +5.1 % over JanusFlow | | |
| Inference time (512×512) | ~3 s on iPhone 14 Pro | 18 s (≈6× slower) | 33 s (≈11× slower) |
  • Despite using ≈10× fewer parameters and ≈5× less training data, Mobile‑O matches or exceeds the generation quality of larger models.
  • The MCP contributes the bulk of the speedup: ablating it slows inference by 4.2× while the quality difference remains negligible, confirming its efficiency‑first design.
  • The quadruplet fine‑tuning improves VQA scores by ~7 % without harming generation fidelity, demonstrating successful multi‑task sharing.

Practical Implications

| Domain | How Mobile‑O Helps |
| --- | --- |
| On‑device AI apps (photo editors, AR filters) | Real‑time text‑to‑image synthesis and instant visual Q&A without the latency or privacy concerns of cloud calls. |
| Edge robotics / drones | Generate contextual overlays (e.g., “show a map of the area”) while simultaneously interpreting visual cues on‑board. |
| Mobile gaming | Dynamically create assets or storyboards from player prompts, keeping the game lightweight and offline‑first. |
| Enterprise field tools | Workers can ask “What part is damaged?” and receive annotated images instantly, boosting inspection workflows. |
| Research prototyping | Developers can iterate on multimodal prompts locally, drastically shortening the feedback loop compared to server‑based pipelines. |

Because the model runs entirely on the device, user data never leaves it, which simplifies compliance with data‑privacy regulations such as GDPR and HIPAA and reduces bandwidth costs; this is critical for applications in remote or low‑connectivity environments.

Limitations & Future Work

  • Resolution ceiling – The current pipeline is tuned for 512×512 images; scaling to 1024×1024 would require either more memory or a multi‑stage upsampling strategy.
  • Domain coverage – Training data is limited to a few million general‑purpose image‑text pairs; niche domains (medical imaging, satellite imagery) may need additional fine‑tuning.
  • Hardware dependence – Performance numbers are reported on recent Apple silicon; older Android devices may see slower inference, suggesting a need for broader hardware benchmarking.
  • Prompt complexity – Very long or highly compositional prompts can degrade generation fidelity, indicating room for richer language modeling or hierarchical conditioning.

Future directions include integrating adapter‑style modules for domain‑specific extensions, exploring progressive diffusion to push resolution limits, and extending the MCP concept to audio‑visual multimodal tasks.

Mobile‑O demonstrates that unified multimodal intelligence is no longer the exclusive domain of cloud‑scale servers. By marrying efficient cross‑modal conditioning with a compact diffusion backbone, it opens the door for a new class of on‑device AI experiences.

Authors

  • Abdelrahman Shaker
  • Ahmed Heakl
  • Jaseel Muhammad
  • Ritesh Thawkar
  • Omkar Thawakar
  • Senmao Li
  • Hisham Cholakkal
  • Ian Reid
  • Eric P. Xing
  • Salman Khan
  • Fahad Shahbaz Khan

Paper Information

  • arXiv ID: 2602.20161v1
  • Categories: cs.CV
  • Published: February 23, 2026
  • PDF: Download PDF