[Paper] MobileI2V: Fast and High-Resolution Image-to-Video on Mobile Devices
Source: arXiv - 2511.21475v1
Overview
The paper introduces MobileI2V, a lightweight diffusion‑based model that can turn a single image into a high‑resolution (720p) video in real time on a smartphone. By rethinking the attention mechanism, compressing the diffusion sampling schedule, and applying mobile‑specific optimizations, the authors achieve sub‑100 ms per‑frame generation, roughly an order of magnitude faster than prior mobile approaches, while keeping visual quality competitive.
Key Contributions
- Hybrid Linear‑Softmax Attention Denoiser – a novel architecture that mixes efficient linear attention with occasional softmax attention, striking a sweet spot between speed and fidelity on mobile hardware.
- Two‑Step Time‑Step Distillation – a training trick that reduces the usual 20+ diffusion steps to just 2 inference steps, delivering a ~10× speed boost with negligible quality loss.
- Mobile‑First Attention Optimizations – low‑level kernel tweaks and memory‑friendly scheduling that double the throughput of attention layers on ARM CPUs/NPUs.
- First Real‑Time 720p I2V on‑device – demonstrates end‑to‑end generation of 720p video clips at under 100 ms per frame on a typical smartphone, a milestone for on‑device creative AI.
- Open‑source Release – full code and pretrained weights are publicly available, enabling immediate experimentation and integration.
Methodology
- Model Backbone – MobileI2V builds on a 270 M‑parameter UNet‑style diffusion denoiser. Instead of using pure softmax attention (expensive on mobile), the authors insert linear attention blocks in most layers and retain softmax attention only where it most impacts quality (e.g., early high‑level feature maps). This “linear‑hybrid” design reduces the quadratic cost of attention to linear while preserving crucial global context (a minimal attention sketch appears after this list).
- Time‑Step Distillation – Traditional diffusion requires many small denoising steps. The authors train a teacher model with the full schedule, then distill its knowledge into a student model that learns to jump directly from a noisy latent to a near‑clean state in just two steps. The distillation loss aligns the student’s outputs with the teacher’s multi‑step trajectory, effectively compressing the sampling process (see the distillation sketch below).
- Mobile‑Specific Optimizations (sketched at the framework level after this list) –
  - Operator Fusion: combine convolution + activation into a single kernel to cut memory traffic.
  - Cache‑Friendly Layout: reorder tensors to match ARM NEON vector lanes, minimizing cache misses.
  - Dynamic Precision: use mixed precision (FP16) where safe, falling back to FP32 only for numerically sensitive layers.
- Training Pipeline – The model is trained on a large video dataset (e.g., UCF‑101, Kinetics) with standard diffusion objectives, plus an auxiliary loss that penalizes temporal inconsistency, ensuring smooth motion across generated frames (see the consistency‑loss sketch below).
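The hybrid attention design can be pictured with a minimal PyTorch-style sketch. Everything here is an illustrative assumption rather than the authors' implementation: the module name `HybridAttention`, the `use_softmax` flag, the `elu(x)+1` feature map, and the example 320‑dim, 12‑block layout. The point it shows is how linear attention replaces the quadratic token‑by‑token matrix with two linear‑cost products, while a few flagged layers keep exact softmax attention.

```python
# Minimal sketch of a hybrid linear/softmax attention layer (single head).
# Names and the layer layout are illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridAttention(nn.Module):
    """Attention block that runs in either linear or softmax mode.

    Linear mode computes phi(Q) (phi(K)^T V), so cost grows linearly with
    the number of tokens; softmax mode keeps the exact quadratic form for
    the layers where global fidelity matters most.
    """

    def __init__(self, dim: int, use_softmax: bool = False):
        super().__init__()
        self.use_softmax = use_softmax
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        if self.use_softmax:
            # Exact attention: O(N^2) in the token count.
            scale = q.shape[-1] ** 0.5
            attn = torch.softmax(q @ k.transpose(-2, -1) / scale, dim=-1)
            out = attn @ v
        else:
            # Linear attention with the elu(x)+1 feature map: O(N) in tokens.
            q, k = F.elu(q) + 1, F.elu(k) + 1
            kv = k.transpose(-2, -1) @ v                              # (dim, dim) summary
            norm = q @ k.sum(dim=1, keepdim=True).transpose(-2, -1)   # per-query normalizer
            out = (q @ kv) / (norm + 1e-6)
        return self.proj(out)


# Hypothetical layout: keep softmax attention only in a couple of early blocks.
blocks = nn.ModuleList(
    [HybridAttention(dim=320, use_softmax=(i < 2)) for i in range(12)]
)
```

The design choice mirrored here is the paper's core trade-off: spend the quadratic attention budget only in the few places where it measurably helps quality, and use the cheap linear form everywhere else.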
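The two‑step distillation can likewise be sketched as a training loss. The helper names (`sample_full`, `denoise`, `step`) are placeholders for whatever the released code actually exposes; the sketch only conveys that the student's two jumps are supervised against the end point of the teacher's full multi‑step trajectory.

```python
# Hedged sketch of step distillation, assuming PyTorch. teacher.sample_full
# and student.denoise are hypothetical interfaces, not the paper's API.
import torch


def distillation_loss(student, teacher, x_T, cond):
    """Match a 2-step student trajectory to the teacher's multi-step one."""
    with torch.no_grad():
        # Teacher denoises with its full schedule (e.g., 20+ steps).
        target = teacher.sample_full(x_T, cond)

    # Student jumps from pure noise to near-clean latents in two steps.
    mid = student.denoise(x_T, cond, step=0)   # first jump
    pred = student.denoise(mid, cond, step=1)  # second jump

    # Align the student's final output with the teacher's trajectory end.
    return torch.mean((pred - target) ** 2)
```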
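The kernel‑level optimizations are ARM‑specific, but two of them can at least be approximated at the framework level: keeping convolution and activation in one fusable unit, and running most layers in reduced precision. This is only an illustrative stand‑in, not the paper's kernels; bfloat16 is used below because stock PyTorch CPU autocast supports it, whereas the mobile runtime targets FP16.

```python
# Framework-level sketch of operator fusion and mixed precision.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusedConvSiLU(nn.Module):
    """Convolution followed immediately by SiLU, kept in one module so an
    exporter/compiler can fuse them into a single kernel."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.silu(self.conv(x))


def run_reduced_precision(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    # Most of the network runs in reduced precision; numerically sensitive
    # layers can opt out by locally disabling autocast in their own forward().
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        return model(x)
```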
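Finally, the auxiliary temporal term can be as simple as a frame‑difference penalty. The exact loss in the paper may differ, so treat this as a generic stand‑in, assuming decoded frames shaped (batch, time, channels, height, width).

```python
# Generic temporal-consistency penalty; discourages frame-to-frame flicker.
import torch


def temporal_consistency_loss(frames: torch.Tensor) -> torch.Tensor:
    # frames: (batch, time, channels, height, width)
    diff = frames[:, 1:] - frames[:, :-1]  # differences between neighboring frames
    return diff.abs().mean()
```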
Results & Findings
| Metric | Prior Mobile‑I2V (baseline) | MobileI2V (2‑step) |
|---|---|---|
| Resolution | 480p | 720p |
| Avg. per‑frame latency (CPU) | ~800 ms | <100 ms |
| FVD (Frechet Video Distance) ↓ | 210 | 185 (≈ 12% improvement) |
| PSNR (video quality) ↑ | 24.1 dB | 24.8 dB |
| Model size | 350 M params | 270 M params |
- Speed: The two‑step distillation yields a 10× speedup; attention optimizations add another 2× gain, enabling real‑time playback on commodity devices.
- Quality: Despite the aggressive speedups, visual quality remains on par with larger desktop‑grade diffusion models, as confirmed by both objective metrics (FVD, PSNR) and user studies.
- Resource Footprint: The model fits comfortably within 1 GB of RAM, making it viable for background apps or AR experiences.
Practical Implications
- On‑Device Creative Apps – Developers can embed video‑generation features (e.g., animated avatars, dynamic storyboards, AR filters) directly into mobile apps without relying on cloud inference, preserving privacy and reducing latency.
- Real‑Time Video Editing – Tools like Instagram Reels or TikTok could offer “turn a photo into a short clip” filters that run instantly on the phone, opening new content‑creation workflows.
- Edge AI for Gaming – Procedurally generated cutscenes or NPC animations could be synthesized on‑the‑fly, shrinking game package sizes and enabling personalized experiences.
- Bandwidth‑Sensitive Scenarios – In low‑connectivity environments (e.g., remote field work), on‑device generation eliminates the need to upload high‑resolution images for server processing.
- Research & Prototyping – The open‑source code provides a solid baseline for developers to experiment with other modalities (e.g., text‑to‑video) or to adapt the hybrid attention scheme to different mobile AI tasks.
Limitations & Future Work
- Hardware Dependency – The reported speeds assume a high‑end ARM CPU/NPU; older devices may still struggle to meet the <100 ms target.
- Temporal Consistency Edge Cases – Fast motions or complex occlusions occasionally produce jitter; further temporal regularization could help.
- Generalization to Diverse Domains – Training data focused on natural scenes; performance on stylized or medical imagery remains untested.
- Scalability Beyond 720p – While 720p is a solid milestone, extending to 1080p or 4K will require additional model compression or hardware acceleration.
The authors suggest exploring adaptive step‑distillation (varying the number of diffusion steps per scene complexity) and hardware‑aware neural architecture search to push the envelope further.
MobileI2V demonstrates that high‑quality, real‑time image‑to‑video synthesis is no longer a cloud‑only luxury. With its hybrid attention design and aggressive step distillation, developers now have a practical toolkit for bringing dynamic video generation straight into users’ pockets.
Authors
- Shuai Zhang
- Bao Tang
- Siyuan Yu
- Yueting Zhu
- Jingfeng Yao
- Ya Zou
- Shanglin Yuan
- Li Yu
- Wenyu Liu
- Xinggang Wang
Paper Information
- arXiv ID: 2511.21475v1
- Categories: cs.CV
- Published: November 26, 2025