[Paper] NanoFLUX: Distillation-Driven Compression of Large Text-to-Image Generation Models for Mobile Devices

Published: February 6, 2026 at 12:05 PM EST

Source: arXiv - 2602.06879v1

Overview

The paper presents NanoFLUX, a compact 2.4 B‑parameter text‑to‑image generation model that runs on modern smartphones in just a few seconds. By distilling the 12 B‑parameter FLUX.1‑Schnell diffusion model and applying a series of complementary compression techniques, the authors bridge the gap between state‑of‑the‑art visual synthesis and on‑device deployment.

Key Contributions

  • Progressive model compression that trims the diffusion transformer from 12 B to ~2 B parameters while keeping visual fidelity.
  • ResNet‑based token down‑sampling, enabling early transformer blocks to work on lower‑resolution token maps, cutting latency without sacrificing final image quality.
  • Text‑encoder distillation that injects visual cues from the denoiser’s early layers into the language representation, improving text‑image alignment.
  • Real‑world benchmark: 512 × 512 images generated in ~2.5 s on a flagship smartphone, a first for high‑resolution diffusion models on‑device.

Methodology

  1. Teacher‑Student Distillation – The large FLUX.1‑Schnell model (the “teacher”) generates reference images and intermediate denoising features. A smaller “student” network learns to mimic both the final output and the intermediate dynamics, preserving the diffusion process’s expressive power.
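The dual objective in step 1 can be sketched as a weighted sum of an output‑matching term and an intermediate‑feature‑matching term. The function names and the weight below are illustrative, not the paper's actual loss:

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(student_out, teacher_out,
                      student_feats, teacher_feats,
                      feat_weight=0.5):
    """Combine output matching with per-layer feature matching.

    student_feats / teacher_feats are lists of intermediate denoising
    feature vectors, one per distilled layer (hypothetical layout).
    """
    out_term = mse(student_out, teacher_out)
    feat_term = sum(mse(s, t) for s, t in zip(student_feats, teacher_feats))
    feat_term /= max(len(student_feats), 1)
    return out_term + feat_weight * feat_term

# Identical outputs and features give zero loss.
loss = distillation_loss([0.1, 0.2], [0.1, 0.2], [[1.0]], [[1.0]])
```

Matching intermediate features, not just final images, is what lets the student imitate the teacher's denoising trajectory rather than only its end point.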

  2. Transformer Pruning – Redundant attention heads and feed‑forward dimensions are identified via sensitivity analysis and removed. This reduces the transformer’s parameter count from 12 B to roughly 2 B while keeping the most informative pathways.
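Sensitivity‑driven pruning as in step 2 can be illustrated with a toy helper that ranks attention heads by a precomputed sensitivity score and keeps only the top fraction; the scores and interface here are hypothetical, not the paper's procedure:

```python
def prune_heads(head_scores, keep_ratio=0.5):
    """Keep the highest-sensitivity attention heads.

    head_scores: {head_index: sensitivity} where sensitivity measures
    how much output quality degrades when that head is removed
    (illustrative values; the paper's analysis is more involved).
    Returns the sorted indices of the heads to keep.
    """
    k = max(1, int(len(head_scores) * keep_ratio))
    ranked = sorted(head_scores, key=head_scores.get, reverse=True)
    return sorted(ranked[:k])

# Keep the two most sensitive of four heads.
kept = prune_heads({0: 0.9, 1: 0.1, 2: 0.5, 3: 0.7}, keep_ratio=0.5)
```

The same ranking idea applies to feed‑forward dimensions: score, sort, drop the least informative, then fine‑tune.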

  3. ResNet Token Down‑Sampler – Before the first few transformer layers, a lightweight ResNet reduces the spatial token resolution (e.g., from 64 × 64 to 32 × 32). Later layers upscale the tokens back, allowing the bulk of computation to happen on a smaller representation.
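The down‑sample/up‑sample idea in step 3 can be sketched with fixed 2 × 2 average pooling and nearest‑neighbour upsampling standing in for the paper's learned ResNet modules (all details below are illustrative simplifications):

```python
def avg_pool2x2(tokens):
    """2x2 average pooling over a 2-D token map (e.g. 64x64 -> 32x32),
    so early transformer blocks attend over 4x fewer tokens."""
    h, w = len(tokens), len(tokens[0])
    return [[(tokens[i][j] + tokens[i][j + 1] +
              tokens[i + 1][j] + tokens[i + 1][j + 1]) / 4.0
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

def upsample2x(tokens):
    """Nearest-neighbour upsampling back to the original resolution,
    standing in for the learned upscaling in later layers."""
    out = []
    for row in tokens:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

small = avg_pool2x2([[1, 3], [5, 7]])   # one pooled token
restored = upsample2x(small)            # back to 2x2
```

Since self‑attention cost grows quadratically with token count, halving the spatial resolution roughly quarters the attention work in the early blocks, which is where the latency savings come from.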

  4. Cross‑Modal Text Encoder Distillation – The text encoder is trained not only on language data but also to predict visual features extracted from early denoiser layers. This aligns textual embeddings more closely with the visual generation pipeline, improving prompt adherence.
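The cross‑modal objective in step 4 can be sketched as a language term plus an alignment term that pulls textual embeddings toward early‑denoiser visual features; every name and weight below is an assumption, not the paper's formulation:

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def text_encoder_loss(text_emb, target_text_emb,
                      predicted_visual, denoiser_visual,
                      align_weight=0.3):
    """Standard language-distillation term plus a cross-modal term:
    the encoder must also predict visual features extracted from the
    denoiser's early layers (hypothetical shapes and weighting)."""
    lang_term = mse(text_emb, target_text_emb)
    align_term = mse(predicted_visual, denoiser_visual)
    return lang_term + align_weight * align_term
```

The extra term ties prompt embeddings to what the denoiser actually consumes, which is the mechanism behind the improved prompt adherence reported in the ablation.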

  5. Progressive Fine‑Tuning – After each compression step, the model is fine‑tuned on the original diffusion training set to recover any lost quality, resulting in a smooth “compression ladder” from teacher to final student.
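The overall ladder in step 5 reduces to alternating compression and recovery fine‑tuning. The toy `halve`/`recover` functions below are purely illustrative stand‑ins for real pruning and training steps:

```python
def compression_ladder(model, steps, finetune):
    """Apply each compression step, fine-tuning after each one so
    quality is recovered before the next rung of the ladder."""
    for compress in steps:
        model = compress(model)
        model = finetune(model)
    return model

# Toy walk-through with made-up numbers: each rung halves the
# parameter count and fine-tuning claws back part of the quality drop.
start = {"params": 12.0, "quality": 1.00}

def halve(m):
    return {"params": m["params"] / 2, "quality": m["quality"] * 0.9}

def recover(m):
    return {"params": m["params"], "quality": min(1.0, m["quality"] * 1.05)}

final = compression_ladder(start, [halve, halve], recover)
```

Compressing gradually with recovery at each rung avoids the large, hard‑to‑repair quality cliff of pruning the full model in one shot.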

Results & Findings

  • Speed: 512 × 512 image generation in ≈2.5 s on a flagship Android phone (Snapdragon 8 Gen 2) using a single GPU core.
  • Quality: Human evaluation and CLIP‑based similarity scores show only a ≈5 % drop compared to the 12 B teacher, barely perceptible for most consumer use‑cases.
  • Parameter Efficiency: The final model occupies ~2 GB of storage (compressed) and fits comfortably within typical mobile memory budgets.
  • Ablation: Removing the token down‑sampler increases latency by ~40 % with negligible quality gain; omitting text‑encoder distillation leads to a noticeable decline in prompt fidelity (≈12 % lower CLIP‑score).

Practical Implications

  • On‑Device Creative Apps – Developers can embed high‑resolution text‑to‑image generation directly into photo editors, AR filters, or social‑media stickers without relying on cloud APIs, reducing latency and preserving user privacy.
  • Edge‑AI Services – Enterprises can deploy personalized content generation (e.g., marketing visuals, product mock‑ups) on edge devices, lowering bandwidth costs and enabling offline operation.
  • Rapid Prototyping – The compression pipeline can be adapted to other diffusion models (e.g., video or 3‑D generation), offering a roadmap for bringing more generative AI capabilities to the edge.
  • Energy Efficiency – Running locally avoids the energy overhead of data transmission to servers, which is especially valuable for battery‑constrained devices.

Limitations & Future Work

  • Hardware Dependency – The reported 2.5 s latency assumes a high‑end mobile GPU; lower‑tier devices will see slower performance.
  • Resolution Ceiling – While 512 × 512 is impressive, scaling to 1024 × 1024 still requires cloud resources.
  • Generalization – The model was distilled on the same data distribution as FLUX.1‑Schnell; performance on out‑of‑domain prompts may degrade.
  • Future Directions – The authors suggest exploring quantization‑aware training, mixed‑precision inference, and extending the token‑down‑sampling concept to multi‑modal diffusion pipelines (e.g., text‑to‑video).

Authors

  • Ruchika Chavhan
  • Malcolm Chadwick
  • Alberto Gil Couto Pimentel Ramos
  • Luca Morreale
  • Mehdi Noroozi
  • Abhinav Mehrotra

Paper Information

  • arXiv ID: 2602.06879v1
  • Categories: cs.CV, cs.AI
  • Published: February 6, 2026