[Paper] NanoFLUX: Distillation-Driven Compression of Large Text-to-Image Generation Models for Mobile Devices

Published: February 6, 2026 at 12:05 PM EST

Source: arXiv - 2602.06879v1

Overview

The paper presents NanoFLUX, a compact 2.4 B‑parameter text‑to‑image generation model that runs on modern smartphones in just a few seconds. By distilling the 12 B‑parameter FLUX.1‑Schnell diffusion model and applying a series of complementary compression techniques, the authors bridge the gap between state‑of‑the‑art visual synthesis and on‑device deployment.

Key Contributions

  • Progressive model compression that trims the diffusion transformer from 12 B to ~2 B parameters while keeping visual fidelity.
  • ResNet‑based token down‑sampling, enabling early transformer blocks to work on lower‑resolution token maps, cutting latency without sacrificing final image quality.
  • Text‑encoder distillation that injects visual cues from the denoiser’s early layers into the language representation, improving text‑image alignment.
  • Real‑world benchmark: 512 × 512 images generated in ~2.5 s on a flagship smartphone, a first for high‑resolution diffusion models on‑device.

Methodology

  1. Teacher‑Student Distillation – The large FLUX.1‑Schnell model (the “teacher”) generates reference images and intermediate denoising features. A smaller “student” network learns to mimic both the final output and the intermediate dynamics, preserving the diffusion process’s expressive power.
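The dual objective in step 1 can be sketched as a weighted sum of an output‑matching term and an intermediate‑feature‑matching term. The function names and the weight below are illustrative, not the paper's actual loss:

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(student_out, teacher_out,
                      student_feats, teacher_feats,
                      feat_weight=0.5):
    """Combine output matching with per-layer feature matching.

    student_feats / teacher_feats are lists of intermediate denoising
    feature vectors, one per distilled layer (hypothetical layout).
    """
    out_term = mse(student_out, teacher_out)
    feat_term = sum(mse(s, t) for s, t in zip(student_feats, teacher_feats))
    feat_term /= max(len(student_feats), 1)
    return out_term + feat_weight * feat_term

# Identical outputs and features give zero loss.
loss = distillation_loss([0.1, 0.2], [0.1, 0.2], [[1.0]], [[1.0]])
```

Matching intermediate features, not just final images, is what lets the student imitate the teacher's denoising trajectory rather than only its end point.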

  2. Transformer Pruning – Redundant attention heads and feed‑forward dimensions are identified via sensitivity analysis and removed. This reduces the transformer’s parameter count from 12 B to roughly 2 B while keeping the most informative pathways.
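Sensitivity‑driven pruning as in step 2 can be illustrated with a toy helper that ranks attention heads by a precomputed sensitivity score and keeps only the top fraction; the scores and interface here are hypothetical, not the paper's procedure:

```python
def prune_heads(head_scores, keep_ratio=0.5):
    """Keep the highest-sensitivity attention heads.

    head_scores: {head_index: sensitivity} where sensitivity measures
    how much output quality degrades when that head is removed
    (illustrative values; the paper's analysis is more involved).
    Returns the sorted indices of the heads to keep.
    """
    k = max(1, int(len(head_scores) * keep_ratio))
    ranked = sorted(head_scores, key=head_scores.get, reverse=True)
    return sorted(ranked[:k])

# Keep the two most sensitive of four heads.
kept = prune_heads({0: 0.9, 1: 0.1, 2: 0.5, 3: 0.7}, keep_ratio=0.5)
```

The same ranking idea applies to feed‑forward dimensions: score, sort, drop the least informative, then fine‑tune.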

  3. ResNet Token Down‑Sampler – Before the first few transformer layers, a lightweight ResNet reduces the spatial token resolution (e.g., from 64 × 64 to 32 × 32). Later layers upscale the tokens back, allowing the bulk of computation to happen on a smaller representation.
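The down‑sample/up‑sample idea in step 3 can be sketched with fixed 2 × 2 average pooling and nearest‑neighbour upsampling standing in for the paper's learned ResNet modules (all details below are illustrative simplifications):

```python
def avg_pool2x2(tokens):
    """2x2 average pooling over a 2-D token map (e.g. 64x64 -> 32x32),
    so early transformer blocks attend over 4x fewer tokens."""
    h, w = len(tokens), len(tokens[0])
    return [[(tokens[i][j] + tokens[i][j + 1] +
              tokens[i + 1][j] + tokens[i + 1][j + 1]) / 4.0
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

def upsample2x(tokens):
    """Nearest-neighbour upsampling back to the original resolution,
    standing in for the learned upscaling in later layers."""
    out = []
    for row in tokens:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

small = avg_pool2x2([[1, 3], [5, 7]])   # one pooled token
restored = upsample2x(small)            # back to 2x2
```

Since self‑attention cost grows quadratically with token count, halving the spatial resolution roughly quarters the attention work in the early blocks, which is where the latency savings come from.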

  4. Cross‑Modal Text Encoder Distillation – The text encoder is trained not only on language data but also to predict visual features extracted from early denoiser layers. This aligns textual embeddings more closely with the visual generation pipeline, improving prompt adherence.
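The cross‑modal objective in step 4 can be sketched as a language term plus an alignment term that pulls textual embeddings toward early‑denoiser visual features; every name and weight below is an assumption, not the paper's formulation:

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def text_encoder_loss(text_emb, target_text_emb,
                      predicted_visual, denoiser_visual,
                      align_weight=0.3):
    """Standard language-distillation term plus a cross-modal term:
    the encoder must also predict visual features extracted from the
    denoiser's early layers (hypothetical shapes and weighting)."""
    lang_term = mse(text_emb, target_text_emb)
    align_term = mse(predicted_visual, denoiser_visual)
    return lang_term + align_weight * align_term
```

The extra term ties prompt embeddings to what the denoiser actually consumes, which is the mechanism behind the improved prompt adherence reported in the ablation.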

  5. Progressive Fine‑Tuning – After each compression step, the model is fine‑tuned on the original diffusion training set to recover any lost quality, resulting in a smooth “compression ladder” from teacher to final student.
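The overall ladder in step 5 reduces to alternating compression and recovery fine‑tuning. The toy `halve`/`recover` functions below are purely illustrative stand‑ins for real pruning and training steps:

```python
def compression_ladder(model, steps, finetune):
    """Apply each compression step, fine-tuning after each one so
    quality is recovered before the next rung of the ladder."""
    for compress in steps:
        model = compress(model)
        model = finetune(model)
    return model

# Toy walk-through with made-up numbers: each rung halves the
# parameter count and fine-tuning claws back part of the quality drop.
start = {"params": 12.0, "quality": 1.00}

def halve(m):
    return {"params": m["params"] / 2, "quality": m["quality"] * 0.9}

def recover(m):
    return {"params": m["params"], "quality": min(1.0, m["quality"] * 1.05)}

final = compression_ladder(start, [halve, halve], recover)
```

Compressing gradually with recovery at each rung avoids the large, hard‑to‑repair quality cliff of pruning the full model in one shot.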

Results & Findings

  • Speed: 512 × 512 image generation in ≈2.5 s on a flagship Android phone (Snapdragon 8 Gen 2) using a single GPU core.
  • Quality: Human evaluation and CLIP‑based similarity scores show only a ≈5 % drop compared to the 12 B teacher, barely perceptible for most consumer use‑cases.
  • Parameter Efficiency: The final model occupies ~2 GB of storage (compressed) and fits comfortably within typical mobile memory budgets.
  • Ablation: Removing the token down‑sampler increases latency by ~40 % with negligible quality gain; omitting text‑encoder distillation leads to a noticeable decline in prompt fidelity (≈12 % lower CLIP‑score).

Practical Implications

  • On‑Device Creative Apps – Developers can embed high‑resolution text‑to‑image generation directly into photo editors, AR filters, or social‑media stickers without relying on cloud APIs, reducing latency and preserving user privacy.
  • Edge‑AI Services – Enterprises can deploy personalized content generation (e.g., marketing visuals, product mock‑ups) on edge devices, lowering bandwidth costs and enabling offline operation.
  • Rapid Prototyping – The compression pipeline can be adapted to other diffusion models (e.g., video or 3‑D generation), offering a roadmap for bringing more generative AI capabilities to the edge.
  • Energy Efficiency – Running locally avoids the energy overhead of data transmission to servers, which is especially valuable for battery‑constrained devices.

Limitations & Future Work

  • Hardware Dependency – The reported 2.5 s latency assumes a high‑end mobile GPU; lower‑tier devices will see slower performance.
  • Resolution Ceiling – While 512 × 512 is impressive, scaling to 1024 × 1024 still requires cloud resources.
  • Generalization – The model was distilled on the same data distribution as FLUX.1‑Schnell; performance on out‑of‑domain prompts may degrade.
  • Future Directions – The authors suggest exploring quantization‑aware training, mixed‑precision inference, and extending the token‑down‑sampling concept to multi‑modal diffusion pipelines (e.g., text‑to‑video).

Authors

  • Ruchika Chavhan
  • Malcolm Chadwick
  • Alberto Gil Couto Pimentel Ramos
  • Luca Morreale
  • Mehdi Noroozi
  • Abhinav Mehrotra

Paper Information

  • arXiv ID: 2602.06879v1
  • Categories: cs.CV, cs.AI
  • Published: February 6, 2026