[Paper] ArcFlow: Unleashing 2-Step Text-to-Image Generation via High-Precision Non-Linear Flow Distillation

Published: February 9, 2026
Source: arXiv


Overview

ArcFlow tackles one of the biggest pain points of diffusion‑based text‑to‑image models: the need for dozens or even hundreds of denoising steps at inference time. By distilling a large, high‑quality diffusion teacher into a 2‑step generator that follows a non‑linear flow, the authors achieve roughly 40× faster generation while keeping visual fidelity almost intact. This makes high‑end diffusion models far more practical for real‑time or resource‑constrained applications.

Key Contributions

  • Non‑linear flow distillation: Introduces a continuous momentum‑based parameterization that captures the evolving velocity of the teacher’s diffusion trajectory, rather than relying on simple linear shortcuts.
  • Analytical integration: Derives a closed‑form solution for the non‑linear trajectory, eliminating discretization errors that typically arise when approximating diffusion steps numerically.
  • Lightweight adapters: Implements the distilled student with < 5 % of the teacher’s parameters, enabling fast fine‑tuning on existing large models (e.g., Qwen‑Image‑20B, FLUX.1‑dev).
  • 2‑step inference: Demonstrates that a 2‑step (2 NFEs) generator can match the quality of the original multi‑step teacher across standard benchmarks, delivering a ~40× speedup.
  • Extensive evaluation: Provides both quantitative metrics (FID, CLIP‑Score) and qualitative visual comparisons, confirming minimal quality loss despite the drastic reduction in steps.
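To make the "2-step (2 NFEs)" idea concrete, here is a minimal sketch of a two-evaluation sampling loop. The `student_velocity` callable and its interface are hypothetical stand-ins for the distilled student, not the paper's actual API; the toy velocity field is purely illustrative:

```python
import numpy as np

def two_step_sample(student_velocity, z_T, t_schedule=(1.0, 0.5, 0.0)):
    """Minimal 2-NFE sampling loop (hypothetical interface).

    `student_velocity(z, t)` is assumed to return the distilled flow's
    effective velocity for the step starting at time t, with any
    non-linear trajectory already integrated inside the student.
    """
    z = z_T
    for t, t_next in zip(t_schedule[:-1], t_schedule[1:]):
        # Exactly one network evaluation per step: 2 NFEs total.
        z = z + (t_next - t) * student_velocity(z, t)
    return z

# Toy velocity field that shrinks the latent toward the origin.
toy_velocity = lambda z, t: z
z0 = two_step_sample(toy_velocity, np.ones(4))
print(z0)  # each entry: 1.0 -> 0.5 -> 0.25
```

The key property is that the loop body runs only twice, regardless of how many steps the original teacher used.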

Methodology

  1. Teacher trajectory extraction – The pre‑trained diffusion model (the “teacher”) defines a sequence of latent states and associated velocities (the “tangent” of the diffusion path) across many timesteps.
  2. Mixture of continuous momentum processes – ArcFlow models the velocity field as a weighted sum of simple continuous dynamics (think of several “mini‑physics engines” running in parallel). This mixture can flexibly adapt its direction and magnitude as time progresses, mimicking the teacher’s changing velocity.
  3. Analytical integration – Because the mixture has a known mathematical form, the authors integrate it analytically to obtain the exact latent state after a single denoising step, sidestepping the need for Euler or Runge‑Kutta approximations.
  4. Adapter‑based distillation – Small trainable modules (adapters) are inserted into the teacher’s architecture. During distillation, these adapters learn to produce the parameters of the momentum mixture that best reproduce the teacher’s trajectory over just two large steps.
  5. Training objective – The student is optimized to minimize the distance between its analytically integrated states and the teacher’s true states, while also preserving diversity via standard diffusion losses (e.g., reconstruction and classifier‑free guidance terms).
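Steps 2–3 can be illustrated with a toy mixture of exponentially decaying momentum components. The specific parameterization below is an assumption chosen for illustration (the paper's exact mixture form may differ); the point is that a velocity field with a known closed-form integral accumulates no discretization error, whereas coarse Euler steps do:

```python
import numpy as np

# Illustrative mixture: two momentum components with different decay rates.
weights = np.array([0.7, 0.3])
rates = np.array([1.0, 5.0])

def velocity(t):
    # Mixture velocity at time t.
    return np.sum(weights * np.exp(-rates * t))

def displacement_analytic(t0, t1):
    # Closed-form integral of the mixture over [t0, t1]: no discretization error.
    return np.sum(weights / rates * (np.exp(-rates * t0) - np.exp(-rates * t1)))

def displacement_euler(t0, t1, n_steps):
    # Naive forward-Euler integration for comparison.
    ts = np.linspace(t0, t1, n_steps + 1)
    dt = (t1 - t0) / n_steps
    return sum(velocity(t) * dt for t in ts[:-1])

exact = displacement_analytic(0.0, 1.0)
coarse = displacement_euler(0.0, 1.0, 2)    # large error with only 2 steps
fine = displacement_euler(0.0, 1.0, 1000)   # approaches the exact value
```

With only two Euler steps the numerical displacement deviates noticeably from the closed form; this gap is exactly what the analytical integration sidesteps when taking large steps.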

Results & Findings

| Model (Teacher) | Student (ArcFlow) | NFEs | Speedup vs. Teacher | FID ↓ | CLIP‑Score ↔ |
|---|---|---|---|---|---|
| Qwen‑Image‑20B | ArcFlow‑20B‑2step | 2 | ~40× | +2.1 (≈ same) | ±0.01 (negligible) |
| FLUX.1‑dev | ArcFlow‑FLUX‑2step | 2 | ~38× | +1.8 | ±0.02 |
  • Visual quality: Side‑by‑side samples show that the 2‑step outputs retain fine details, color fidelity, and composition comparable to the original 50‑step diffusion results.
  • Stability: Training converges within a few hours on a single A100 GPU, thanks to the analytical integration that provides a smooth loss landscape.
  • Parameter efficiency: Only ~4 % of the teacher’s weights are updated, meaning the bulk of the model can be reused unchanged.
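For a rough sense of where a ~40× figure can come from, consider the arithmetic below. The teacher's step count and the classifier-free-guidance doubling are assumptions for illustration (the paper's exact baseline configuration may differ), and wall-clock speedup also depends on fixed overheads:

```python
teacher_steps = 50   # assumed teacher denoising steps
cfg_passes = 2       # classifier-free guidance ~doubles forward passes (assumption)
teacher_nfes = teacher_steps * cfg_passes

student_nfes = 2     # ArcFlow's 2-step generator

reduction = teacher_nfes / student_nfes
print(f"forward-pass reduction: {reduction:.0f}x")  # 50x
```

A ~50× reduction in forward passes landing at a ~40× measured speedup is plausible once non-network costs (text encoding, VAE decoding, memory movement) are included.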

Practical Implications

  • Real‑time generation: Applications like AI‑assisted design tools, interactive storyboarding, or on‑device image synthesis can now run diffusion‑level quality models with latency comparable to GANs.
  • Cost reduction: Cloud inference bills drop dramatically when the number of NFEs falls from ~50 to 2, making large‑scale diffusion services economically viable.
  • Edge deployment: The lightweight adapter approach means the heavy backbone can stay frozen while a small fine‑tuned module runs on consumer‑grade hardware (e.g., RTX‑30xx series, Apple M‑series).
  • Rapid prototyping: Developers can take any existing diffusion checkpoint, attach ArcFlow adapters, and obtain a fast 2‑step version without retraining the whole model from scratch.
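To sketch why the adapter route is so cheap, here is a LoRA-style low-rank residual attached to a frozen layer. The low-rank form is an assumption for illustration; the paper's actual adapter architecture may differ, but the parameter-fraction arithmetic is the same idea:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 1024
# Frozen backbone weight (stand-in for one teacher layer; never updated).
W_frozen = rng.normal(size=(d, d))

# Low-rank adapter: the only trainable part.
rank = 16
A = np.zeros((d, rank))              # zero-init so the adapter starts as a no-op
B = rng.normal(size=(rank, d)) * 0.01

def layer(x):
    # Backbone output plus the adapter's low-rank correction.
    return x @ W_frozen + (x @ A) @ B

trainable = A.size + B.size
total = W_frozen.size + trainable
frac = trainable / total
print(f"trainable fraction: {frac:.2%}")  # ~3% of the layer's parameters
```

At rank 16 the adapter adds about 3% of the layer's parameter count, in the same ballpark as the ~4–5% trainable fraction reported for ArcFlow.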

Limitations & Future Work

  • Residual quality gap: Although minimal, a slight degradation in FID remains, especially for highly complex scenes with intricate textures.
  • Scope of adapters: The current implementation fine‑tunes only a tiny fraction of parameters; exploring richer adapter architectures could further close the quality gap.
  • Generalization to other modalities: The paper focuses on text‑to‑image; extending ArcFlow to video diffusion or audio generation will require handling higher‑dimensional trajectories.
  • Theoretical guarantees: While the analytical integration eliminates discretization error, formal bounds on how closely the distilled flow matches the teacher’s optimal trajectory are left for future analysis.

ArcFlow demonstrates that with a clever mathematical re‑thinking of diffusion dynamics, we can bring the impressive quality of large diffusion models into the fast‑lane of practical deployment. For developers eager to harness state‑of‑the‑art text‑to‑image generation without the usual latency penalty, ArcFlow offers a ready‑to‑use recipe that bridges the gap between research and production.

Authors

  • Zihan Yang
  • Shuyuan Tu
  • Licheng Zhang
  • Qi Dai
  • Yu-Gang Jiang
  • Zuxuan Wu

Paper Information

  • arXiv ID: 2602.09014v1
  • Categories: cs.CV, cs.AI
  • Published: February 9, 2026
