[Paper] Improved Mean Flows: On the Challenges of Fastforward Generative Models

Published: December 1, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.02012v1

Overview

The paper introduces Improved MeanFlow (iMF), a new take on fast‑forward (single‑step) generative models that sidesteps two long‑standing hurdles in the original MeanFlow framework: an unstable training objective and a rigid guidance mechanism. By redefining the loss in terms of an instantaneous velocity field and making guidance a flexible conditioning input, iMF reaches a FID of 1.72 on ImageNet‑256 with just one function evaluation—matching or beating many multi‑step diffusion models while keeping the model size modest.

Key Contributions

  • Re‑parameterized training objective: Switches from a network‑dependent loss to a clean regression on the instantaneous velocity (v), stabilizing training.
  • Explicit, flexible guidance: Treats classifier‑free guidance scale as a conditioning variable rather than a fixed hyper‑parameter, enabling on‑the‑fly trade‑offs at inference.
  • In‑context conditioning pipeline: Packs diverse conditioning signals (e.g., class labels, guidance scale) into a single context vector, reducing parameter count and improving performance.
  • State‑of‑the‑art single‑step results: Achieves 1.72 FID on ImageNet‑256×256 with 1‑NFE, closing the quality gap to multi‑step diffusion models without any distillation tricks.
  • Fully trained from scratch: Demonstrates that fast‑forward models can be competitive without relying on pretrained diffusion checkpoints.

Methodology

  1. MeanFlow background – Traditional MeanFlow predicts an average velocity field (u) that, when integrated over a unit time step, yields a fast‑forward transformation from noise to data. The original formulation couples the loss to the network’s own output, making optimization noisy.

  2. Instantaneous velocity loss – iMF introduces a separate network that predicts the instantaneous velocity (v). The training objective becomes a straightforward mean‑squared error between the predicted (v) and the ground‑truth instantaneous velocity given by the noise‑to‑data interpolation path. This decouples the loss from the model’s own predictions and turns training into a standard regression task (a minimal sketch follows this list).

  3. Guidance as conditioning – Instead of fixing the classifier‑free guidance scale (γ) during training, iMF feeds γ (and any other side information such as class tokens) into the model as part of an in‑context conditioning vector. At inference time, developers can vary γ to trade off fidelity vs. diversity without retraining.

  4. Model architecture – The authors use a UNet‑style backbone similar to those in diffusion models; the conditioning vector is injected via cross‑attention layers, allowing a single set of weights to handle many guidance settings.

  5. Training regime – The model is trained end‑to‑end on ImageNet‑256 with standard data augmentations, the Adam optimizer, and a cosine learning‑rate schedule. No teacher‑student distillation or multi‑step pre‑training is employed.
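
To make steps 2–3 concrete, here is a minimal PyTorch‑style sketch of the re‑parameterized objective. It is a sketch under stated assumptions, not the paper’s exact recipe: it assumes a linear noise‑to‑data interpolation x_t = (1 − t)·x + t·ε, under which the ground‑truth instantaneous velocity is v = ε − x (MeanFlow’s average velocity then relates to v via u(x, r, t) = (1/(t − r)) ∫ v ds over [r, t]). The model signature, the sampling range for γ, and the context‑packing helper are all illustrative.

```python
import torch
import torch.nn.functional as F

def pack_context(class_labels, gamma, num_classes=1000):
    """Hypothetical in-context packing: class one-hot plus guidance scale."""
    onehot = F.one_hot(class_labels, num_classes=num_classes).float()
    return torch.cat([onehot, gamma.unsqueeze(1)], dim=1)

def imf_training_step(model, x, class_labels, optimizer):
    """One step of the instantaneous-velocity regression (sketch).

    Assumes a linear path x_t = (1 - t) * x + t * eps, whose ground-truth
    instantaneous velocity v = eps - x is a fixed regression target that
    does not depend on the network's own output.
    """
    b = x.size(0)
    eps = torch.randn_like(x)                       # noise endpoint
    t = torch.rand(b, device=x.device)              # random time in [0, 1]
    t4 = t.view(b, 1, 1, 1)                         # broadcast over C, H, W
    x_t = (1 - t4) * x + t4 * eps                   # point on the path
    v_target = eps - x                              # ground-truth velocity

    # Guidance scale is sampled at train time and fed in as conditioning,
    # so one set of weights covers a whole family of guidance settings
    # (the uniform range here is an assumption).
    gamma = torch.empty(b, device=x.device).uniform_(1.0, 4.0)
    context = pack_context(class_labels, gamma)

    v_pred = model(x_t, t, context)                 # assumed signature
    loss = F.mse_loss(v_pred, v_target)             # plain regression loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key design point this illustrates: the target v_target is fixed by the interpolation path, so the loss is an ordinary regression rather than one coupled to the network’s own output.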

Results & Findings

Metric                     | iMF (1‑NFE)       | Prior fast‑forward (e.g., original MF) | Multi‑step diffusion (≈10 NFE)
FID (ImageNet‑256)         | 1.72              | > 3.0                                  | 1.5 – 2.0
Sampling time (per image)  | ~30 ms (GPU)      | ~30 ms                                 | ~300 ms
Model size                 | ~300 M parameters | ~300 M                                 | 500 M+
  • Training stability improves dramatically; loss curves are smooth and converge faster than the original MF’s.
  • Guidance flexibility: Varying γ at test time yields a smooth quality‑diversity curve, something the original MF could not provide.
  • No distillation needed: iMF matches the quality of diffusion models that rely on expensive teacher‑student pipelines, proving that a single‑step approach can stand on its own.

Practical Implications

  • Real‑time image generation: With a single network pass, developers can embed high‑fidelity generation into interactive apps (e.g., AI‑assisted design tools, game asset pipelines) without the latency penalties of multi‑step diffusion.
  • Dynamic trade‑offs: Because guidance scale is a runtime input, services can expose a “quality slider” to end‑users, adjusting fidelity on the fly based on bandwidth or compute constraints (see the sampling sketch after this list).
  • Reduced infrastructure cost: Fewer inference steps translate to lower GPU utilization, enabling cheaper cloud deployment or on‑device inference on high‑end mobile GPUs.
  • Simplified training pipelines: Training from scratch eliminates the need for large pretrained diffusion checkpoints, making it easier for organizations to train domain‑specific fast‑forward models (e.g., medical imaging, satellite data).
  • Compatibility with existing tooling: iMF’s UNet backbone and cross‑attention conditioning can be dropped into popular libraries (PyTorch, Diffusers) with minimal code changes.
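
Because the guidance scale is an ordinary model input, the “quality slider” above is a one‑line change at call time. Below is a minimal sketch, assuming a trained average‑velocity network u_model(x, r, t, context) and the same illustrative context packing as the training sketch; names and shapes are assumptions, not the paper’s API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_one_step(u_model, class_labels, gamma, shape=(3, 256, 256)):
    """1-NFE sampling sketch: a single evaluation of the average velocity.

    Assumes u_model predicts the average velocity u(x, r, t) over [r, t];
    with r = 0 and t = 1, one call maps pure noise x_1 to a sample via
    x_0 = x_1 - (t - r) * u.
    """
    b = class_labels.size(0)
    device = class_labels.device
    x1 = torch.randn(b, *shape, device=device)      # pure noise at t = 1
    g = torch.full((b,), float(gamma), device=device)
    onehot = F.one_hot(class_labels, num_classes=1000).float()
    context = torch.cat([onehot, g.unsqueeze(1)], dim=1)
    r = torch.zeros(b, device=device)               # target time
    t = torch.ones(b, device=device)                # start time
    u = u_model(x1, r, t, context)                  # the single NFE
    return x1 - (t - r).view(b, 1, 1, 1) * u

# Same weights, different guidance per request -- no retraining:
#   sharp   = sample_one_step(u_model, labels, gamma=4.0)
#   diverse = sample_one_step(u_model, labels, gamma=1.0)
```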

Limitations & Future Work

  • Scalability to higher resolutions: The paper reports results up to 256×256; extending to 512×512 or beyond may require architectural tweaks or more compute.
  • Conditioning diversity: While class labels and guidance scale are supported, richer modalities (text, sketches) were not explored and could pose integration challenges.
  • Theoretical guarantees: The reformulation improves empirical stability, but a formal analysis of convergence or optimality under the new loss remains open.
  • Benchmark breadth: Evaluation focuses on ImageNet; testing on other domains (audio, video, 3‑D) would solidify the claim that fast‑forward modeling is a universal paradigm.

Overall, iMF pushes fast‑forward generative modeling from a research curiosity toward a practical tool that developers can adopt today.

Authors

  • Zhengyang Geng
  • Yiyang Lu
  • Zongze Wu
  • Eli Shechtman
  • J. Zico Kolter
  • Kaiming He

Paper Information

  • arXiv ID: 2512.02012v1
  • Categories: cs.CV, cs.LG
  • Published: December 1, 2025
  • PDF: https://arxiv.org/pdf/2512.02012v1