[Paper] Improved Mean Flows: On the Challenges of Fastforward Generative Models
Source: arXiv - 2512.02012v1
Overview
The paper introduces Improved MeanFlow (iMF), a new take on fast‑forward (single‑step) generative models that sidesteps two long‑standing hurdles in the original MeanFlow framework: an unstable training objective and a rigid guidance mechanism. By redefining the loss in terms of an instantaneous velocity field and making guidance a flexible conditioning input, iMF reaches a FID of 1.72 on ImageNet‑256 with just one function evaluation—matching or beating many multi‑step diffusion models while keeping the model size modest.
Key Contributions
- Re‑parameterized training objective: Switches from a network‑dependent loss to a clean regression on the instantaneous velocity (v), stabilizing training.
- Explicit, flexible guidance: Treats classifier‑free guidance scale as a conditioning variable rather than a fixed hyper‑parameter, enabling on‑the‑fly trade‑offs at inference.
- In‑context conditioning pipeline: Packs diverse conditioning signals (e.g., class labels, guidance scale) into a single context vector, reducing parameter count and improving performance (see the sketch after this list).
- State‑of‑the‑art single‑step results: Achieves 1.72 FID on ImageNet‑256×256 with 1‑NFE, closing the quality gap to multi‑step diffusion models without any distillation tricks.
- Fully trained from scratch: Demonstrates that fast‑forward models can be competitive without relying on pretrained diffusion checkpoints.
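To make the in‑context conditioning idea concrete, here is a minimal, hypothetical PyTorch sketch that packs a class label and a guidance scale into one context sequence. `ContextPacker`, its layer sizes, and the two‑token layout are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class ContextPacker(nn.Module):
    """Packs heterogeneous conditioning signals into one context sequence."""
    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        self.class_emb = nn.Embedding(num_classes, dim)   # class-label token
        self.gamma_mlp = nn.Sequential(                   # guidance-scale token
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim)
        )

    def forward(self, labels: torch.Tensor, gamma: torch.Tensor) -> torch.Tensor:
        # labels: (B,) int64 class ids; gamma: (B,) float guidance scales
        c_tok = self.class_emb(labels)                    # (B, dim)
        g_tok = self.gamma_mlp(gamma.unsqueeze(-1))       # (B, dim)
        return torch.stack([c_tok, g_tok], dim=1)         # (B, 2, dim)

packer = ContextPacker(num_classes=1000, dim=256)
ctx = packer(torch.randint(0, 1000, (4,)), torch.full((4,), 2.0))
print(ctx.shape)  # torch.Size([4, 2, 256])
```

Because every signal becomes a token in the same sequence, new conditioning inputs can be added without dedicated projection heads per signal.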
Methodology
- MeanFlow background – Traditional MeanFlow predicts an average velocity field (u) that, when integrated over a unit time step, yields a fast‑forward transformation from noise to data. The original formulation couples the loss to the network’s own output, making optimization noisy.
- Instantaneous velocity loss – iMF introduces a separate network that predicts the instantaneous velocity (v). The training objective becomes a straightforward mean‑squared error between the predicted (v) and the ground‑truth instantaneous velocity derived from the data distribution. This decouples the loss from the model’s own predictions and turns the problem into a standard regression task (see the training sketch after this list).
- Guidance as conditioning – Instead of fixing the classifier‑free guidance scale (γ) during training, iMF feeds γ (and any other side information such as class tokens) into the model as part of an in‑context conditioning vector. At inference time, developers can vary γ to trade off fidelity vs. diversity without retraining.
- Model architecture – The authors use a UNet‑style backbone similar to diffusion models, but the conditioning vector is injected via cross‑attention layers, allowing a single set of weights to handle many guidance settings.
- Training regime – The model is trained end‑to‑end on ImageNet‑256 with standard data augmentations, the Adam optimizer, and a cosine learning‑rate schedule. No teacher‑student distillation or multi‑step pre‑training is employed.
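The objective above amounts to standard flow‑matching regression. A minimal PyTorch sketch, assuming the common linear path x_t = (1 − t)·x + t·ε, whose instantaneous velocity is v = ε − x; the tiny MLP, batch size, and schedule length are placeholders rather than the paper's configuration:

```python
import torch
import torch.nn as nn

model = nn.Sequential(                       # stand-in for the UNet backbone
    nn.Linear(64 + 1, 256), nn.SiLU(), nn.Linear(256, 64)
)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=1_000)

for step in range(1_000):
    x = torch.randn(32, 64)                  # stand-in for (encoded) data
    eps = torch.randn_like(x)                # Gaussian noise endpoint
    t = torch.rand(32, 1)                    # random time in [0, 1]
    x_t = (1 - t) * x + t * eps              # point on the linear path
    v_target = eps - x                       # ground-truth instantaneous velocity
    v_pred = model(torch.cat([x_t, t], dim=-1))
    loss = (v_pred - v_target).pow(2).mean() # plain MSE; no network in the target
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()
```

The key property is that the regression target depends only on the sampled pair (x, ε), not on the network's own output, which is what makes the loss curves stable.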
Results & Findings
| Metric | iMF (1‑NFE) | Prior fast‑forward (original MF) | Multi‑step diffusion (≈10 NFE) |
|---|---|---|---|
| FID (ImageNet‑256) | 1.72 | >3.0 | 1.5–2.0 |
| Sampling time per image (GPU) | ~30 ms | ~30 ms | ~300 ms |
| Model size | ~300 M parameters | ~300 M | 500 M+ |
- Training stability improves dramatically; loss curves are smooth and converge faster than the original MF.
- Guidance flexibility: Varying γ at test time traces a smooth quality‑diversity curve, something the original MF, with its fixed training‑time guidance scale, could not provide (see the sketch after this list).
- No distillation needed: iMF matches the quality of diffusion models that rely on expensive teacher‑student pipelines, proving that a single‑step approach can stand on its own.
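As a rough illustration of the test‑time sweep referenced above, the snippet below assumes a trained average‑velocity network `u_theta(x, r, t, labels, gamma)` and MeanFlow's one‑step rule x = ε − u(ε, r=0, t=1); the call signature is an assumption for illustration, not the paper's API:

```python
import torch

@torch.no_grad()
def sample_one_step(u_theta, labels, gamma, shape=(3, 256, 256)):
    """One network evaluation per image: x = eps - u(eps, r=0, t=1)."""
    b = labels.shape[0]
    eps = torch.randn(b, *shape)                    # start from pure Gaussian noise
    r = torch.zeros(b)                              # integration start time
    t = torch.ones(b)                               # integration end time
    return eps - u_theta(eps, r, t, labels, gamma)  # single forward pass

# Sweeping gamma at inference traces the quality-diversity curve, no retraining:
# for g in (1.0, 1.5, 2.0, 3.0):
#     imgs = sample_one_step(u_theta, labels, torch.full((labels.shape[0],), g))
```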
Practical Implications
- Real‑time image generation: With a single network pass, developers can embed high‑fidelity generation into interactive apps (e.g., AI‑assisted design tools, game asset pipelines) without the latency penalties of multi‑step diffusion.
- Dynamic trade‑offs: Because guidance scale is a runtime input, services can expose a “quality slider” to end‑users, adjusting fidelity on the fly based on bandwidth or compute constraints.
- Reduced infrastructure cost: Fewer inference steps translate to lower GPU utilization, enabling cheaper cloud deployment or on‑device inference on high‑end mobile GPUs.
- Simplified training pipelines: Training from scratch eliminates the need for large pretrained diffusion checkpoints, making it easier for organizations to train domain‑specific fast‑forward models (e.g., medical imaging, satellite data).
- Compatibility with existing tooling: iMF’s UNet backbone and cross‑attention conditioning can be dropped into popular libraries (PyTorch, Diffusers) with minimal code changes; a sketch of such a conditioning block follows this list.
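As a sketch of how such conditioning slots into an existing stack, the block below injects a packed context sequence into feature tokens via standard cross‑attention. This is an assumed, minimal module built from stock PyTorch layers, not the authors' implementation:

```python
import torch
import torch.nn as nn

class CrossAttnCondition(nn.Module):
    """Injects a conditioning sequence into feature tokens via cross-attention."""
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, h: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # h: (B, N, dim) feature tokens; ctx: (B, M, dim) conditioning tokens
        q = self.norm(h)
        out, _ = self.attn(q, ctx, ctx)   # feature queries attend to the context
        return h + out                    # residual update; one set of weights

block = CrossAttnCondition(dim=256)
h = torch.randn(4, 196, 256)              # e.g., 14x14 feature-map tokens
ctx = torch.randn(4, 2, 256)              # class token + guidance token
print(block(h, ctx).shape)                # torch.Size([4, 196, 256])
```

Because the guidance scale arrives through the same context path as any other token, changing it at runtime requires no architectural change.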
Limitations & Future Work
- Scalability to higher resolutions: The paper reports results up to 256×256; extending to 512×512 or beyond may require architectural tweaks or more compute.
- Conditioning diversity: While class labels and guidance scale are supported, richer modalities (text, sketches) were not explored and could pose integration challenges.
- Theoretical guarantees: The reformulation improves empirical stability, but a formal analysis of convergence or optimality under the new loss remains open.
- Benchmark breadth: Evaluation focuses on ImageNet; testing on other domains (audio, video, 3‑D) would solidify the claim that fast‑forward modeling is a universal paradigm.
Overall, iMF pushes fast‑forward generative modeling from a research curiosity toward a practical tool that developers can adopt today.
Authors
- Zhengyang Geng
- Yiyang Lu
- Zongze Wu
- Eli Shechtman
- J. Zico Kolter
- Kaiming He
Paper Information
- arXiv ID: 2512.02012v1
- Categories: cs.CV, cs.LG
- Published: December 1, 2025