[Paper] Improved Mean Flows: On the Challenges of Fastforward Generative Models
Source: arXiv - 2512.02012v1
Overview
The paper introduces Improved MeanFlow (iMF), a new take on fast‑forward (single‑step) generative models that sidesteps two long‑standing hurdles in the original MeanFlow framework: an unstable training objective and a rigid guidance mechanism. By redefining the loss in terms of an instantaneous velocity field and making guidance a flexible conditioning input, iMF reaches a FID of 1.72 on ImageNet‑256 with just one function evaluation—matching or beating many multi‑step diffusion models while keeping the model size modest.
Key Contributions
- Re‑parameterized training objective: Switches from a network‑dependent loss to a clean regression on the instantaneous velocity (v), stabilizing training.
- Explicit, flexible guidance: Treats classifier‑free guidance scale as a conditioning variable rather than a fixed hyper‑parameter, enabling on‑the‑fly trade‑offs at inference.
- In‑context conditioning pipeline: Packs diverse conditioning signals (e.g., class labels, guidance scale) into a single context vector, reducing parameter count and improving performance (see the sketch after this list).
- State‑of‑the‑art single‑step results: Achieves 1.72 FID on ImageNet‑256×256 with 1‑NFE, closing the quality gap to multi‑step diffusion models without any distillation tricks.
- Fully trained from scratch: Demonstrates that fast‑forward models can be competitive without relying on pretrained diffusion checkpoints.
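To make the in‑context conditioning idea concrete, here is a minimal, hypothetical PyTorch sketch that packs a class label and a guidance scale into one context sequence. `ContextPacker`, its layer sizes, and the two‑token layout are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class ContextPacker(nn.Module):
    """Packs heterogeneous conditioning signals into one context sequence."""
    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        self.class_emb = nn.Embedding(num_classes, dim)   # class-label token
        self.gamma_mlp = nn.Sequential(                   # guidance-scale token
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim)
        )

    def forward(self, labels: torch.Tensor, gamma: torch.Tensor) -> torch.Tensor:
        # labels: (B,) int64 class ids; gamma: (B,) float guidance scales
        c_tok = self.class_emb(labels)                    # (B, dim)
        g_tok = self.gamma_mlp(gamma.unsqueeze(-1))       # (B, dim)
        return torch.stack([c_tok, g_tok], dim=1)         # (B, 2, dim)

packer = ContextPacker(num_classes=1000, dim=256)
ctx = packer(torch.randint(0, 1000, (4,)), torch.full((4,), 2.0))
print(ctx.shape)  # torch.Size([4, 2, 256])
```

Because every signal becomes a token in the same sequence, new conditioning inputs can be added without dedicated projection heads per signal.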
Methodology
- MeanFlow background – Traditional MeanFlow predicts an average velocity field (u) that, when integrated over a unit time step, yields a fast‑forward transformation from noise to data. The original formulation couples the loss to the network’s own output, making optimization noisy.
- Instantaneous velocity loss – iMF introduces a separate network that predicts the instantaneous velocity (v). The training objective becomes a straightforward mean‑squared error between the predicted (v) and the ground‑truth instantaneous velocity derived from the data distribution. This decouples the loss from the model’s own predictions and turns the problem into a standard regression task (see the training sketch after this list).
- Guidance as conditioning – Instead of fixing the classifier‑free guidance scale (γ) during training, iMF feeds γ (and any other side information such as class tokens) into the model as part of an in‑context conditioning vector. At inference time, developers can vary γ to trade off fidelity vs. diversity without retraining.
- Model architecture – The authors use a UNet‑style backbone similar to diffusion models, but the conditioning vector is injected via cross‑attention layers, allowing a single set of weights to handle many guidance settings.
- Training regime – The model is trained end‑to‑end on ImageNet‑256 with standard data augmentations, the Adam optimizer, and a cosine learning‑rate schedule. No teacher‑student distillation or multi‑step pre‑training is employed.
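The objective above amounts to standard flow‑matching regression. A minimal PyTorch sketch, assuming the common linear path x_t = (1 − t)·x + t·ε, whose instantaneous velocity is v = ε − x; the tiny MLP, batch size, and schedule length are placeholders rather than the paper's configuration:

```python
import torch
import torch.nn as nn

model = nn.Sequential(                       # stand-in for the UNet backbone
    nn.Linear(64 + 1, 256), nn.SiLU(), nn.Linear(256, 64)
)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=1_000)

for step in range(1_000):
    x = torch.randn(32, 64)                  # stand-in for (encoded) data
    eps = torch.randn_like(x)                # Gaussian noise endpoint
    t = torch.rand(32, 1)                    # random time in [0, 1]
    x_t = (1 - t) * x + t * eps              # point on the linear path
    v_target = eps - x                       # ground-truth instantaneous velocity
    v_pred = model(torch.cat([x_t, t], dim=-1))
    loss = (v_pred - v_target).pow(2).mean() # plain MSE; no network in the target
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()
```

The key property is that the regression target depends only on the sampled pair (x, ε), not on the network's own output, which is what makes the loss curves stable.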
Results & Findings
| Metric | iMF (1‑NFE) | Prior fast‑forward (original MF) | Multi‑step diffusion (≈10 NFE) |
|---|---|---|---|
| FID (ImageNet‑256) | 1.72 | >3.0 | 1.5–2.0 |
| Sampling time per image (GPU) | ~30 ms | ~30 ms | ~300 ms |
| Model size | ~300 M parameters | ~300 M | 500 M+ |
- Training stability improves dramatically; loss curves are smooth and converge faster than the original MF.
- Guidance flexibility: Varying γ at test time traces a smooth quality‑diversity curve, something the original MF, with its fixed training‑time guidance scale, could not provide (see the sketch after this list).
- No distillation needed: iMF matches the quality of diffusion models that rely on expensive teacher‑student pipelines, proving that a single‑step approach can stand on its own.
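As a rough illustration of the test‑time sweep referenced above, the snippet below assumes a trained average‑velocity network `u_theta(x, r, t, labels, gamma)` and MeanFlow's one‑step rule x = ε − u(ε, r=0, t=1); the call signature is an assumption for illustration, not the paper's API:

```python
import torch

@torch.no_grad()
def sample_one_step(u_theta, labels, gamma, shape=(3, 256, 256)):
    """One network evaluation per image: x = eps - u(eps, r=0, t=1)."""
    b = labels.shape[0]
    eps = torch.randn(b, *shape)                    # start from pure Gaussian noise
    r = torch.zeros(b)                              # integration start time
    t = torch.ones(b)                               # integration end time
    return eps - u_theta(eps, r, t, labels, gamma)  # single forward pass

# Sweeping gamma at inference traces the quality-diversity curve, no retraining:
# for g in (1.0, 1.5, 2.0, 3.0):
#     imgs = sample_one_step(u_theta, labels, torch.full((labels.shape[0],), g))
```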
Practical Implications
- Real‑time image generation: With a single network pass, developers can embed high‑fidelity generation into interactive apps (e.g., AI‑assisted design tools, game asset pipelines) without the latency penalties of multi‑step diffusion.
- Dynamic trade‑offs: Because guidance scale is a runtime input, services can expose a “quality slider” to end‑users, adjusting fidelity on the fly based on bandwidth or compute constraints.
- Reduced infrastructure cost: Fewer inference steps translate to lower GPU utilization, enabling cheaper cloud deployment or on‑device inference on high‑end mobile GPUs.
- Simplified training pipelines: Training from scratch eliminates the need for large pretrained diffusion checkpoints, making it easier for organizations to train domain‑specific fast‑forward models (e.g., medical imaging, satellite data).
- Compatibility with existing tooling: iMF’s UNet backbone and cross‑attention conditioning can be dropped into popular libraries (PyTorch, Diffusers) with minimal code changes; a sketch of such a conditioning block follows this list.
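As a sketch of how such conditioning slots into an existing stack, the block below injects a packed context sequence into feature tokens via standard cross‑attention. This is an assumed, minimal module built from stock PyTorch layers, not the authors' implementation:

```python
import torch
import torch.nn as nn

class CrossAttnCondition(nn.Module):
    """Injects a conditioning sequence into feature tokens via cross-attention."""
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, h: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # h: (B, N, dim) feature tokens; ctx: (B, M, dim) conditioning tokens
        q = self.norm(h)
        out, _ = self.attn(q, ctx, ctx)   # feature queries attend to the context
        return h + out                    # residual update; one set of weights

block = CrossAttnCondition(dim=256)
h = torch.randn(4, 196, 256)              # e.g., 14x14 feature-map tokens
ctx = torch.randn(4, 2, 256)              # class token + guidance token
print(block(h, ctx).shape)                # torch.Size([4, 196, 256])
```

Because the guidance scale arrives through the same context path as any other token, changing it at runtime requires no architectural change.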
Limitations & Future Work
- Scalability to higher resolutions: The paper reports results up to 256×256; extending to 512×512 or beyond may require architectural tweaks or more compute.
- Conditioning diversity: While class labels and guidance scale are supported, richer modalities (text, sketches) were not explored and could pose integration challenges.
- Theoretical guarantees: The reformulation improves empirical stability, but a formal analysis of convergence or optimality under the new loss remains open.
- Benchmark breadth: Evaluation focuses on ImageNet; testing on other domains (audio, video, 3‑D) would solidify the claim that fast‑forward modeling is a universal paradigm.
Overall, iMF pushes fast‑forward generative modeling from a research curiosity toward a practical tool that developers can adopt today.
Authors
- Zhengyang Geng
- Yiyang Lu
- Zongze Wu
- Eli Shechtman
- J. Zico Kolter
- Kaiming He
Paper Information
- arXiv ID: 2512.02012v1
- Categories: cs.CV, cs.LG
- Published: December 1, 2025