[Paper] One-step Latent-free Image Generation with Pixel Mean Flows
Source: arXiv - 2601.22158v1
Overview
The paper introduces Pixel MeanFlow (pMF), a novel approach that generates high‑resolution images in a single forward pass without relying on latent representations. By decoupling the network’s output space from its loss space, pMF bridges two recent trends—one‑step sampling and latent‑free generation—delivering ImageNet‑level quality (FID ≈ 2.2 @ 256², 2.5 @ 512²) with dramatically reduced inference cost.
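To make the cost difference concrete, here is a minimal sketch contrasting a conventional multi-step latent-diffusion sampler with pMF's single forward pass. Everything in it is illustrative: `latent_denoiser`, `vae_decoder`, `pmf_generator`, and the 50-step schedule are placeholder names and values, not interfaces from the paper.

```python
import torch

@torch.no_grad()
def sample_latent_diffusion(latent_denoiser, vae_decoder, steps=50):
    """Conventional pipeline: many denoising steps in latent space, then a decode."""
    z = torch.randn(1, 4, 32, 32)              # latent-space noise
    for t in torch.linspace(1.0, 0.0, steps):  # iterative refinement
        z = latent_denoiser(z, t)              # one network call per step
    return vae_decoder(z)                      # extra decode back to pixel space

@torch.no_grad()
def sample_pmf(pmf_generator):
    """Pixel MeanFlow: one network call, directly in pixel space, no decoder."""
    z = torch.randn(1, 3, 256, 256)            # pixel-space noise
    return pmf_generator(z)                    # single forward pass
```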
Key Contributions
- One‑step, latent‑free generation: Produces photorealistic images in a single network evaluation, eliminating the multi‑step diffusion/flow pipeline.
- MeanFlow loss formulation: Introduces a velocity‑field‑based loss that operates on the MeanFlow of pixel intensities, while the network predicts directly on the image manifold.
- Simple image‑velocity transformation: Provides a mathematically tractable mapping between pixel values and their average velocity field, enabling stable training.
- State‑of‑the‑art FID scores on ImageNet at 256×256 and 512×512 resolutions, matching or surpassing multi‑step diffusion baselines.
- Scalable architecture: Compatible with existing convolutional and transformer backbones, making it easy to plug into current generative pipelines.
Methodology
- Separate output and loss spaces
  - Output: The network is trained to predict the final image x, i.e., the point on the low‑dimensional image manifold.
  - Loss: Instead of a pixel‑wise L2 loss, the authors define a MeanFlow loss in the velocity space, measuring how the predicted image's average pixel motion aligns with a ground‑truth flow field derived from the data distribution.
- MeanFlow transformation
  - For any image x, they compute an average velocity field v = M(x) that captures the direction and magnitude of pixel changes needed to reach x from a reference distribution.
  - The inverse mapping M⁻¹(v) reconstructs an image from its velocity field, ensuring a bijective relationship that keeps training stable.
- Training pipeline (a structural sketch of this loop follows the list)
  - Sample a random noise image z.
  - Pass z through the generator to obtain a candidate image x̂.
  - Compute v̂ = M(x̂) and compare it to the target velocity v* = M(x_real) using a simple L2 loss in velocity space.
  - Back‑propagate the loss to update the generator; no iterative refinement or latent encoder is needed.
- Network design
  - The authors use a standard UNet‑style backbone with attention blocks, but the core idea works with any architecture that can map noise to pixel space.
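The sketch below ties the MeanFlow transformation and the training pipeline together as summarized above. It assumes the simplest conceivable image-velocity mapping, a straight-line displacement from the reference noise z to the image, so that M(x) = x − z and M⁻¹(v) = z + v; `generator`, `optimizer`, and the helper names are placeholders, and the paper's actual transformation and conditioning details may differ.

```python
import torch
import torch.nn.functional as F

def mean_velocity(x, z):
    """M(x): average velocity carrying the reference sample z to the image x.
    Assumes a straight-line displacement path (illustrative, not the paper's exact map)."""
    return x - z

def image_from_velocity(v, z):
    """M^{-1}(v): recover the image from its average velocity field."""
    return z + v

def pmf_training_step(generator, optimizer, x_real):
    """One training iteration following the pipeline summarized above."""
    z = torch.randn_like(x_real)        # 1. sample a random noise image
    x_hat = generator(z)                # 2. single forward pass -> candidate image
    v_hat = mean_velocity(x_hat, z)     # 3. predicted average velocity
    v_star = mean_velocity(x_real, z)   #    target average velocity
    loss = F.mse_loss(v_hat, v_star)    # 4. simple L2 loss in velocity space
    optimizer.zero_grad()
    loss.backward()                     # 5. update the generator; no latent encoder,
    optimizer.step()                    #    scheduler, or iterative refinement
    return loss.item()
```

Note that under this toy straight-line mapping the velocity-space loss collapses to a pixel-space L2 loss; the paper's MeanFlow transformation is precisely what makes the separation of output space and loss space non-trivial, so treat this only as a structural illustration of the training loop.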
Results & Findings
| Resolution | pMF FID (lower is better) | Multi‑step diffusion baseline FID |
|---|---|---|
| 256×256 | 2.22 | ~2.5 |
| 512×512 | 2.48 | ~2.8 |
- Speed: Generation time drops from ~1 s (50‑step diffusion) to <10 ms on a single GPU, a >100× speedup.
- Quality: Visual inspection shows crisp textures and faithful class semantics, comparable to state‑of‑the‑art diffusion models.
- Stability: Training converges in ~300 k iterations, similar to conventional diffusion training, despite the radically different loss formulation.
Practical Implications
- Real‑time content creation: Developers can embed high‑quality image synthesis directly into interactive applications (e.g., game asset generation, UI mock‑ups) without waiting for multi‑step sampling.
- Edge deployment: The one‑step nature reduces memory bandwidth and compute cycles, making it feasible to run on consumer GPUs, mobile SoCs, or even WebGPU environments.
- Simplified pipelines: No need for separate latent encoders, scheduler designs, or sampling heuristics—just a single forward pass. This lowers engineering overhead for SaaS platforms offering on‑demand image generation.
- Cost reduction: Cloud inference costs drop dramatically when each request consumes milliseconds instead of seconds, enabling scalable APIs for generative services.
- Foundation for downstream tasks: The velocity‑field perspective could be repurposed for image editing, style transfer, or video frame interpolation, where controlling pixel motion is valuable.
Limitations & Future Work
- Training data dependence: The MeanFlow mapping is learned from the training distribution; out‑of‑distribution prompts may still suffer from mode collapse or artifacts.
- Limited conditional control: The current formulation focuses on unconditional generation; extending pMF to text‑to‑image or class‑conditional settings requires additional conditioning mechanisms.
- Theoretical guarantees: While the bijective mapping between image and velocity spaces works empirically, a rigorous analysis of its expressiveness and invertibility is left for future research.
- Broader benchmarks: Experiments are confined to ImageNet; evaluating on domain‑specific datasets (e.g., medical imaging, satellite data) will test the method’s generality.
Overall, Pixel MeanFlow marks a significant step toward ultra‑fast, high‑fidelity generative models that can be readily adopted by developers building the next generation of AI‑powered visual tools.
Authors
- Yiyang Lu
- Susie Lu
- Qiao Sun
- Hanhong Zhao
- Zhicheng Jiang
- Xianbang Wang
- Tianhong Li
- Zhengyang Geng
- Kaiming He
Paper Information
- arXiv ID: 2601.22158v1
- Categories: cs.CV
- Published: January 29, 2026