[Paper] One-step Latent-free Image Generation with Pixel Mean Flows
Source: arXiv - 2601.22158v1
Overview
The paper introduces Pixel MeanFlow (pMF), a novel approach that generates high‑resolution images in a single forward pass without relying on latent representations. By decoupling the network’s output space from its loss space, pMF bridges two recent trends—one‑step sampling and latent‑free generation—delivering ImageNet‑level quality (FID ≈ 2.2 @ 256², 2.5 @ 512²) with dramatically reduced inference cost.
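To make the cost difference concrete, here is a minimal sketch contrasting a conventional multi-step latent-diffusion sampler with pMF's single forward pass. Everything in it is illustrative: `latent_denoiser`, `vae_decoder`, `pmf_generator`, and the 50-step schedule are placeholder names and values, not interfaces from the paper.

```python
import torch

@torch.no_grad()
def sample_latent_diffusion(latent_denoiser, vae_decoder, steps=50):
    """Conventional pipeline: many denoising steps in latent space, then a decode."""
    z = torch.randn(1, 4, 32, 32)              # latent-space noise
    for t in torch.linspace(1.0, 0.0, steps):  # iterative refinement
        z = latent_denoiser(z, t)              # one network call per step
    return vae_decoder(z)                      # extra decode back to pixel space

@torch.no_grad()
def sample_pmf(pmf_generator):
    """Pixel MeanFlow: one network call, directly in pixel space, no decoder."""
    z = torch.randn(1, 3, 256, 256)            # pixel-space noise
    return pmf_generator(z)                    # single forward pass
```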
Key Contributions
- One‑step, latent‑free generation: Produces photorealistic images in a single network evaluation, eliminating the multi‑step diffusion/flow pipeline.
- MeanFlow loss formulation: Introduces a velocity‑field‑based loss that operates on the MeanFlow of pixel intensities, while the network predicts directly on the image manifold.
- Simple image‑velocity transformation: Provides a mathematically tractable mapping between pixel values and their average velocity field, enabling stable training.
- State‑of‑the‑art FID scores on ImageNet at 256×256 and 512×512 resolutions, matching or surpassing multi‑step diffusion baselines.
- Scalable architecture: Compatible with existing convolutional and transformer backbones, making it easy to plug into current generative pipelines.
Methodology
- Separate output and loss spaces
  - Output: The network is trained to predict the final image x, i.e., the point on the low‑dimensional image manifold.
  - Loss: Instead of a pixel‑wise L2 loss, the authors define a MeanFlow loss in the velocity space, measuring how the predicted image's average pixel motion aligns with a ground‑truth flow field derived from the data distribution.
- MeanFlow transformation
  - For any image x, they compute an average velocity field v = M(x) that captures the direction and magnitude of pixel changes needed to reach x from a reference distribution.
  - The inverse mapping M⁻¹(v) reconstructs an image from its velocity field, ensuring a bijective relationship that keeps training stable.
- Training pipeline (a structural sketch of this loop follows the list)
  - Sample a random noise image z.
  - Pass z through the generator to obtain a candidate image x̂.
  - Compute v̂ = M(x̂) and compare it to the target velocity v* = M(x_real) using a simple L2 loss in velocity space.
  - Back‑propagate the loss to update the generator; no iterative refinement or latent encoder is needed.
- Network design
  - The authors use a standard UNet‑style backbone with attention blocks, but the core idea works with any architecture that can map noise to pixel space.
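The sketch below ties the MeanFlow transformation and the training pipeline together as summarized above. It assumes the simplest conceivable image-velocity mapping, a straight-line displacement from the reference noise z to the image, so that M(x) = x − z and M⁻¹(v) = z + v; `generator`, `optimizer`, and the helper names are placeholders, and the paper's actual transformation and conditioning details may differ.

```python
import torch
import torch.nn.functional as F

def mean_velocity(x, z):
    """M(x): average velocity carrying the reference sample z to the image x.
    Assumes a straight-line displacement path (illustrative, not the paper's exact map)."""
    return x - z

def image_from_velocity(v, z):
    """M^{-1}(v): recover the image from its average velocity field."""
    return z + v

def pmf_training_step(generator, optimizer, x_real):
    """One training iteration following the pipeline summarized above."""
    z = torch.randn_like(x_real)        # 1. sample a random noise image
    x_hat = generator(z)                # 2. single forward pass -> candidate image
    v_hat = mean_velocity(x_hat, z)     # 3. predicted average velocity
    v_star = mean_velocity(x_real, z)   #    target average velocity
    loss = F.mse_loss(v_hat, v_star)    # 4. simple L2 loss in velocity space
    optimizer.zero_grad()
    loss.backward()                     # 5. update the generator; no latent encoder,
    optimizer.step()                    #    scheduler, or iterative refinement
    return loss.item()
```

Note that under this toy straight-line mapping the velocity-space loss collapses to a pixel-space L2 loss; the paper's MeanFlow transformation is precisely what makes the separation of output space and loss space non-trivial, so treat this only as a structural illustration of the training loop.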
Results & Findings
| Resolution | pMF FID (lower is better) | Multi‑step diffusion baseline FID |
|---|---|---|
| 256×256 | 2.22 | ~2.5 |
| 512×512 | 2.48 | ~2.8 |
- Speed: Generation time drops from ~1 s (50‑step diffusion) to <10 ms on a single GPU, a >100× speedup.
- Quality: Visual inspection shows crisp textures and faithful class semantics, comparable to state‑of‑the‑art diffusion models.
- Stability: Training converges in ~300 k iterations, similar to conventional diffusion training, despite the radically different loss formulation.
Practical Implications
- Real‑time content creation: Developers can embed high‑quality image synthesis directly into interactive applications (e.g., game asset generation, UI mock‑ups) without waiting for multi‑step sampling.
- Edge deployment: The one‑step nature reduces memory bandwidth and compute cycles, making it feasible to run on consumer GPUs, mobile SoCs, or even WebGPU environments.
- Simplified pipelines: No need for separate latent encoders, scheduler designs, or sampling heuristics—just a single forward pass. This lowers engineering overhead for SaaS platforms offering on‑demand image generation.
- Cost reduction: Cloud inference costs drop dramatically when each request consumes milliseconds instead of seconds, enabling scalable APIs for generative services.
- Foundation for downstream tasks: The velocity‑field perspective could be repurposed for image editing, style transfer, or video frame interpolation, where controlling pixel motion is valuable.
Limitations & Future Work
- Training data dependence: The MeanFlow mapping is learned from the training distribution; out‑of‑distribution prompts may still suffer from mode collapse or artifacts.
- Limited conditional control: The current formulation focuses on unconditional generation; extending pMF to text‑to‑image or class‑conditional settings requires additional conditioning mechanisms.
- Theoretical guarantees: While the bijective mapping between image and velocity spaces works empirically, a rigorous analysis of its expressiveness and invertibility is left for future research.
- Broader benchmarks: Experiments are confined to ImageNet; evaluating on domain‑specific datasets (e.g., medical imaging, satellite data) will test the method’s generality.
Overall, Pixel MeanFlow marks a significant step toward ultra‑fast, high‑fidelity generative models that can be readily adopted by developers building the next generation of AI‑powered visual tools.
Authors
- Yiyang Lu
- Susie Lu
- Qiao Sun
- Hanhong Zhao
- Zhicheng Jiang
- Xianbang Wang
- Tianhong Li
- Zhengyang Geng
- Kaiming He
Paper Information
- arXiv ID: 2601.22158v1
- Categories: cs.CV
- Published: January 29, 2026