[Paper] One-step Latent-free Image Generation with Pixel Mean Flows

Published: January 29, 2026 at 01:59 PM EST
4 min read
Source: arXiv - 2601.22158v1

Overview

The paper introduces Pixel MeanFlow (pMF), a novel approach that generates high‑resolution images in a single forward pass without relying on latent representations. By decoupling the network’s output space from its loss space, pMF bridges two recent trends—one‑step sampling and latent‑free generation—delivering ImageNet‑level quality (FID ≈ 2.2 @ 256², 2.5 @ 512²) with dramatically reduced inference cost.

Key Contributions

  • One‑step, latent‑free generation: Produces photorealistic images in a single network evaluation, eliminating the multi‑step diffusion/flow pipeline.
  • MeanFlow loss formulation: Defines the training loss on the MeanFlow (average-velocity) field in pixel space, while the network itself predicts directly on the image manifold.
  • Simple image-velocity transformation: Provides a mathematically tractable mapping between pixel values and their average velocity field, enabling stable training (one plausible form is sketched after this list).
  • State‑of‑the‑art FID scores on ImageNet at 256×256 and 512×512 resolutions, matching or surpassing multi‑step diffusion baselines.
  • Scalable architecture: Compatible with existing convolutional and transformer backbones, making it easy to plug into current generative pipelines.
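
To make that mapping concrete, here is a minimal sketch of one plausible form of the image-to-velocity transformation, assuming the standard MeanFlow-style parameterization in which a noisy sample z_t interpolates between the image and Gaussian noise and the average velocity u satisfies z_r = z_t - (t - r) * u. The function names and the exact linear form are illustrative assumptions, not the paper's definitions.

```python
import torch

def image_to_velocity(x_pred, z_t, t, r=0.0, eps=1e-6):
    """Map a (predicted) clean image to the average velocity that would
    carry z_t back to that image over [r, t], assuming z_r = z_t - (t - r) * u."""
    return (z_t - x_pred) / (t - r + eps)

def velocity_to_image(u, z_t, t, r=0.0):
    """Inverse map: recover the image implied by an average velocity field."""
    return z_t - (t - r) * u
```

Under this assumed form the two maps are exact inverses of each other, which is the kind of bijective, easily differentiable relationship the bullet above refers to.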

Methodology

  1. Separate output & loss spaces

    • Output: The network is trained to predict the final image x (i.e., a point on the low‑dimensional image manifold).
    • Loss: Instead of a pixel‑wise L2 loss, the authors define a MeanFlow loss in the velocity space, measuring how the predicted image’s average pixel motion aligns with a ground‑truth flow field derived from the data distribution.
  2. MeanFlow transformation

    • For any image x, they compute an average velocity field v = M(x) that captures the direction and magnitude of pixel changes needed to reach x from a reference distribution.
    • The inverse mapping M⁻¹(v) reconstructs an image from its velocity field, ensuring a bijective relationship that keeps training stable.
  3. Training pipeline

    • Sample a random noise image z.
    • Pass z through the generator to obtain a candidate image x̂.
    • Compute v̂ = M(x̂) and compare it to the target velocity v* = M(x_real) using a simple L2 loss in velocity space.
    • Back‑propagate the loss to update the generator; no iterative refinement or latent encoder is needed (a simplified sketch of one such step follows this list).
  4. Network design

    • The authors use a standard UNet‑style backbone with attention blocks, but the core idea works with any architecture that can map noise to pixel space.
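
A hedged sketch of how such a training step could look in PyTorch, reusing the image_to_velocity helper sketched above. The generator net, the linear noise schedule, and the target construction are illustrative assumptions; in particular, the MeanFlow literature builds its velocity target with additional terms (e.g., a Jacobian-vector product under a stop-gradient), which this simplified sketch omits. The point is only to show the decoupling: the network outputs an image, while the L2 loss is taken in velocity space.

```python
import torch

def pmf_training_step(net, x_real, optimizer):
    """Illustrative training step: the generator predicts a clean image,
    but the loss compares average velocities rather than raw pixels."""
    b = x_real.shape[0]
    eps = torch.randn_like(x_real)                      # Gaussian noise endpoint
    t = torch.rand(b, device=x_real.device).clamp_min(1e-3)  # random times, avoiding t ~ 0
    t_ = t.view(b, 1, 1, 1)
    z_t = (1.0 - t_) * x_real + t_ * eps                # assumed linear interpolation path

    x_hat = net(z_t, t)                                 # network output lives in image space
    u_hat = image_to_velocity(x_hat, z_t, t_)           # map prediction into velocity space
    u_tgt = image_to_velocity(x_real, z_t, t_)          # target velocity implied by the real image

    loss = torch.mean((u_hat - u_tgt) ** 2)             # L2 loss in velocity space
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```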

Results & Findings

  Resolution   FID (lower is better)   Multi-step diffusion baseline
  256×256      2.22                    ~2.5
  512×512      2.48                    ~2.8
  • Speed: Generation time drops from ~1 s (50‑step diffusion) to <10 ms on a single GPU, a >100× speedup; a rough timing harness is sketched after this list.
  • Quality: Visual inspection shows crisp textures and faithful class semantics, comparable to state‑of‑the‑art diffusion models.
  • Stability: Training converges in ~300 k iterations, similar to conventional diffusion training, despite the radically different loss formulation.
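
The reported latency will vary with hardware and backbone, but a rough way to check single-pass latency on your own setup could look like the following; net is the hypothetical single-step generator assumed in the sketches above.

```python
import time
import torch

@torch.no_grad()
def time_one_step(net, image_shape=(3, 256, 256), n_warmup=5, n_runs=20, device="cuda"):
    """Crude per-image latency estimate for a single-forward-pass generator."""
    z = torch.randn(1, *image_shape, device=device)
    t = torch.ones(1, device=device)
    for _ in range(n_warmup):                       # warm up kernels and allocator
        net(z, t)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        net(z, t)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs   # seconds per generated image
```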

Practical Implications

  • Real‑time content creation: Developers can embed high‑quality image synthesis directly into interactive applications (e.g., game asset generation, UI mock‑ups) without waiting for multi‑step sampling.
  • Edge deployment: The one‑step nature reduces memory bandwidth and compute cycles, making it feasible to run on consumer GPUs, mobile SoCs, or even WebGPU environments.
  • Simplified pipelines: No need for separate latent encoders, scheduler designs, or sampling heuristics—just a single forward pass (see the inference sketch after this list). This lowers engineering overhead for SaaS platforms offering on‑demand image generation.
  • Cost reduction: Cloud inference costs drop dramatically when each request consumes milliseconds instead of seconds, enabling scalable APIs for generative services.
  • Foundation for downstream tasks: The velocity‑field perspective could be repurposed for image editing, style transfer, or video frame interpolation, where controlling pixel motion is valuable.
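
To illustrate the single-forward-pass deployment story, here is a minimal, assumed sampling routine: draw noise, run the network once at t = 1, and, because the network predicts on the image manifold, use its output directly as the generated image. Class conditioning, guidance, and the exact output range are omitted or assumed.

```python
import torch

@torch.no_grad()
def generate_one_step(net, num_images, image_shape=(3, 256, 256), device="cuda"):
    """Illustrative one-step sampling: one forward pass from pure noise to images."""
    z = torch.randn(num_images, *image_shape, device=device)   # t = 1: pure noise
    t = torch.ones(num_images, device=device)
    x = net(z, t)                                              # network predicts images directly
    return x.clamp(-1.0, 1.0)                                  # assumes pixels scaled to [-1, 1]
```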

Limitations & Future Work

  • Training data dependence: The MeanFlow mapping is learned from the training distribution; out‑of‑distribution inputs or classes may still suffer from mode collapse or artifacts.
  • Limited conditional control: The current formulation focuses on unconditional generation; extending pMF to text‑to‑image or class‑conditional settings requires additional conditioning mechanisms.
  • Theoretical guarantees: While the bijective mapping between image and velocity spaces works empirically, a rigorous analysis of its expressiveness and invertibility is left for future research.
  • Broader benchmarks: Experiments are confined to ImageNet; evaluating on domain‑specific datasets (e.g., medical imaging, satellite data) will test the method’s generality.

Overall, Pixel MeanFlow marks a significant step toward ultra‑fast, high‑fidelity generative models that can be readily adopted by developers building the next generation of AI‑powered visual tools.

Authors

  • Yiyang Lu
  • Susie Lu
  • Qiao Sun
  • Hanhong Zhao
  • Zhicheng Jiang
  • Xianbang Wang
  • Tianhong Li
  • Zhengyang Geng
  • Kaiming He

Paper Information

  • arXiv ID: 2601.22158v1
  • Categories: cs.CV
  • Published: January 29, 2026