[Paper] Bidirectional Normalizing Flow: From Data to Noise and Back

Published: December 11, 2025 at 01:59 PM EST
4 min read
Source: arXiv


Overview

The paper “Bidirectional Normalizing Flow: From Data to Noise and Back” proposes a new way to train normalizing‑flow (NF) generative models that discards the long‑standing requirement of an exact analytic inverse. By learning a reverse model that approximates the noise‑to‑data mapping, the authors achieve higher image quality and dramatically faster sampling—up to 100× speed‑ups on ImageNet—while keeping the training pipeline simple and flexible.

Key Contributions

  • Bidirectional Normalizing Flow (BiFlow): Introduces a framework where the forward (data→noise) and reverse (noise→data) directions are trained separately, allowing the reverse to be an approximate, learned model rather than a strict analytic inverse.
  • Flexible loss design: Removes the need for exact Jacobian determinants in the reverse pass, enabling richer objective functions (e.g., hybrid reconstruction + adversarial terms).
  • Architectural freedom: Supports modern Transformer‑based and autoregressive components without being hamstrung by causal decoding bottlenecks that plague prior NF variants (e.g., TARFlow).
  • Empirical breakthrough: On ImageNet‑64, BiFlow attains state‑of‑the‑art scores among NF‑based generators and matches or exceeds many single‑evaluation (“1‑NFE”) methods while sampling up to two orders of magnitude faster.
  • Open‑source implementation: The authors release code and pretrained checkpoints, facilitating reproducibility and downstream adoption.

Methodology

  1. Forward Flow (Encoder‑like) – A conventional invertible network f_θ maps an image x to a latent code z = f_θ(x). This part still respects exact invertibility, so the likelihood can be computed via the change‑of‑variables formula log p_X(x) = log p_Z(f_θ(x)) + log |det J_{f_θ}(x)|.
  2. Bidirectional Reverse Model (Decoder‑like) – Instead of using the exact inverse f_θ⁻¹, a separate neural network g_φ is trained to map latent noise z back to data space. The loss for g_φ combines:
    • A reconstruction loss ‖g_φ(z) − x‖₂ (or a perceptual loss) to encourage fidelity.
    • Adversarial or score‑matching terms to sharpen visual quality.
    • KL regularization to keep the latent distribution close to a simple prior (e.g., Gaussian).
  3. Joint Training – The forward and reverse models are optimized together but with independent objectives, allowing each to specialize: the forward flow for exact density estimation, the reverse model for high‑quality synthesis.
  4. Sampling – Generation samples a latent z ~ N(0, I) and feeds it through the learned decoder g_φ. No iterative inversion or autoregressive decoding is required, which explains the large speed gains.
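The pipeline above can be sketched with a toy 1‑D example (the scalar affine flow, the least‑squares decoder, and all names here are illustrative stand‑ins, not the paper's architecture). The forward flow is an invertible affine map z = a·x + b, whose exact log‑likelihood follows from the change‑of‑variables formula; the reverse model is a separately parameterized decoder fit by reconstruction, standing in for the learned g_φ:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Forward flow f_theta: an invertible 1-D affine map z = a*x + b ---
a, b = 2.0, -1.0  # toy "trained" flow parameters (illustrative)

def forward_flow(x):
    return a * x + b

def exact_log_likelihood(x):
    # Change of variables: log p_X(x) = log N(f(x); 0, 1) + log|det J|,
    # where the Jacobian of an affine map is the constant a.
    z = forward_flow(x)
    log_prior = -0.5 * (z**2 + np.log(2 * np.pi))
    return log_prior + np.log(abs(a))

# --- Reverse model g_phi: a separate affine decoder, learned, not f^{-1} ---
x_data = rng.normal(size=1000)
z_data = forward_flow(x_data)

# Fit g_phi(z) = w*z + c by least squares on (z, x) pairs
# (a stand-in for training with a reconstruction loss).
A = np.stack([z_data, np.ones_like(z_data)], axis=1)
w, c = np.linalg.lstsq(A, x_data, rcond=None)[0]

def reverse_model(z):
    return w * z + c

# --- Sampling: one feed-forward pass through g_phi, no inversion needed ---
z_sample = rng.normal(size=5)
x_sample = reverse_model(z_sample)

# In this linear, noiseless toy the learned decoder happens to recover the
# exact inverse x = (z - b)/a; in BiFlow it is only an approximation.
assert np.allclose(w, 1 / a) and np.allclose(c, -b / a)
print(x_sample)
```

In the real model the reconstruction objective is augmented with the adversarial/score‑matching and KL terms listed above, but the division of labor is the same: exact likelihood from the forward flow, fast one‑pass synthesis from the decoder.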

Results & Findings

| Dataset | Metric (FID) | Sampling speed (samples/sec) | Comparison |
| --- | --- | --- | --- |
| ImageNet‑64 | ~9.2 (state‑of‑the‑art among NFs) | ≈ 200 (≈ 100× faster than TARFlow) | Beats prior NF baselines; on par with 1‑NFE GANs |
| CIFAR‑10 | ~3.1 | ≈ 1,000 | Competitive with diffusion models that need many steps |
  • Quality: Visual inspection shows sharper textures and fewer artifacts than causal‑decoding NF variants.
  • Speed: Because the reverse model is a feed‑forward network, sampling is essentially a single forward pass, eliminating the sequential decoding bottleneck.
  • Ablation: Removing the adversarial term degrades FID by ~0.8, confirming the benefit of hybrid losses. Using an exact inverse (instead of learned) reduces speed dramatically without improving quality.
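The speed gap comes from the number of sequential network evaluations per sample. A minimal sketch (dimensions and the placeholder `net` function are illustrative, not the paper's models) counts the dependent calls in causal decoding versus a single feed‑forward pass:

```python
# Toy comparison of sampling cost: causal decoding vs. feed-forward decoding.
# `net` stands in for one pass through a decoder network (illustrative).
calls = 0

def net(v):
    global calls
    calls += 1
    return [x * 0.5 for x in v]  # placeholder computation

D = 256  # toy number of tokens/dimensions

def causal_sample(z):
    # Autoregressive decoding: dimension i depends on outputs 0..i-1,
    # so the network must run D times in sequence.
    out = []
    for i in range(D):
        out.append(net(z[: i + 1] + out)[0])
    return out

def feedforward_sample(z):
    # BiFlow-style reverse model: one pass produces all dimensions at once.
    return net(z)

z = [0.1] * D

calls = 0
causal_sample(z)
sequential_calls = calls

calls = 0
feedforward_sample(z)
feedforward_calls = calls

print(sequential_calls, feedforward_calls)  # 256 vs. 1
```

With D sequential, mutually dependent calls collapsed to one, a roughly D× latency reduction is expected, which matches the order of magnitude of the reported ≈100× speed‑up over TARFlow.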

Practical Implications

  • Fast high‑fidelity generation: Developers can now deploy NF‑based generators in latency‑sensitive settings (e.g., real‑time image synthesis, data augmentation pipelines) where diffusion models are too slow.
  • Modular architecture: Since the reverse model is independent, teams can experiment with different decoder designs (e.g., Vision Transformers, ConvNets) without re‑engineering the forward flow.
  • Hybrid systems: BiFlow can be combined with downstream tasks such as conditional generation, compression, or inverse graphics, leveraging the exact likelihood from the forward flow for probabilistic reasoning while using the fast decoder for rendering.
  • Lower compute budget: The two‑order‑of‑magnitude speedup translates to reduced GPU hours for inference, making NF models more cost‑effective for production workloads.

Limitations & Future Work

  • Exact likelihood vs. approximate reverse: While the forward flow still yields exact densities, the reverse model is only approximate, which may limit theoretical guarantees for tasks that require precise invertibility (e.g., exact posterior sampling).
  • Training complexity: Jointly optimizing two networks with heterogeneous losses can be sensitive to hyper‑parameter choices; the paper notes occasional instability when scaling to higher resolutions.
  • Extension to conditional settings: The current work focuses on unconditional image generation; applying BiFlow to text‑to‑image or class‑conditional generation remains an open avenue.
  • Further architectural exploration: The authors suggest investigating more expressive priors (e.g., hierarchical latents) and integrating BiFlow with recent score‑based diffusion techniques to combine the strengths of both paradigms.

Authors

  • Yiyang Lu
  • Qiao Sun
  • Xianbang Wang
  • Zhicheng Jiang
  • Hanhong Zhao
  • Kaiming He

Paper Information

  • arXiv ID: 2512.10953v1
  • Categories: cs.LG, cs.CV
  • Published: December 11, 2025
