[Paper] NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation

Published: December 4, 2025 at 01:59 PM EST
5 min read
Source: arXiv - 2512.05106v1

Overview

The paper introduces Phase‑Preserving Diffusion (φ‑PD), a simple yet powerful tweak to the diffusion‑based generative pipeline that keeps the phase (spatial layout) of the input image intact while still randomising the magnitude of its frequency components. By doing so, the model can generate new content that stays perfectly aligned with the original geometry—something standard diffusion struggles with because Gaussian noise scrambles both magnitude and phase. The authors demonstrate that φ‑PD works out‑of‑the‑box with any existing image or video diffusion model and can be tuned with a single frequency‑cutoff knob to trade off structural rigidity versus creative freedom.
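
As a quick intuition for why phase encodes layout, the classic Fourier swap experiment below (a minimal NumPy sketch, not taken from the paper) keeps one image's phase while borrowing another's magnitude; the reconstruction inherits the geometry of the phase donor.

```python
import numpy as np

# A structured image (a bright square) and an unstructured "texture" image.
a = np.zeros((64, 64)); a[20:44, 20:44] = 1.0
b = np.random.rand(64, 64)

A, B = np.fft.fft2(a), np.fft.fft2(b)

# Keep a's PHASE, borrow b's MAGNITUDE: the square's position and edges
# survive in the hybrid, because spatial layout lives in the phase spectrum.
hybrid = np.fft.ifft2(np.abs(B) * np.exp(1j * np.angle(A))).real
```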

Key Contributions

  • Phase‑Preserving Diffusion (φ‑PD): A model‑agnostic reformulation of the forward diffusion process that preserves the Fourier phase of the conditioning signal while randomising only the magnitude.
  • Frequency‑Selective Structured (FSS) noise: A single‑parameter (frequency‑cutoff) noise schedule that lets practitioners continuously control how tightly the generated output follows the input structure.
  • Zero inference overhead: φ‑PD adds no extra parameters or runtime cost; it can be dropped into any pretrained diffusion model (image or video) without modifying the architecture.
  • Broad applicability: Demonstrated on photorealistic and stylized image re‑rendering, image‑to‑image translation, video‑to‑video translation, and simulation‑to‑real (sim‑to‑real) enhancement for autonomous‑driving planners.
  • Significant downstream impact: When applied to the CARLA driving simulator, φ‑PD improves a CARLA‑to‑Waymo planner’s success rate by ~50 %, highlighting real‑world utility beyond visual quality.

Methodology

  1. Fourier Decomposition

    • Each input (image or video frame) is transformed into the frequency domain using a Fast Fourier Transform (FFT).
    • The representation is split into magnitude (how strong each frequency is) and phase (the spatial arrangement of those frequencies).
  2. Phase‑Preserving Corruption

    • Traditional diffusion adds isotropic Gaussian noise to both magnitude and phase, which destroys geometry.
    • φ‑PD instead adds structured noise only to the magnitude while leaving the phase untouched. This is achieved by sampling a noise tensor, applying a frequency‑selective mask (the FSS mask), and mixing it with the original magnitude according to a schedule that mirrors the standard diffusion timesteps (see the code sketch after this list).
  3. Frequency‑Selective Structured (FSS) Noise

    • A low‑pass/high‑pass filter defined by a single cutoff frequency c. Frequencies below c receive stronger randomisation (more freedom), while those above c stay closer to the original magnitude (more rigidity).
    • By sliding c from low to high across diffusion steps, the model can gradually relax structural constraints, giving a smooth “rigidity‑vs‑creativity” dial.
  4. Training & Inference

    • The diffusion denoising network (e.g., UNet, Video‑UNet) is trained exactly as before, except the forward process now follows φ‑PD.
    • At inference time, the reverse diffusion steps are unchanged; the only extra step is the optional selection of the FSS cutoff to meet a desired alignment level.
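
Putting steps 1–3 together, the following PyTorch sketch shows one plausible implementation of the φ‑PD forward corruption. Every name here (phase_preserving_forward, alpha_bar, cutoff) is illustrative, and details such as how the noise magnitude spectrum is drawn are assumptions rather than the paper's exact formulation.

```python
import torch

def phase_preserving_forward(x0, t, alpha_bar, cutoff):
    """One phi-PD-style forward corruption step (a sketch, not the paper's code).

    x0        : (B, C, H, W) clean images in pixel space
    t         : (B,) integer timesteps
    alpha_bar : (T,) cumulative noise schedule
    cutoff    : FSS cutoff c; frequencies below c are randomised (more freedom)
    """
    B, C, H, W = x0.shape
    X = torch.fft.fft2(x0)
    mag, phase = X.abs(), X.angle()          # step 1: magnitude/phase split

    # Step 3: radial frequency grid and single-cutoff FSS mask
    # (1 = randomise this band, 0 = keep it close to the original).
    fy = torch.fft.fftfreq(H).view(1, 1, H, 1)
    fx = torch.fft.fftfreq(W).view(1, 1, 1, W)
    mask = ((fx ** 2 + fy ** 2).sqrt() < cutoff).to(x0.dtype)

    # Step 2: structured noise on the MAGNITUDE only; the phase is untouched.
    # Drawing the noise spectrum from Gaussian pixel noise is an assumption.
    eps_mag = torch.fft.fft2(torch.randn_like(x0)).abs()
    a = alpha_bar[t].view(B, 1, 1, 1)
    noisy_mag = (1 - mask) * mag + mask * (a.sqrt() * mag + (1 - a).sqrt() * eps_mag)

    # Recombine the corrupted magnitude with the ORIGINAL phase.
    return torch.fft.ifft2(noisy_mag * torch.exp(1j * phase)).real
```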

Because the change lives entirely in the forward corruption, any pretrained diffusion model can be fine‑tuned with φ‑PD or even used directly if the authors provide a compatible checkpoint.
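
Because sampling itself is untouched, the cutoff can be chosen per call. A hypothetical usage of the sketch above, with reverse_diffusion standing in for any unchanged pretrained sampler:

```python
import torch

# Rigidity dial: smaller cutoffs keep more of the magnitude spectrum fixed
# (tighter structural alignment); larger cutoffs free more of it.
t_max = torch.full((x0.shape[0],), len(alpha_bar) - 1, dtype=torch.long)
for cutoff in (0.05, 0.15, 0.40):                 # cycles/pixel, made-up values
    x_T = phase_preserving_forward(x0, t_max, alpha_bar, cutoff)
    sample = reverse_diffusion(model, x_T)        # placeholder sampler call
```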

Results & Findings

| Task | Baseline (standard diffusion) | φ‑PD (with FSS) | Notable metric |
| --- | --- | --- | --- |
| Photorealistic image re‑rendering | Misaligned textures, ghosting | Perfect spatial alignment, higher SSIM | ↑ SSIM by 0.12 |
| Stylized image translation | Style bleed across object boundaries | Style respects object edges, cleaner strokes | ↓ LPIPS by 15 % |
| Video‑to‑video translation | Temporal jitter, drift | Stable motion, consistent geometry across frames | ↓ FVD by 18 % |
| Sim‑to‑real (CARLA → Waymo) | Planner success 32 % | Planner success 48 % (≈ 50 % relative gain) | ↑ Planner success rate |

Qualitatively, the authors show side‑by‑side videos where φ‑PD preserves lane markings, vehicle silhouettes, and lighting cues while still injecting the target domain’s texture or style. The single‑parameter FSS control lets users dial from “exact copy” (phase‑only) to “creative remix” (more magnitude noise) without retraining.

Practical Implications

  • Geometric‑aware image‑to‑image pipelines – Developers building tools for photo editing, virtual try‑ons, or medical image translation can now guarantee that anatomical or structural features stay put while the style changes.
  • Simulation‑to‑real transfer for robotics & autonomous driving – By aligning simulated sensor data with real‑world geometry, downstream perception or planning modules see less domain shift, leading to safer, more reliable deployments.
  • Video post‑processing & VFX – Film studios can replace backgrounds or apply artistic filters while keeping motion trajectories intact, reducing the need for costly manual rotoscoping.
  • Zero‑cost upgrade for existing diffusion models – Since φ‑PD adds no extra parameters or inference latency, teams can retrofit their current diffusion‑based services (e.g., DALL·E‑style APIs) to support structure‑preserving generation with a single code change.
  • Fine‑grained control for creative applications – The FSS cutoff acts like a “rigidity knob” that UI designers can expose to end‑users, enabling interactive control over how much the output adheres to the input layout.
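
To make the “single code change” concrete, a vanilla fine‑tuning step might look like the sketch below, which swaps only the corruption call (reusing phase_preserving_forward from the Methodology section; x0‑prediction is an assumption, as the paper may use ε‑ or v‑prediction):

```python
import torch
import torch.nn.functional as F

def train_step(model, x0, alpha_bar, optimizer, cutoff=0.15):
    # Sample a random timestep per example, exactly as in standard training.
    t = torch.randint(0, len(alpha_bar), (x0.shape[0],))

    # The single change: phi-PD corruption replaces isotropic Gaussian noise.
    x_t = phase_preserving_forward(x0, t, alpha_bar, cutoff)

    # Denoiser and loss are untouched (x0-prediction shown for simplicity).
    loss = F.mse_loss(model(x_t, t), x0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```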

Limitations & Future Work

  • Frequency‑mask design is global – The current FSS mask applies the same cutoff across the whole image, which may be sub‑optimal for scenes with mixed high‑frequency (fine details) and low‑frequency (large structures) requirements. Adaptive, spatially varying masks could improve flexibility.
  • Dependence on Fourier representation – While FFT is fast, it assumes periodic boundary conditions; artifacts may appear near image edges, especially for non‑rectangular inputs. Exploring alternative transforms (e.g., wavelets) could mitigate this.
  • Training from scratch vs. fine‑tuning – The paper shows strong results with fine‑tuning, but training a diffusion model from scratch with φ‑PD may require careful schedule tuning; more ablations on this front would help practitioners.
  • Extension to 3‑D data – The authors hint at video applicability, but full 3‑D volumetric or point‑cloud diffusion (e.g., for LiDAR) remains unexplored. Adapting phase‑preserving ideas to those domains is a promising direction.

Overall, φ‑PD opens a practical pathway for developers who need diffusion‑generated content that stays where it belongs—a capability that bridges the gap between artistic flexibility and geometric fidelity.

Authors

  • Yu Zeng
  • Charles Ochoa
  • Mingyuan Zhou
  • Vishal M. Patel
  • Vitor Guizilini
  • Rowan McAllister

Paper Information

  • arXiv ID: 2512.05106v1
  • Categories: cs.CV, cs.GR, cs.LG, cs.RO
  • Published: December 4, 2025
  • PDF: https://arxiv.org/pdf/2512.05106v1