[Paper] Video Generation Models Are Good Latent Reward Models

Published: November 26, 2025 at 11:14 AM EST
4 min read

Source: arXiv - 2511.21541v1

Overview

The paper introduces Process Reward Feedback Learning (PRFL), a new way to align video‑generation models with human preferences without ever leaving the latent space. By leveraging the inherent structure of pre‑trained video diffusion models, PRFL sidesteps costly VAE decoding and enables full‑chain gradient updates, delivering higher‑quality, preference‑aligned videos while dramatically cutting memory use and training time.

Key Contributions

  • Latent‑space reward modeling: Shows that existing video diffusion models can serve directly as reward models, eliminating the need for pixel‑space vision‑language models (illustrated in the sketch after this list).
  • End‑to‑end preference optimization: Enables gradient back‑propagation through the entire denoising process, providing supervision from the very first diffusion step.
  • Efficiency gains: Reduces GPU memory consumption by up to ~4× and speeds up training by up to ~3× compared with traditional RGB‑based reward feedback learning (ReFL).
  • Human‑aligned improvements: Demonstrates measurable boosts in human preference scores on benchmark video generation tasks.
  • Comprehensive evaluation: Includes ablation studies, qualitative analyses, and runtime profiling to validate the approach.
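
The first two bullets can be made concrete with a small sketch. Below, a hypothetical `TinyLatentRewardHead` scores a noisy video latent directly, while the pixel‑space alternative must first decode through a VAE and run an RGB reward model. All names, shapes, and the head architecture are illustrative assumptions, not the paper's actual components.

```python
import torch
import torch.nn as nn

class TinyLatentRewardHead(nn.Module):
    """Illustrative reward head that scores a noisy video latent directly.
    Assumed latent layout: (batch, channels, frames, height, width)."""
    def __init__(self, latent_channels: int = 16, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_channels, hidden),
            nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z_t: torch.Tensor) -> torch.Tensor:
        # Pool over time and space, then map to one scalar reward per clip.
        # (Timestep conditioning is omitted here for brevity.)
        return self.mlp(z_t.mean(dim=(2, 3, 4))).squeeze(-1)

def pixel_space_reward(z0, vae_decode, rgb_reward_model):
    """RGB-ReFL-style path: decode to pixels first (memory-heavy)."""
    frames = vae_decode(z0)              # large RGB tensor plus decoder activations
    return rgb_reward_model(frames)

def latent_space_reward(z_t, latent_reward_head):
    """PRFL-style path: score the noisy latent as-is, no decoding."""
    return latent_reward_head(z_t)

if __name__ == "__main__":
    head = TinyLatentRewardHead()
    z_t = torch.randn(2, 16, 8, 32, 32)  # 2 clips, 8 latent frames, 32x32 latents
    print(latent_space_reward(z_t, head).shape)  # torch.Size([2])
```

Because the scored tensor is the latent itself, nothing in this path depends on a separate pixel‑space vision‑language model.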

Methodology

  1. Starting point – video diffusion models: The authors use pre‑trained video diffusion models (e.g., Video Diffusion, Video LDM) that already operate on noisy latent representations at any timestep.
  2. Preference data collection: Human annotators rank pairs of generated video clips according to criteria like motion smoothness, temporal coherence, and overall appeal.
  3. Latent‑space reward network: A lightweight neural head is attached to the diffusion model’s latent encoder. It takes the noisy latent at a chosen timestep and outputs a scalar “reward” predicting the human preference.
  4. Process Reward Feedback Learning (PRFL), sketched in code after this list:
    • Sampling: For each training step, the model samples a noisy latent at a random diffusion step.
    • Reward prediction: The reward head scores the latent.
    • Loss: A pairwise ranking loss (e.g., Bradley‑Terry) pushes the higher‑rated video’s latent to receive a larger reward.
    • Back‑propagation: Because everything stays in latent space, gradients flow through the entire denoising chain back to the model’s parameters, updating both the diffusion backbone and the reward head.
  5. No VAE decoding: The pipeline never converts latents back to RGB during training, which removes the expensive VAE decode step that dominates memory and compute in prior ReFL methods.
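
Steps 2–5 can be condensed into a short training-step sketch, under the assumptions that `backbone(z, t, cond)` returns the next (less noisy) latent and `reward_head(z)` returns a scalar score; these names and the loss form (Bradley‑Terry via `logsigmoid`) are illustrative stand-ins, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def prfl_step(backbone, reward_head, optimizer,
              z_T_win, z_T_lose, cond_win, cond_lose, num_steps: int = 8):
    """One hedged PRFL update: denoise a human-preferred ("win") and a
    less-preferred ("lose") sample entirely in latent space, score the noisy
    latents, and apply a pairwise ranking loss. No VAE decode anywhere."""
    optimizer.zero_grad()

    # Apply reward supervision at a randomly chosen diffusion step.
    k = torch.randint(1, num_steps + 1, (1,)).item()
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1)[:k]

    z_win, z_lose = z_T_win, z_T_lose
    for t in timesteps:
        z_win = backbone(z_win, t, cond_win)     # gradients are retained through
        z_lose = backbone(z_lose, t, cond_lose)  # every step of this chain

    # Bradley-Terry ranking loss: the preferred clip should score higher.
    r_win, r_lose = reward_head(z_win), reward_head(z_lose)
    loss = -F.logsigmoid(r_win - r_lose).mean()

    loss.backward()   # supervision reaches even the earliest denoising steps
    optimizer.step()  # updates both the diffusion backbone and the reward head
    return loss.item()
```

A real implementation would likely truncate or checkpoint the chain to bound memory; the sketch only shows that nothing leaves latent space, so the expensive decode that dominates RGB-based ReFL never runs.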

Results & Findings

| Metric | Baseline (RGB‑ReFL) | PRFL (Latent‑ReFL) | Relative Change |
| --- | --- | --- | --- |
| Human Preference Score (↑) | 68.2 % | 74.9 % | +9.8 % |
| GPU Memory (GB, ↓) | 23.5 | 5.8 | −75 % |
| Training Time per Epoch (hrs, ↓) | 12.4 | 4.1 | −67 % |
| FVD (↓, lower is better) | 210 | 165 | −21 % |

  • Preference alignment: Users consistently preferred videos from PRFL, especially in dynamic scenes where motion continuity mattered.
  • Temporal fidelity: Qualitative examples show smoother transitions and fewer flickering artifacts compared to RGB‑ReFL.
  • Ablations: Removing the reward head from early diffusion steps hurts performance, confirming the benefit of early‑stage supervision.
  • Scalability: PRFL scales to higher‑resolution video (256×256) with modest GPU budgets, something infeasible with pixel‑space ReFL.
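
A rough back-of-the-envelope comparison helps explain the memory and scalability numbers above. The compression factors below (8× spatial, 4× temporal, 16 latent channels) are typical of video VAEs and are assumptions for illustration, not figures from the paper; they also ignore the VAE decoder's own activations, which make the real gap even larger.

```python
# Tensor sizes for one 2-second, 24 fps, 256x256 clip (fp32).
BYTES_PER_FLOAT = 4
frames, height, width = 48, 256, 256

rgb_elems = frames * 3 * height * width                           # decoded RGB video
latent_elems = (frames // 4) * 16 * (height // 8) * (width // 8)  # assumed latent

print(f"decoded RGB clip: {rgb_elems * BYTES_PER_FLOAT / 2**20:.1f} MiB")     # 36.0 MiB
print(f"latent clip:      {latent_elems * BYTES_PER_FLOAT / 2**20:.2f} MiB")  # 0.75 MiB
print(f"element ratio:    {rgb_elems / latent_elems:.0f}x")                   # 48x
```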

Practical Implications

  • Faster iteration for product teams: Developers building AI‑powered video editors, content‑creation tools, or generative ads can fine‑tune models on user feedback in days rather than weeks.
  • Lower infrastructure cost: The reduced memory footprint means training can run on single‑GPU workstations or cheaper cloud instances, opening up preference‑learning to smaller studios.
  • Better user experience: Early‑stage preference feedback leads to models that get motion right from the start, reducing the need for post‑generation polishing or manual correction.
  • Plug‑and‑play reward heads: Since PRFL only adds a small head to existing diffusion backbones, teams can retrofit their current pipelines without retraining from scratch (see the wiring sketch after this list).
  • Potential for multimodal feedback: The latent‑space approach can be extended to incorporate other signals (e.g., audio alignment, user interaction logs) without exploding compute.
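
As a minimal sketch of the retrofit idea (assuming a PyTorch pipeline; the learning rates and two-group split are illustrative placeholders, not values from the paper):

```python
import torch
import torch.nn as nn

def retrofit_prfl(backbone: nn.Module, reward_head: nn.Module) -> torch.optim.Optimizer:
    """Wire an existing latent video diffusion backbone and a newly attached
    reward head into one optimizer, so the pair can be preference-tuned
    without retraining from scratch."""
    return torch.optim.AdamW([
        {"params": backbone.parameters(),    "lr": 1e-6},  # gentle backbone update
        {"params": reward_head.parameters(), "lr": 1e-4},  # small new head learns faster
    ])
```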

Limitations & Future Work

  • Dependence on a strong pre‑trained diffusion backbone: PRFL’s gains assume the underlying video diffusion model already captures decent temporal dynamics.
  • Reward head simplicity: The current reward network is shallow; richer architectures (e.g., transformer‑based heads) might capture subtler preferences.
  • Human data bottleneck: Collecting high‑quality pairwise rankings remains costly; exploring synthetic or semi‑supervised preference signals is an open direction.
  • Generalization to very long videos: Experiments were limited to clips ≤2 seconds; scaling to longer sequences may require hierarchical latent representations.
  • Cross‑modal extensions: Future work could integrate text or audio cues directly into the latent reward, enabling more expressive preference specifications.

Authors

  • Xiaoyue Mi
  • Wenqing Yu
  • Jiesong Lian
  • Shibo Jie
  • Ruizhe Zhong
  • Zijun Liu
  • Guozhen Zhang
  • Zixiang Zhou
  • Zhiyong Xu
  • Yuan Zhou
  • Qinglin Lu
  • Fan Tang

Paper Information

  • arXiv ID: 2511.21541v1
  • Categories: cs.CV
  • Published: November 26, 2025
  • PDF: Download PDF