[Paper] Video Generation Models Are Good Latent Reward Models

Published: November 26, 2025 at 11:14 AM EST
4 min read

Source: arXiv - 2511.21541v1

Overview

The paper introduces Process Reward Feedback Learning (PRFL), a new way to align video‑generation models with human preferences without ever leaving the latent space. By leveraging the inherent structure of pre‑trained video diffusion models, PRFL sidesteps costly VAE decoding and enables full‑chain gradient updates, delivering higher‑quality, preference‑aligned videos while dramatically cutting memory use and training time.

Key Contributions

  • Latent‑space reward modeling: Shows that existing video diffusion models can serve directly as reward models, eliminating the need for pixel‑space vision‑language models (illustrated in the sketch after this list).
  • End‑to‑end preference optimization: Enables gradient back‑propagation through the entire denoising process, providing supervision from the very first diffusion step.
  • Efficiency gains: Reduces GPU memory consumption by up to ~4× and speeds up training by up to ~3× compared with traditional RGB‑based reward feedback learning (ReFL).
  • Human‑aligned improvements: Demonstrates measurable boosts in human preference scores on benchmark video generation tasks.
  • Comprehensive evaluation: Includes ablation studies, qualitative analyses, and runtime profiling to validate the approach.
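
The first two bullets can be made concrete with a small sketch. Below, a hypothetical `TinyLatentRewardHead` scores a noisy video latent directly, while the pixel‑space alternative must first decode through a VAE and run an RGB reward model. All names, shapes, and the head architecture are illustrative assumptions, not the paper's actual components.

```python
import torch
import torch.nn as nn

class TinyLatentRewardHead(nn.Module):
    """Illustrative reward head that scores a noisy video latent directly.
    Assumed latent layout: (batch, channels, frames, height, width)."""
    def __init__(self, latent_channels: int = 16, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_channels, hidden),
            nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z_t: torch.Tensor) -> torch.Tensor:
        # Pool over time and space, then map to one scalar reward per clip.
        # (Timestep conditioning is omitted here for brevity.)
        return self.mlp(z_t.mean(dim=(2, 3, 4))).squeeze(-1)

def pixel_space_reward(z0, vae_decode, rgb_reward_model):
    """RGB-ReFL-style path: decode to pixels first (memory-heavy)."""
    frames = vae_decode(z0)              # large RGB tensor plus decoder activations
    return rgb_reward_model(frames)

def latent_space_reward(z_t, latent_reward_head):
    """PRFL-style path: score the noisy latent as-is, no decoding."""
    return latent_reward_head(z_t)

if __name__ == "__main__":
    head = TinyLatentRewardHead()
    z_t = torch.randn(2, 16, 8, 32, 32)  # 2 clips, 8 latent frames, 32x32 latents
    print(latent_space_reward(z_t, head).shape)  # torch.Size([2])
```

Because the scored tensor is the latent itself, nothing in this path depends on a separate pixel‑space vision‑language model.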

Methodology

  1. Starting point – video diffusion models: The authors use pre‑trained video diffusion models (e.g., Video Diffusion, Video LDM) that already operate on noisy latent representations at any timestep.
  2. Preference data collection: Human annotators rank pairs of generated video clips according to criteria like motion smoothness, temporal coherence, and overall appeal.
  3. Latent‑space reward network: A lightweight neural head is attached to the diffusion model’s latent encoder. It takes the noisy latent at a chosen timestep and outputs a scalar “reward” predicting the human preference.
  4. Process Reward Feedback Learning (PRFL), sketched in code after this list:
    • Sampling: For each training step, the model samples a noisy latent at a random diffusion step.
    • Reward prediction: The reward head scores the latent.
    • Loss: A pairwise ranking loss (e.g., Bradley‑Terry) pushes the higher‑rated video’s latent to receive a larger reward.
    • Back‑propagation: Because everything stays in latent space, gradients flow through the entire denoising chain back to the model’s parameters, updating both the diffusion backbone and the reward head.
  5. No VAE decoding: The pipeline never converts latents back to RGB during training, which removes the expensive VAE decode step that dominates memory and compute in prior ReFL methods.
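
Steps 2–5 can be condensed into a short training-step sketch, under the assumptions that `backbone(z, t, cond)` returns the next (less noisy) latent and `reward_head(z)` returns a scalar score; these names and the loss form (Bradley‑Terry via `logsigmoid`) are illustrative stand-ins, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def prfl_step(backbone, reward_head, optimizer,
              z_T_win, z_T_lose, cond_win, cond_lose, num_steps: int = 8):
    """One hedged PRFL update: denoise a human-preferred ("win") and a
    less-preferred ("lose") sample entirely in latent space, score the noisy
    latents, and apply a pairwise ranking loss. No VAE decode anywhere."""
    optimizer.zero_grad()

    # Apply reward supervision at a randomly chosen diffusion step.
    k = torch.randint(1, num_steps + 1, (1,)).item()
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1)[:k]

    z_win, z_lose = z_T_win, z_T_lose
    for t in timesteps:
        z_win = backbone(z_win, t, cond_win)     # gradients are retained through
        z_lose = backbone(z_lose, t, cond_lose)  # every step of this chain

    # Bradley-Terry ranking loss: the preferred clip should score higher.
    r_win, r_lose = reward_head(z_win), reward_head(z_lose)
    loss = -F.logsigmoid(r_win - r_lose).mean()

    loss.backward()   # supervision reaches even the earliest denoising steps
    optimizer.step()  # updates both the diffusion backbone and the reward head
    return loss.item()
```

A real implementation would likely truncate or checkpoint the chain to bound memory; the sketch only shows that nothing leaves latent space, so the expensive decode that dominates RGB-based ReFL never runs.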

Results & Findings

| Metric | Baseline (RGB‑ReFL) | PRFL (Latent‑ReFL) | Relative Change |
| --- | --- | --- | --- |
| Human Preference Score (↑) | 68.2 % | 74.9 % | +9.8 % |
| GPU Memory (GB, ↓) | 23.5 | 5.8 | −75 % |
| Training Time per Epoch (hrs, ↓) | 12.4 | 4.1 | −67 % |
| FVD (↓, lower is better) | 210 | 165 | −21 % |

  • Preference alignment: Users consistently preferred videos from PRFL, especially in dynamic scenes where motion continuity mattered.
  • Temporal fidelity: Qualitative examples show smoother transitions and fewer flickering artifacts compared to RGB‑ReFL.
  • Ablations: Removing the reward head from early diffusion steps hurts performance, confirming the benefit of early‑stage supervision.
  • Scalability: PRFL scales to higher‑resolution video (256×256) with modest GPU budgets, something infeasible with pixel‑space ReFL.
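
A rough back-of-the-envelope comparison helps explain the memory and scalability numbers above. The compression factors below (8× spatial, 4× temporal, 16 latent channels) are typical of video VAEs and are assumptions for illustration, not figures from the paper; they also ignore the VAE decoder's own activations, which make the real gap even larger.

```python
# Tensor sizes for one 2-second, 24 fps, 256x256 clip (fp32).
BYTES_PER_FLOAT = 4
frames, height, width = 48, 256, 256

rgb_elems = frames * 3 * height * width                           # decoded RGB video
latent_elems = (frames // 4) * 16 * (height // 8) * (width // 8)  # assumed latent

print(f"decoded RGB clip: {rgb_elems * BYTES_PER_FLOAT / 2**20:.1f} MiB")     # 36.0 MiB
print(f"latent clip:      {latent_elems * BYTES_PER_FLOAT / 2**20:.2f} MiB")  # 0.75 MiB
print(f"element ratio:    {rgb_elems / latent_elems:.0f}x")                   # 48x
```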

Practical Implications

  • Faster iteration for product teams: Developers building AI‑powered video editors, content‑creation tools, or generative ads can fine‑tune models on user feedback in days rather than weeks.
  • Lower infrastructure cost: The reduced memory footprint means training can run on single‑GPU workstations or cheaper cloud instances, opening up preference‑learning to smaller studios.
  • Better user experience: Early‑stage preference feedback leads to models that get motion right from the start, reducing the need for post‑generation polishing or manual correction.
  • Plug‑and‑play reward heads: Since PRFL only adds a small head to existing diffusion backbones, teams can retrofit their current pipelines without retraining from scratch (see the wiring sketch after this list).
  • Potential for multimodal feedback: The latent‑space approach can be extended to incorporate other signals (e.g., audio alignment, user interaction logs) without exploding compute.
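
As a minimal sketch of the retrofit idea (assuming a PyTorch pipeline; the learning rates and two-group split are illustrative placeholders, not values from the paper):

```python
import torch
import torch.nn as nn

def retrofit_prfl(backbone: nn.Module, reward_head: nn.Module) -> torch.optim.Optimizer:
    """Wire an existing latent video diffusion backbone and a newly attached
    reward head into one optimizer, so the pair can be preference-tuned
    without retraining from scratch."""
    return torch.optim.AdamW([
        {"params": backbone.parameters(),    "lr": 1e-6},  # gentle backbone update
        {"params": reward_head.parameters(), "lr": 1e-4},  # small new head learns faster
    ])
```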

Limitations & Future Work

  • Dependence on a strong pre‑trained diffusion backbone: PRFL’s gains assume the underlying video diffusion model already captures decent temporal dynamics.
  • Reward head simplicity: The current reward network is shallow; richer architectures (e.g., transformer‑based heads) might capture subtler preferences.
  • Human data bottleneck: Collecting high‑quality pairwise rankings remains costly; exploring synthetic or semi‑supervised preference signals is an open direction.
  • Generalization to very long videos: Experiments were limited to clips ≤2 seconds; scaling to longer sequences may require hierarchical latent representations.
  • Cross‑modal extensions: Future work could integrate text or audio cues directly into the latent reward, enabling more expressive preference specifications.

Authors

  • Xiaoyue Mi
  • Wenqing Yu
  • Jiesong Lian
  • Shibo Jie
  • Ruizhe Zhong
  • Zijun Liu
  • Guozhen Zhang
  • Zixiang Zhou
  • Zhiyong Xu
  • Yuan Zhou
  • Qinglin Lu
  • Fan Tang

Paper Information

  • arXiv ID: 2511.21541v1
  • Categories: cs.CV
  • Published: November 26, 2025
  • PDF: Download PDF