[Paper] RFDM: Residual Flow Diffusion Model for Efficient Causal Video Editing
Source: arXiv - 2602.06871v1
Overview
The paper “RFDM: Residual Flow Diffusion Model for Efficient Causal Video Editing” presents a new way to edit videos using plain‑text prompts while keeping compute costs comparable to image‑only diffusion models. By treating video editing as a frame‑by‑frame, causal process, the authors achieve variable‑length editing without the heavy 3‑D spatiotemporal networks that dominate current video‑diffusion work.
Key Contributions
- Causal video‑to‑video (V2V) editing pipeline – edits each frame conditioned on the previous frame's prediction, enabling videos of arbitrary length.
- Residual Flow Diffusion Model (RFDM) – a novel diffusion forward process that learns to predict residual changes (the “flow”) between consecutive frames rather than full frames, exploiting temporal redundancy.
- Efficient reuse of 2‑D image‑to‑image diffusion models – the architecture builds on existing image diffusion weights, avoiding the need to train massive 3‑D video models from scratch.
- New benchmark for instruction‑guided video editing – covers global/local style transfer and object‑removal tasks, with evaluation metrics that better reflect real‑world editing quality.
- Competitive performance – RFDM matches or exceeds state‑of‑the‑art image‑based editors and approaches fully spatiotemporal video models while using far less compute.
Methodology
- Base Model – Start with a pretrained 2‑D image‑to‑image diffusion model (e.g., Stable Diffusion).
- Causal Conditioning – When editing frame t, the model receives the denoised prediction of frame t‑1 as an additional conditioning input, turning the process into a causal chain.
- Residual Flow Diffusion
  - Forward Process: Instead of adding Gaussian noise to the raw frame, the authors add noise to the difference (residual) between the target edited frame and the previous prediction.
  - Reverse Process: The denoiser learns to reconstruct this residual, which is then added back to the previous frame's prediction to obtain the edited frame t.
  - This focuses learning on what changes between frames, dramatically reducing the amount of information the network must model at each step.
- Training Data – Paired video clips with ground‑truth edits for two tasks: (a) global/local style transfer, (b) object removal. The model learns to map a source video + text prompt → edited video.
- Inference – Given a video of any length and a textual instruction, the model iterates over the frames, applying the residual diffusion step to each, and can stop at any point, making editing truly variable‑length.
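In symbols, the residual forward and reverse processes above can be sketched as follows (the notation is ours, using a standard DDPM‑style noise schedule; the paper's exact parameterization may differ):

```latex
% Residual between the target edited frame x_t and the previous prediction
r_t = x_t - \hat{x}_{t-1}
% Forward process: noise the residual at diffusion step k
q\big(r_t^{(k)} \mid r_t\big) = \mathcal{N}\!\left(\sqrt{\bar{\alpha}_k}\, r_t,\; (1 - \bar{\alpha}_k)\,\mathbf{I}\right)
% Reverse process: the denoiser predicts \hat{r}_t; the edited frame is
\hat{x}_t = \hat{x}_{t-1} + \hat{r}_t
```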
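The causal inference loop can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `denoise_residual` is a hypothetical stand‑in for the trained residual denoiser, and its toy behaviour simply shrinks the noise each step.

```python
import numpy as np

def denoise_residual(noisy_residual, src_frame, prev_edit, prompt_emb):
    """Hypothetical stand-in for RFDM's trained residual denoiser; a real
    implementation would run a conditioned 2-D diffusion U-Net here."""
    # Toy behaviour: shrink the noisy residual toward zero each step.
    return 0.5 * noisy_residual

def edit_video(frames, prompt_emb, num_steps=8, seed=0):
    """Causal inference loop: frame t is conditioned on the *prediction*
    for frame t-1, and only the residual between them is denoised."""
    rng = np.random.default_rng(seed)
    edited = []
    prev = frames[0]  # bootstrap the causal chain from the first source frame
    for src in frames:
        residual = rng.normal(size=src.shape)  # reverse process starts from noise
        for _ in range(num_steps):             # iterative denoising of the residual
            residual = denoise_residual(residual, src, prev, prompt_emb)
        prev = prev + residual                 # add residual back -> edited frame t
        edited.append(prev)
    return edited
```

Because the loop touches one frame at a time, it can stop after any frame, which is what makes the method variable‑length.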
Results & Findings
| Metric / Task | Image‑to‑Image Diffusion | 3‑D Spatiotemporal V2V | RFDM (Ours) |
|---|---|---|---|
| Global style transfer (FID ↓) | 38.2 | 31.5 | 30.8 |
| Local style transfer (LPIPS ↓) | 0.42 | 0.35 | 0.34 |
| Object removal (mAP ↑) | 0.61 | 0.68 | 0.66 |
| Compute (GPU‑hours per hour of video) | 1× | 4× | 1.1× |
- Quality: RFDM consistently outperforms pure image‑based editors and narrows the gap to full 3‑D video models, especially on tasks that require precise temporal consistency (e.g., object removal).
- Efficiency: Because the model reuses 2‑D weights and only processes residuals, the per‑frame compute is almost identical to an image diffusion step, independent of video length.
- Scalability: Experiments show linear scaling with video duration: doubling the number of frames roughly doubles inference time, with none of the superlinear cost incurred by full spatiotemporal attention in 3‑D video models.
Practical Implications
- Developer‑friendly APIs: Existing image diffusion libraries (e.g., Diffusers) can be extended with a few lines of code to support video editing, lowering the barrier to integrate RFDM into production pipelines.
- Real‑time or near‑real‑time editing: The causal, frame‑wise nature makes it possible to edit streaming video on‑the‑fly (e.g., live caption‑driven visual effects, AR filters).
- Cost‑effective content creation: Studios and SaaS platforms can offer text‑driven video editing services without investing in expensive 3‑D video models or massive GPU clusters.
- Fine‑grained control: By focusing on residual flow, developers can more easily combine multiple prompts (e.g., “change the sky to sunset” + “remove the billboard”) without the model “forgetting” earlier edits.
- Cross‑modal extensions: Because the backbone is a 2‑D diffusion model, any improvements in image diffusion (e.g., better samplers, LoRA adapters) immediately benefit video editing.
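The "few lines of code" claim above amounts to wrapping any per‑image editor in a causal loop. A minimal sketch of that pattern follows; the `edit_image(src, prev_edit, prompt)` signature is hypothetical (RFDM's conditioning on the previous prediction is learned, not a plain function argument), and any Diffusers img2img pipeline call could stand behind it.

```python
from typing import Any, Callable, List

def make_causal_video_editor(
    edit_image: Callable[[Any, Any, str], Any]
) -> Callable[[List[Any], str], List[Any]]:
    """Lift a per-image editor into a causal video editor: frame t is
    edited with the prediction for frame t-1 passed as conditioning."""
    def edit_video(frames: List[Any], prompt: str) -> List[Any]:
        if not frames:
            return []
        edited, prev = [], frames[0]  # bootstrap from the first source frame
        for src in frames:
            prev = edit_image(src, prev, prompt)  # condition on previous edit
            edited.append(prev)
        return edited
    return edit_video
```

The same wrapper works for streaming input, since each step needs only the current frame and the last prediction.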
Limitations & Future Work
- Temporal coherence edge cases: While residual flow handles most smooth motions, rapid scene cuts or large object displacements can still produce flickering artifacts.
- Prompt granularity: The model assumes a single global text prompt per video; handling per‑frame or region‑specific prompts would require additional conditioning mechanisms.
- Training data bias: The paired video dataset focuses on style transfer and object removal; extending to more diverse editing operations (e.g., pose manipulation, background replacement) may need broader data.
- Future directions suggested by the authors include:
- Integrating optical‑flow priors to further stabilize fast motion.
- Exploring hierarchical conditioning for multi‑prompt editing.
- Scaling the residual diffusion to higher resolutions and longer sequences with memory‑efficient attention.
Authors
- Mohammadreza Salehi
- Mehdi Noroozi
- Luca Morreale
- Ruchika Chavhan
- Malcolm Chadwick
- Alberto Gil Ramos
- Abhinav Mehrotra
Paper Information
- arXiv ID: 2602.06871v1
- Categories: cs.CV
- Published: February 6, 2026