[Paper] RFDM: Residual Flow Diffusion Model for Efficient Causal Video Editing
Source: arXiv - 2602.06871v1
Overview
The paper “RFDM: Residual Flow Diffusion Model for Efficient Causal Video Editing” presents a new way to edit videos using plain‑text prompts while keeping compute costs comparable to image‑only diffusion models. By treating video editing as a frame‑by‑frame, causal process, the authors achieve variable‑length editing without the heavy 3‑D spatiotemporal networks that dominate current video‑diffusion work.
Key Contributions
- Causal video‑to‑video (V2V) editing pipeline – edits each frame conditioned on the previous frame's prediction, enabling videos of arbitrary length.
- Residual Flow Diffusion Model (RFDM) – a novel diffusion forward process that learns to predict residual changes (the “flow”) between consecutive frames rather than full frames, exploiting temporal redundancy.
- Efficient reuse of 2‑D image‑to‑image diffusion models – the architecture builds on existing image diffusion weights, avoiding the need to train massive 3‑D video models from scratch.
- New benchmark for instruction‑guided video editing – covers global/local style transfer and object‑removal tasks, with evaluation metrics that better reflect real‑world editing quality.
- Competitive performance – RFDM matches or exceeds state‑of‑the‑art image‑based editors and approaches fully spatiotemporal video models while using far less compute.
Methodology
- Base Model – Start with a pretrained 2‑D image‑to‑image diffusion model (e.g., Stable Diffusion).
- Causal Conditioning – When editing frame t, the model receives the denoised prediction of frame t‑1 as an additional conditioning input, turning the process into a causal chain.
- Residual Flow Diffusion
  - Forward Process: Instead of adding Gaussian noise to the raw frame, the authors add noise to the difference (residual) between the target edited frame and the previous prediction.
  - Reverse Process: The denoiser learns to reconstruct this residual, which is then added back to the previous frame's prediction to obtain the edited frame t.
  - This focuses learning on what changes between frames, dramatically reducing the amount of information the network must model at each step.
- Training Data – Paired video clips with ground‑truth edits for two tasks: (a) global/local style transfer, (b) object removal. The model learns to map a source video + text prompt → edited video.
- Inference – Given a video of any length and a textual instruction, the model iterates over the frames, applying the residual diffusion step to each, and can stop at any point, making editing truly variable‑length.
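In symbols, the residual forward and reverse processes above can be sketched as follows (the notation is ours, using a standard DDPM‑style noise schedule; the paper's exact parameterization may differ):

```latex
% Residual between the target edited frame x_t and the previous prediction
r_t = x_t - \hat{x}_{t-1}
% Forward process: noise the residual at diffusion step k
q\big(r_t^{(k)} \mid r_t\big) = \mathcal{N}\!\left(\sqrt{\bar{\alpha}_k}\, r_t,\; (1 - \bar{\alpha}_k)\,\mathbf{I}\right)
% Reverse process: the denoiser predicts \hat{r}_t; the edited frame is
\hat{x}_t = \hat{x}_{t-1} + \hat{r}_t
```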
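The causal inference loop can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `denoise_residual` is a hypothetical stand‑in for the trained residual denoiser, and its toy behaviour simply shrinks the noise each step.

```python
import numpy as np

def denoise_residual(noisy_residual, src_frame, prev_edit, prompt_emb):
    """Hypothetical stand-in for RFDM's trained residual denoiser; a real
    implementation would run a conditioned 2-D diffusion U-Net here."""
    # Toy behaviour: shrink the noisy residual toward zero each step.
    return 0.5 * noisy_residual

def edit_video(frames, prompt_emb, num_steps=8, seed=0):
    """Causal inference loop: frame t is conditioned on the *prediction*
    for frame t-1, and only the residual between them is denoised."""
    rng = np.random.default_rng(seed)
    edited = []
    prev = frames[0]  # bootstrap the causal chain from the first source frame
    for src in frames:
        residual = rng.normal(size=src.shape)  # reverse process starts from noise
        for _ in range(num_steps):             # iterative denoising of the residual
            residual = denoise_residual(residual, src, prev, prompt_emb)
        prev = prev + residual                 # add residual back -> edited frame t
        edited.append(prev)
    return edited
```

Because the loop touches one frame at a time, it can stop after any frame, which is what makes the method variable‑length.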
Results & Findings
| Metric / Task | Image‑to‑Image Diffusion | 3‑D Spatiotemporal V2V | RFDM (Ours) |
|---|---|---|---|
| Global style transfer (FID ↓) | 38.2 | 31.5 | 30.8 |
| Local style transfer (LPIPS ↓) | 0.42 | 0.35 | 0.34 |
| Object removal (mAP ↑) | 0.61 | 0.68 | 0.66 |
| Compute (GPU‑hours per hour of video) | 1× | 4× | 1.1× |
- Quality: RFDM consistently outperforms pure image‑based editors and narrows the gap to full 3‑D video models, especially on tasks that require precise temporal consistency (e.g., object removal).
- Efficiency: Because the model reuses 2‑D weights and only processes residuals, the per‑frame compute is almost identical to an image diffusion step, independent of video length.
- Scalability: Experiments show linear scaling with video duration: doubling the number of frames roughly doubles inference time, with none of the superlinear cost incurred by full spatiotemporal attention in 3‑D video models.
Practical Implications
- Developer‑friendly APIs: Existing image diffusion libraries (e.g., Diffusers) can be extended with a few lines of code to support video editing, lowering the barrier to integrate RFDM into production pipelines.
- Real‑time or near‑real‑time editing: The causal, frame‑wise nature makes it possible to edit streaming video on‑the‑fly (e.g., live caption‑driven visual effects, AR filters).
- Cost‑effective content creation: Studios and SaaS platforms can offer text‑driven video editing services without investing in expensive 3‑D video models or massive GPU clusters.
- Fine‑grained control: By focusing on residual flow, developers can more easily combine multiple prompts (e.g., “change the sky to sunset” + “remove the billboard”) without the model “forgetting” earlier edits.
- Cross‑modal extensions: Because the backbone is a 2‑D diffusion model, any improvements in image diffusion (e.g., better samplers, LoRA adapters) immediately benefit video editing.
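The "few lines of code" claim above amounts to wrapping any per‑image editor in a causal loop. A minimal sketch of that pattern follows; the `edit_image(src, prev_edit, prompt)` signature is hypothetical (RFDM's conditioning on the previous prediction is learned, not a plain function argument), and any Diffusers img2img pipeline call could stand behind it.

```python
from typing import Any, Callable, List

def make_causal_video_editor(
    edit_image: Callable[[Any, Any, str], Any]
) -> Callable[[List[Any], str], List[Any]]:
    """Lift a per-image editor into a causal video editor: frame t is
    edited with the prediction for frame t-1 passed as conditioning."""
    def edit_video(frames: List[Any], prompt: str) -> List[Any]:
        if not frames:
            return []
        edited, prev = [], frames[0]  # bootstrap from the first source frame
        for src in frames:
            prev = edit_image(src, prev, prompt)  # condition on previous edit
            edited.append(prev)
        return edited
    return edit_video
```

The same wrapper works for streaming input, since each step needs only the current frame and the last prediction.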
Limitations & Future Work
- Temporal coherence edge cases: While residual flow handles most smooth motions, rapid scene cuts or large object displacements can still produce flickering artifacts.
- Prompt granularity: The model assumes a single global text prompt per video; handling per‑frame or region‑specific prompts would require additional conditioning mechanisms.
- Training data bias: The paired video dataset focuses on style transfer and object removal; extending to more diverse editing operations (e.g., pose manipulation, background replacement) may need broader data.
- Future directions suggested by the authors include:
- Integrating optical‑flow priors to further stabilize fast motion.
- Exploring hierarchical conditioning for multi‑prompt editing.
- Scaling the residual diffusion to higher resolutions and longer sequences with memory‑efficient attention.
Authors
- Mohammadreza Salehi
- Mehdi Noroozi
- Luca Morreale
- Ruchika Chavhan
- Malcolm Chadwick
- Alberto Gil Ramos
- Abhinav Mehrotra
Paper Information
- arXiv ID: 2602.06871v1
- Categories: cs.CV
- Published: February 6, 2026