[Paper] RFDM: Residual Flow Diffusion Model for Efficient Causal Video Editing

Published: February 6, 2026 at 11:56 AM EST
4 min read

Source: arXiv - 2602.06871v1

Overview

The paper “RFDM: Residual Flow Diffusion Model for Efficient Causal Video Editing” presents a new way to edit videos using plain‑text prompts while keeping compute costs comparable to image‑only diffusion models. By treating video editing as a frame‑by‑frame, causal process, the authors achieve variable‑length editing without the heavy 3‑D spatiotemporal networks that dominate current video‑diffusion work.

Key Contributions

  • Causal V2V editing pipeline – edits each frame conditioned on the previous frame’s prediction, enabling arbitrary video lengths.
  • Residual Flow Diffusion Model (RFDM) – a novel diffusion forward process that learns to predict residual changes (the “flow”) between consecutive frames rather than full frames, exploiting temporal redundancy.
  • Efficient reuse of 2‑D image‑to‑image diffusion models – the architecture builds on existing image diffusion weights, avoiding the need to train massive 3‑D video models from scratch.
  • New benchmark for instructional video editing – includes global/local style transfer and object removal tasks, with evaluation metrics that better reflect real‑world editing quality.
  • Competitive performance – RFDM matches or exceeds state‑of‑the‑art image‑based editors and approaches fully spatiotemporal video models while using far less compute.

Methodology

  1. Base Model – Start with a pretrained 2‑D image‑to‑image diffusion model (e.g., Stable Diffusion).
  2. Causal Conditioning – When editing frame t, the model receives the denoised prediction of frame t‑1 as an additional conditioning input, turning the process into a causal chain.
  3. Residual Flow Diffusion
    • Forward Process: Instead of adding Gaussian noise to the raw frame, the authors add noise to the difference (residual) between the target edited frame and the previous prediction.
    • Reverse Process: The denoiser learns to reconstruct this residual, which is then added back to the previous frame’s prediction to obtain the edited frame t.
    • This focuses learning on what changes between frames, dramatically reducing the amount of information the network must model at each step (a minimal sketch of this forward/reverse residual step follows the list).
  4. Training Data – Paired video clips with ground‑truth edits for two tasks: (a) global/local style transfer, (b) object removal. The model learns to map a source video + text prompt → edited video.
  5. Inference – Given any length video and a textual instruction, the model iterates over frames, applying the residual diffusion step, and can stop at any point—making it truly variable‑length.
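
To make the residual formulation concrete, here is a minimal PyTorch sketch of one training step under the process described above. The function names, the noise schedule, and the plain epsilon-prediction loss are illustrative assumptions on our part; `denoiser` stands in for the pretrained 2-D image-diffusion U-Net, and the paper's exact parameterization may differ.

```python
import torch
import torch.nn.functional as F

def residual_forward(x_target, x_prev_pred, alpha_bar_t):
    """Forward process (sketch): noise the residual between the target
    edited frame and the previous frame's prediction, not the raw frame.
    alpha_bar_t is the cumulative noise-schedule coefficient at step t."""
    residual = x_target - x_prev_pred
    noise = torch.randn_like(residual)
    noisy_residual = (alpha_bar_t.sqrt() * residual
                      + (1.0 - alpha_bar_t).sqrt() * noise)
    return noisy_residual, noise

def training_step(denoiser, x_target, x_prev_pred, t, alpha_bar, text_emb):
    """One training step (sketch): the 2-D denoiser predicts the noise
    added to the residual, conditioned on the previous frame's prediction
    and the text embedding. The denoiser call signature is an assumption."""
    noisy_residual, noise = residual_forward(x_target, x_prev_pred, alpha_bar[t])
    pred_noise = denoiser(noisy_residual, t, cond=(x_prev_pred, text_emb))
    return F.mse_loss(pred_noise, noise)
```

At inference, the predicted residual is denoised and added back to frame t-1's prediction to obtain frame t, so each frame costs roughly one image-diffusion pass.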

Results & Findings

Metric / Task                          | Image-to-Image Diffusion | 3-D Spatiotemporal V2V | RFDM (Ours)
Global style transfer (FID)            | 38.2                     | 31.5                   | 30.8
Local style transfer (LPIPS)           | 0.42                     | 0.35                   | 0.34
Object removal (mAP)                   | 0.61                     | 0.68                   | 0.66
Compute (GPU-hours per hour of video)  | —                        | —                      | 1.1×

  • Quality: RFDM consistently outperforms pure image‑based editors and narrows the gap to full 3‑D video models, especially on tasks that require precise temporal consistency (e.g., object removal).
  • Efficiency: Because the model reuses 2‑D weights and only processes residuals, the per‑frame compute is almost identical to an image diffusion step, independent of video length.
  • Scalability: Experiments show linear scaling with video duration: doubling the number of frames roughly doubles inference time, with no hidden quadratic cost of the kind incurred by full spatiotemporal attention in 3-D video models (a back-of-envelope illustration follows this list).
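
As a back-of-envelope illustration of that scaling claim (a toy cost model of ours, not numbers from the paper): a causal frame-wise editor pays a roughly constant cost per frame, whereas a model with cross-frame attention picks up a quadratic term in the frame count.

```python
def causal_cost(n_frames, per_frame=1.0):
    """Frame-wise causal editing: total cost grows linearly with length."""
    return n_frames * per_frame

def spatiotemporal_cost(n_frames, per_frame=1.0, attn_coeff=0.05):
    """Full cross-frame attention adds a quadratic term (toy model)."""
    return n_frames * per_frame + attn_coeff * n_frames ** 2

for n in (16, 32, 64):  # doubling frames roughly doubles the causal cost
    print(n, causal_cost(n), spatiotemporal_cost(n))
```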

Practical Implications

  • Developer‑friendly APIs: Existing image diffusion libraries (e.g., Diffusers) can be extended with a few lines of code to support video editing, lowering the barrier to integrating RFDM into production pipelines (see the sketch after this list).
  • Real‑time or near‑real‑time editing: The causal, frame‑wise nature makes it possible to edit streaming video on‑the‑fly (e.g., live caption‑driven visual effects, AR filters).
  • Cost‑effective content creation: Studios and SaaS platforms can offer text‑driven video editing services without investing in expensive 3‑D video models or massive GPU clusters.
  • Fine‑grained control: By focusing on residual flow, developers can more easily combine multiple prompts (e.g., “change the sky to sunset” + “remove the billboard”) without the model “forgetting” earlier edits.
  • Cross‑modal extensions: Because the backbone is a 2‑D diffusion model, any improvements in image diffusion (e.g., better samplers, LoRA adapters) immediately benefit video editing.
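
As a rough sketch of that integration path, the loop below wires a causal frame-by-frame chain on top of the public Diffusers img2img pipeline. This is not RFDM itself: the residual-flow denoiser would replace the plain img2img call, and the model checkpoint, blend weight, and strength values are illustrative assumptions.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Off-the-shelf img2img pipeline as a stand-in backbone; RFDM's
# residual-flow step would replace the pipe(...) call below.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def edit_video(frames, prompt, strength=0.4, carry=0.5):
    """Causal frame-wise editing sketch: each frame's input is blended
    with the previous *edited* frame so edits propagate down the chain.
    `frames` is a list of same-sized PIL images; `carry` controls how
    much of the previous prediction is mixed in (an arbitrary choice)."""
    edited, prev = [], None
    for frame in frames:
        cond = frame if prev is None else Image.blend(frame, prev, carry)
        out = pipe(prompt=prompt, image=cond, strength=strength).images[0]
        edited.append(out)
        prev = out
    return edited
```

Because each iteration consumes only the current frame and the previous prediction, the same loop works for streams of arbitrary length, which is what enables the near-real-time use cases above.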

Limitations & Future Work

  • Temporal coherence edge cases: While residual flow handles most smooth motions, rapid scene cuts or large object displacements can still produce flickering artifacts.
  • Prompt granularity: The model assumes a single global text prompt per video; handling per‑frame or region‑specific prompts would require additional conditioning mechanisms.
  • Training data bias: The paired video dataset focuses on style transfer and object removal; extending to more diverse editing operations (e.g., pose manipulation, background replacement) may need broader data.
  • Future directions suggested by the authors include:
    1. Integrating optical‑flow priors to further stabilize fast motion.
    2. Exploring hierarchical conditioning for multi‑prompt editing.
    3. Scaling the residual diffusion to higher resolutions and longer sequences with memory‑efficient attention.

Authors

  • Mohammadreza Salehi
  • Mehdi Noroozi
  • Luca Morreale
  • Ruchika Chavhan
  • Malcolm Chadwick
  • Alberto Gil Ramos
  • Abhinav Mehrotra

Paper Information

  • arXiv ID: 2602.06871v1
  • Categories: cs.CV
  • Published: February 6, 2026