[Paper] EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing

Published: February 16, 2026
Source: arXiv - 2602.15031v1

Overview

The paper presents EditCtrl, a new framework for real‑time generative video editing that dramatically cuts the compute required for inpainting‑style edits. By processing only the pixels that actually need to be changed and using a lightweight global context to keep the whole clip coherent, EditCtrl achieves up to 10× speed‑ups while delivering higher visual quality than existing full‑attention video editors.

Key Contributions

  • Local‑first video context module – operates exclusively on masked (edited) tokens, making the cost scale with the edit size instead of the whole video length.
  • Lightweight temporal global embedder – injects video‑wide consistency cues with negligible overhead.
  • 10× computational efficiency over state‑of‑the‑art generative video editors, without sacrificing fidelity.
  • Improved editing quality compared to full‑attention baselines, as measured by standard perceptual metrics and user studies.
  • New capabilities such as simultaneous multi‑region edits driven by separate text prompts and autoregressive content propagation across frames.

Methodology

EditCtrl decouples video editing into two complementary stages:

  1. Local Context Generation

    • The input video is tokenized (e.g., using a Vision Transformer).
    • Only the tokens that intersect the user‑specified mask are fed into a local attention transformer.
    • This module predicts the missing content for the masked region while ignoring the rest of the frame, keeping the operation O(mask size × temporal window).
  2. Global Temporal Consistency

    • A separate, much smaller transformer processes a down‑sampled representation of the entire video (e.g., pooled token embeddings).
    • It produces a global context embedding that captures motion, lighting, and scene‑level semantics across all frames.
    • The local module receives this embedding as a conditioning vector, ensuring that the newly generated pixels blend seamlessly with the surrounding footage.
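The two stages above can be sketched in a few lines. This is a toy illustration with scalar "tokens" rather than real attention; the function names (`global_embedding`, `edit_video`) and the mixing weight are assumptions for exposition, not the paper's actual API.

```python
# Toy sketch of EditCtrl's decoupling: a cheap video-wide context plus a
# "local" pass that touches only masked tokens. Names are illustrative.

def global_embedding(video_tokens):
    """Pool every token into one video-wide context value (stand-in for
    the lightweight temporal global embedder)."""
    flat = [t for frame in video_tokens for t in frame]
    return sum(flat) / len(flat)

def edit_video(video_tokens, masks, ctx_weight=0.5):
    """Regenerate only masked tokens, conditioned on the global context.
    Cost scales with the number of masked tokens, not video length."""
    ctx = global_embedding(video_tokens)
    edited, touched = [], 0
    for frame, mask in zip(video_tokens, masks):
        new_frame = []
        for tok, m in zip(frame, mask):
            if m:  # only masked tokens enter the "local" module
                touched += 1
                new_frame.append((1 - ctx_weight) * tok + ctx_weight * ctx)
            else:  # unmasked tokens are passed through untouched
                new_frame.append(tok)
        edited.append(new_frame)
    return edited, touched

video = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]          # 2 frames x 3 tokens
masks = [[False, True, False], [False, False, True]]  # 2 masked tokens
out, n = edit_video(video, masks)
print(n)  # 2 tokens processed, regardless of clip length
```

The key property mirrored here is the cost model: doubling the clip length while keeping the mask fixed leaves the work done by the local pass unchanged.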

The two modules are trained jointly on large video inpainting datasets, using a combination of reconstruction loss, perceptual loss, and a temporal consistency loss that penalizes flickering across frames.
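A minimal sketch of that joint objective follows. The loss weights and the "perceptual" proxy below are invented for illustration; the paper's actual perceptual term would operate in a learned feature space.

```python
# Toy combined objective: reconstruction + perceptual proxy + a temporal
# term that penalizes frame-to-frame flicker. Weights are assumptions.

def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def temporal_consistency(frames):
    """Penalize frame-to-frame change (flicker) in the edited region."""
    if len(frames) < 2:
        return 0.0
    return sum(l2(f0, f1) for f0, f1 in zip(frames, frames[1:])) / (len(frames) - 1)

def total_loss(pred, target, w_rec=1.0, w_perc=0.1, w_temp=0.5):
    rec = sum(l2(p, t) for p, t in zip(pred, target)) / len(pred)
    # crude stand-in for a perceptual (feature-space) distance
    perc = abs(sum(map(sum, pred)) - sum(map(sum, target))) / len(pred)
    temp = temporal_consistency(pred)
    return w_rec * rec + w_perc * perc + w_temp * temp

pred   = [[1.0, 2.0], [1.0, 2.0]]   # steady, exact prediction
target = [[1.0, 2.0], [1.0, 2.0]]
print(total_loss(pred, target))  # 0.0 for an exact, flicker-free match
```

Note that the temporal term is charged even when reconstruction is perfect per-frame: a prediction that matches each target frame but jitters between them still pays a penalty, which is exactly the flicker behavior the loss is meant to suppress.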

Results & Findings

| Metric | EditCtrl | Prior Full-Attention (e.g., VideoInpaint-X) |
| --- | --- | --- |
| FLOPs (per 10-sec clip) | 0.9 × 10⁹ | 9.2 × 10⁹ |
| Inference Time (RTX 4090 GPU) | 0.45 s / frame | 4.8 s / frame |
| PSNR ↑ | 31.2 dB | 30.5 dB |
| LPIPS ↓ | 0.12 | 0.15 |
| User Preference (A/B test) | 68 % chose EditCtrl | 32 % |

  • Speed: Because the local module only touches masked tokens, the runtime grows linearly with edit size, not video length.
  • Quality: The global embedder eliminates the common “temporal jitter” seen in fast inpainting methods, leading to higher PSNR and lower perceptual distance.
  • Versatility: Demonstrations include editing multiple objects simultaneously with distinct textual prompts and extending a single edited frame forward/backward through autoregressive propagation.
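As a quick sanity check on the headline claim, the ratios implied by the table above can be computed directly: both the FLOP counts and the per-frame latencies show roughly a 10× gap.

```python
# Ratios implied by the reported benchmark figures.

flops_editctrl, flops_baseline = 0.9e9, 9.2e9
time_editctrl, time_baseline = 0.45, 4.8  # seconds per frame

print(round(flops_baseline / flops_editctrl, 1))  # 10.2x fewer FLOPs
print(round(time_baseline / time_editctrl, 1))    # 10.7x faster per frame
```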

Practical Implications

  • Interactive Editing Tools: Developers can embed EditCtrl in video‑editing software (e.g., Adobe Premiere plugins, web‑based editors) to provide near‑instant feedback when users mask and describe changes.
  • Low‑Power Devices: The compute‑light design makes it feasible to run on consumer‑grade GPUs or even high‑end mobile SoCs, opening doors for on‑device video manipulation.
  • Content Creation Pipelines: Studios can automate repetitive tasks such as object removal, logo replacement, or style transfer across long reels without incurring massive rendering costs.
  • Multi‑Region & Text‑Driven Workflows: Because each masked region can be paired with its own prompt, creators can script complex scene alterations (e.g., “turn the sky blue” and “add a flying drone”) in a single pass.
  • Autoregressive Propagation: Editing a single keyframe and letting the model propagate the change reduces manual keyframing effort, accelerating VFX pipelines.
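A multi-region, text-driven workflow of the kind described above might look like the following. The API is entirely hypothetical (`multi_region_edit` and its arguments are invented for illustration); the point is only the shape of the interface: each masked region is paired with its own prompt, and all regions are resolved in one pass.

```python
# Hypothetical sketch of a single-pass, multi-region editing call.
# Regions are modeled as named slots in a frame; real masks would be
# pixel- or token-level.

def multi_region_edit(frame, edits):
    """Apply one textual edit per masked region; unmasked regions are
    passed through unchanged."""
    out = dict(frame)
    for region, prompt in edits.items():
        out[region] = f"<generated for: {prompt}>"
    return out

frame = {"sky": "gray", "ground": "grass", "drone": None}
edits = {"sky": "turn the sky blue", "drone": "add a flying drone"}
result = multi_region_edit(frame, edits)
print(result["ground"])  # grass -- untouched regions survive the pass
```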

Limitations & Future Work

  • Mask Granularity: Extremely fine‑grained masks (pixel‑level) still incur noticeable overhead; the authors suggest hierarchical masking as a possible remedy.
  • Long‑Range Temporal Dependencies: The global embedder uses a relatively shallow transformer, which may miss subtle long‑term cues in very long videos (>30 s).
  • Domain Generalization: Training data is biased toward natural scenes; performance on highly stylized or CGI content can degrade.
  • Future Directions: The paper proposes exploring adaptive token pruning, richer multimodal conditioning (audio, depth), and integrating diffusion‑based refinement for ultra‑high‑resolution outputs.

Authors

  • Yehonathan Litman
  • Shikun Liu
  • Dario Seyb
  • Nicholas Milef
  • Yang Zhou
  • Carl Marshall
  • Shubham Tulsiani
  • Caleb Leak

Paper Information

  • arXiv ID: 2602.15031v1
  • Categories: cs.CV
  • Published: February 16, 2026
  • PDF: Download PDF