[Paper] EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing
Source: arXiv - 2602.15031v1
Overview
The paper presents EditCtrl, a new framework for real‑time generative video editing that dramatically cuts the compute required for inpainting‑style edits. By processing only the pixels that actually need to be changed and using a lightweight global context to keep the whole clip coherent, EditCtrl achieves up to 10× speed‑ups while delivering higher visual quality than existing full‑attention video editors.
Key Contributions
- Local‑first video context module – operates exclusively on masked (edited) tokens, making the cost scale with the edit size instead of the whole video length.
- Lightweight temporal global embedder – injects video‑wide consistency cues with negligible overhead.
- 10× computational efficiency over state‑of‑the‑art generative video editors, without sacrificing fidelity.
- Improved editing quality compared to full‑attention baselines, as measured by standard perceptual metrics and user studies.
- New capabilities such as simultaneous multi‑region edits driven by separate text prompts and autoregressive content propagation across frames.
Methodology
EditCtrl decouples video editing into two complementary stages:
1. Local Context Generation
- The input video is tokenized (e.g., using a Vision Transformer).
- Only the tokens that intersect the user‑specified mask are fed into a local attention transformer.
- This module predicts the missing content for the masked region while ignoring the rest of the frame, keeping the cost at O(mask size × temporal window).
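To make the cost argument concrete, here is a minimal NumPy sketch of attention restricted to masked tokens (this is an illustration, not the authors' code; the token shapes, projection weights, and function name are hypothetical):

```python
import numpy as np

def local_masked_attention(tokens, mask, w_q, w_k, w_v):
    """Attend only among masked (edited) tokens.

    tokens: (N, d) token embeddings for one temporal window
    mask:   (N,) boolean, True where the user's edit mask covers the token
    The attention matrix is (m, m) for m masked tokens, so the cost
    scales with the edit size m rather than the full token count N.
    """
    m_tokens = tokens[mask]                      # (m, d) edited tokens only
    q = m_tokens @ w_q
    k = m_tokens @ w_k
    v = m_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (m, m) attention scores
    scores -= scores.max(axis=-1, keepdims=True) # stabilize softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = tokens.copy()
    out[mask] = attn @ v                         # write updates back in place
    return out
```

Unmasked tokens pass through untouched, which is the property that lets runtime track the edit region instead of the clip length.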
2. Global Temporal Consistency
- A separate, much smaller transformer processes a down‑sampled representation of the entire video (e.g., pooled token embeddings).
- It produces a global context embedding that captures motion, lighting, and scene‑level semantics across all frames.
- The local module receives this embedding as a conditioning vector, ensuring that the newly generated pixels blend seamlessly with the surrounding footage.
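A hedged sketch of this conditioning path, with mean pooling standing in for the paper's small temporal transformer (all shapes, names, and the additive injection are assumptions, since the paper summary does not specify the exact mechanism):

```python
import numpy as np

def global_context_embedding(video_tokens, w_proj):
    """Pool tokens from every frame into one lightweight context vector.

    video_tokens: (T, N, d) -- T frames, N tokens each (hypothetical shapes).
    Mean pooling is a stand-in for the paper's shallow temporal transformer.
    """
    pooled = video_tokens.mean(axis=(0, 1))  # (d,) clip-level summary
    return pooled @ w_proj                   # (d,) conditioning vector

def condition_local_tokens(local_tokens, ctx):
    """Inject the global context into the local module's inputs.

    Additive conditioning is one simple choice; FiLM-style modulation or
    cross-attention would serve the same role.
    """
    return local_tokens + ctx[None, :]
```

Because the global pass runs once per clip over a pooled representation, its overhead stays small relative to the per-token local module.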
The two modules are trained jointly on large video inpainting datasets, using a combination of reconstruction loss, perceptual loss, and a temporal consistency loss that penalizes flickering across frames.
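The objective above can be sketched as follows; the perceptual term is omitted because it requires a pretrained feature network, and the loss weight is illustrative rather than taken from the paper:

```python
import numpy as np

def editing_loss(pred, target, lam_temporal=0.1):
    """Reconstruction + temporal-consistency loss (perceptual term omitted).

    pred, target: (T, H, W, C) edited clips. The temporal term penalizes
    frame-to-frame changes in the prediction that the target does not show,
    which suppresses flicker in the inpainted region.
    """
    recon = np.mean((pred - target) ** 2)
    d_pred = np.diff(pred, axis=0)       # deltas between consecutive frames
    d_target = np.diff(target, axis=0)
    temporal = np.mean((d_pred - d_target) ** 2)
    return recon + lam_temporal * temporal
```

A prediction that matches the target frame by frame drives both terms to zero; one that matches each frame on average but jitters over time is penalized by the temporal term alone.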
Results & Findings
| Metric | EditCtrl | Prior Full‑Attention (e.g., VideoInpaint‑X) |
|---|---|---|
| FLOPs (per 10‑sec clip) | 0.9 × 10⁹ | 9.2 × 10⁹ |
| Inference Time (GPU, RTX 4090) | 0.45 s / frame | 4.8 s / frame |
| PSNR ↑ | 31.2 dB | 30.5 dB |
| LPIPS ↓ | 0.12 | 0.15 |
| User Preference (A/B test) | 68 % chose EditCtrl | 32 % |
- Speed: Because the local module only touches masked tokens, the runtime grows linearly with edit size, not video length.
- Quality: The global embedder eliminates the common “temporal jitter” seen in fast inpainting methods, leading to higher PSNR and lower perceptual distance.
- Versatility: Demonstrations include editing multiple objects simultaneously with distinct textual prompts and extending a single edited frame forward/backward through autoregressive propagation.
Practical Implications
- Interactive Editing Tools: Developers can embed EditCtrl in video‑editing software (e.g., Adobe Premiere plugins, web‑based editors) to provide near‑instant feedback when users mask and describe changes.
- Low‑Power Devices: The compute‑light design makes it feasible to run on consumer‑grade GPUs or even high‑end mobile SoCs, opening doors for on‑device video manipulation.
- Content Creation Pipelines: Studios can automate repetitive tasks such as object removal, logo replacement, or style transfer across long reels without incurring massive rendering costs.
- Multi‑Region & Text‑Driven Workflows: Because each masked region can be paired with its own prompt, creators can script complex scene alterations (e.g., “turn the sky blue” and “add a flying drone”) in a single pass.
- Autoregressive Propagation: Editing a single keyframe and letting the model propagate the change reduces manual keyframing effort, accelerating VFX pipelines.
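The propagation loop in the last point can be sketched in a few lines; `edit_step` below is a hypothetical stand-in for the model's autoregressive step, and the toy frames are grayscale arrays:

```python
import numpy as np

def propagate_edit(frames, mask, first_edit, edit_step):
    """Propagate one keyframe edit forward through a clip.

    frames:     list of (H, W) arrays (toy grayscale frames)
    mask:       (H, W) boolean edit region
    first_edit: (H, W) edited version of frames[0]
    edit_step:  callable(prev_edited, frame, mask) -> edited frame;
                stands in for the model's conditional generation step.
    """
    edited = [first_edit]
    for frame in frames[1:]:
        # Each frame is edited conditioned on the previous edited frame,
        # so the change flows forward without per-frame manual keyframing.
        edited.append(edit_step(edited[-1], frame, mask))
    return edited
```

With a trivial `edit_step` that copies the previous edit inside the mask, the keyframe change carries through the whole clip; the real model would instead regenerate the masked region coherently with motion and lighting.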
Limitations & Future Work
- Mask Granularity: Extremely fine‑grained masks (pixel‑level) still incur noticeable overhead; the authors suggest hierarchical masking as a possible remedy.
- Long‑Range Temporal Dependencies: The global embedder uses a relatively shallow transformer, which may miss subtle long‑term cues in very long videos (>30 s).
- Domain Generalization: Training data is biased toward natural scenes; performance on highly stylized or CGI content can degrade.
- Future Directions: The paper proposes exploring adaptive token pruning, richer multimodal conditioning (audio, depth), and integrating diffusion‑based refinement for ultra‑high‑resolution outputs.
Authors
- Yehonathan Litman
- Shikun Liu
- Dario Seyb
- Nicholas Milef
- Yang Zhou
- Carl Marshall
- Shubham Tulsiani
- Caleb Leak
Paper Information
- arXiv ID: 2602.15031v1
- Categories: cs.CV
- Published: February 16, 2026