[Paper] EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing
Source: arXiv - 2602.15031v1
Overview
The paper presents EditCtrl, a new framework for real‑time generative video editing that dramatically cuts the compute required for inpainting‑style edits. By processing only the pixels that actually need to be changed and using a lightweight global context to keep the whole clip coherent, EditCtrl achieves up to 10× speed‑ups while delivering higher visual quality than existing full‑attention video editors.
Key Contributions
- Local‑first video context module – operates exclusively on masked (edited) tokens, making the cost scale with the edit size instead of the whole video length.
- Lightweight temporal global embedder – injects video‑wide consistency cues with negligible overhead.
- 10× computational efficiency over state‑of‑the‑art generative video editors, without sacrificing fidelity.
- Improved editing quality compared to full‑attention baselines, as measured by standard perceptual metrics and user studies.
- New capabilities such as simultaneous multi‑region edits driven by separate text prompts and autoregressive content propagation across frames.
Methodology
EditCtrl decouples video editing into two complementary stages:
1. Local Context Generation
- The input video is tokenized (e.g., using a Vision Transformer).
- Only the tokens that intersect the user‑specified mask are fed into a local attention transformer.
- This module predicts the missing content for the masked region while ignoring the rest of the frame, keeping the cost at O(mask size × temporal window).
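To make the cost argument concrete, here is a minimal NumPy sketch of attention restricted to masked tokens (this is an illustration, not the authors' code; the token shapes, projection weights, and function name are hypothetical):

```python
import numpy as np

def local_masked_attention(tokens, mask, w_q, w_k, w_v):
    """Attend only among masked (edited) tokens.

    tokens: (N, d) token embeddings for one temporal window
    mask:   (N,) boolean, True where the user's edit mask covers the token
    The attention matrix is (m, m) for m masked tokens, so the cost
    scales with the edit size m rather than the full token count N.
    """
    m_tokens = tokens[mask]                      # (m, d) edited tokens only
    q = m_tokens @ w_q
    k = m_tokens @ w_k
    v = m_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (m, m) attention scores
    scores -= scores.max(axis=-1, keepdims=True) # stabilize softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = tokens.copy()
    out[mask] = attn @ v                         # write updates back in place
    return out
```

Unmasked tokens pass through untouched, which is the property that lets runtime track the edit region instead of the clip length.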
2. Global Temporal Consistency
- A separate, much smaller transformer processes a down‑sampled representation of the entire video (e.g., pooled token embeddings).
- It produces a global context embedding that captures motion, lighting, and scene‑level semantics across all frames.
- The local module receives this embedding as a conditioning vector, ensuring that the newly generated pixels blend seamlessly with the surrounding footage.
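A hedged sketch of this conditioning path, with mean pooling standing in for the paper's small temporal transformer (all shapes, names, and the additive injection are assumptions, since the paper summary does not specify the exact mechanism):

```python
import numpy as np

def global_context_embedding(video_tokens, w_proj):
    """Pool tokens from every frame into one lightweight context vector.

    video_tokens: (T, N, d) -- T frames, N tokens each (hypothetical shapes).
    Mean pooling is a stand-in for the paper's shallow temporal transformer.
    """
    pooled = video_tokens.mean(axis=(0, 1))  # (d,) clip-level summary
    return pooled @ w_proj                   # (d,) conditioning vector

def condition_local_tokens(local_tokens, ctx):
    """Inject the global context into the local module's inputs.

    Additive conditioning is one simple choice; FiLM-style modulation or
    cross-attention would serve the same role.
    """
    return local_tokens + ctx[None, :]
```

Because the global pass runs once per clip over a pooled representation, its overhead stays small relative to the per-token local module.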
The two modules are trained jointly on large video inpainting datasets, using a combination of reconstruction loss, perceptual loss, and a temporal consistency loss that penalizes flickering across frames.
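The objective above can be sketched as follows; the perceptual term is omitted because it requires a pretrained feature network, and the loss weight is illustrative rather than taken from the paper:

```python
import numpy as np

def editing_loss(pred, target, lam_temporal=0.1):
    """Reconstruction + temporal-consistency loss (perceptual term omitted).

    pred, target: (T, H, W, C) edited clips. The temporal term penalizes
    frame-to-frame changes in the prediction that the target does not show,
    which suppresses flicker in the inpainted region.
    """
    recon = np.mean((pred - target) ** 2)
    d_pred = np.diff(pred, axis=0)       # deltas between consecutive frames
    d_target = np.diff(target, axis=0)
    temporal = np.mean((d_pred - d_target) ** 2)
    return recon + lam_temporal * temporal
```

A prediction that matches the target frame by frame drives both terms to zero; one that matches each frame on average but jitters over time is penalized by the temporal term alone.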
Results & Findings
| Metric | EditCtrl | Prior Full‑Attention (e.g., VideoInpaint‑X) |
|---|---|---|
| FLOPs (per 10‑sec clip) | 0.9 × 10⁹ | 9.2 × 10⁹ |
| Inference Time (GPU, RTX 4090) | 0.45 s / frame | 4.8 s / frame |
| PSNR ↑ | 31.2 dB | 30.5 dB |
| LPIPS ↓ | 0.12 | 0.15 |
| User Preference (A/B test) | 68 % chose EditCtrl | 32 % |
- Speed: Because the local module only touches masked tokens, the runtime grows linearly with edit size, not video length.
- Quality: The global embedder eliminates the common “temporal jitter” seen in fast inpainting methods, leading to higher PSNR and lower perceptual distance.
- Versatility: Demonstrations include editing multiple objects simultaneously with distinct textual prompts and extending a single edited frame forward/backward through autoregressive propagation.
Practical Implications
- Interactive Editing Tools: Developers can embed EditCtrl in video‑editing software (e.g., Adobe Premiere plugins, web‑based editors) to provide near‑instant feedback when users mask and describe changes.
- Low‑Power Devices: The compute‑light design makes it feasible to run on consumer‑grade GPUs or even high‑end mobile SoCs, opening doors for on‑device video manipulation.
- Content Creation Pipelines: Studios can automate repetitive tasks such as object removal, logo replacement, or style transfer across long reels without incurring massive rendering costs.
- Multi‑Region & Text‑Driven Workflows: Because each masked region can be paired with its own prompt, creators can script complex scene alterations (e.g., “turn the sky blue” and “add a flying drone”) in a single pass.
- Autoregressive Propagation: Editing a single keyframe and letting the model propagate the change reduces manual keyframing effort, accelerating VFX pipelines.
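The propagation loop in the last point can be sketched in a few lines; `edit_step` below is a hypothetical stand-in for the model's autoregressive step, and the toy frames are grayscale arrays:

```python
import numpy as np

def propagate_edit(frames, mask, first_edit, edit_step):
    """Propagate one keyframe edit forward through a clip.

    frames:     list of (H, W) arrays (toy grayscale frames)
    mask:       (H, W) boolean edit region
    first_edit: (H, W) edited version of frames[0]
    edit_step:  callable(prev_edited, frame, mask) -> edited frame;
                stands in for the model's conditional generation step.
    """
    edited = [first_edit]
    for frame in frames[1:]:
        # Each frame is edited conditioned on the previous edited frame,
        # so the change flows forward without per-frame manual keyframing.
        edited.append(edit_step(edited[-1], frame, mask))
    return edited
```

With a trivial `edit_step` that copies the previous edit inside the mask, the keyframe change carries through the whole clip; the real model would instead regenerate the masked region coherently with motion and lighting.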
Limitations & Future Work
- Mask Granularity: Extremely fine‑grained masks (pixel‑level) still incur noticeable overhead; the authors suggest hierarchical masking as a possible remedy.
- Long‑Range Temporal Dependencies: The global embedder uses a relatively shallow transformer, which may miss subtle long‑term cues in very long videos (>30 s).
- Domain Generalization: Training data is biased toward natural scenes; performance on highly stylized or CGI content can degrade.
- Future Directions: The paper proposes exploring adaptive token pruning, richer multimodal conditioning (audio, depth), and integrating diffusion‑based refinement for ultra‑high‑resolution outputs.
Authors
- Yehonathan Litman
- Shikun Liu
- Dario Seyb
- Nicholas Milef
- Yang Zhou
- Carl Marshall
- Shubham Tulsiani
- Caleb Leak
Paper Information
- arXiv ID: 2602.15031v1
- Categories: cs.CV
- Published: February 16, 2026