[Paper] Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Published: March 2, 2026 at 01:46 PM EST
4 min read
Source: arXiv - 2603.02175v1

Overview

The paper presents Kiwi-Edit, a new framework that lets developers edit videos by combining natural‑language instructions with visual reference cues (e.g., an image or short clip). By generating a massive synthetic dataset (RefVIE) and a unified model architecture, the authors achieve far more precise and controllable video edits than prior instruction‑only methods.

Key Contributions

  • Scalable data pipeline that converts existing video‑editing pairs into high‑quality quadruplets (source video, instruction, reference image, edited video) using state‑of‑the‑art image generators.
  • RefVIE dataset: 200K training quadruplets covering diverse editing scenarios, released publicly.
  • RefVIE‑Bench: a comprehensive benchmark suite (automatic metrics + human evaluation) for instruction‑and‑reference video editing.
  • Kiwi‑Edit architecture: merges learnable query tokens (for textual instructions) with latent visual features extracted from reference images, enabling fine‑grained semantic guidance.
  • Progressive multi‑stage training that first learns instruction following, then refines reference fidelity, yielding large performance gains.

Methodology

  1. Data Generation

    • Start from publicly available video‑editing datasets (e.g., VGG‑Sound, DAVIS).
    • For each source‑target video pair, synthesize a reference scaffold by prompting a diffusion image model (Stable Diffusion) with the editing instruction.
    • The result is a quadruplet: (source video, textual instruction, reference image, edited video).
    • Automatic quality checks (CLIP similarity, motion consistency) filter out low‑fidelity samples, producing the RefVIE corpus.
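The CLIP-similarity filtering step can be sketched as a simple threshold over embedding similarity. This is a toy illustration, not the authors' code: it assumes each candidate quadruplet already carries precomputed embeddings for the reference image (`ref_embed`) and the edited video (`edit_embed`), and the 0.7 threshold is an arbitrary placeholder.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def filter_quadruplets(quadruplets, clip_threshold=0.7):
    """Keep only quadruplets whose reference embedding is close enough
    to the edited-video embedding (a stand-in for the paper's
    CLIP-similarity quality check)."""
    kept = []
    for q in quadruplets:
        sim = cosine_similarity(q["ref_embed"], q["edit_embed"])
        if sim >= clip_threshold:
            kept.append(q)
    return kept

# Toy example with 2-D "embeddings".
data = [
    {"id": "a", "ref_embed": [1.0, 0.0], "edit_embed": [0.9, 0.1]},  # similar
    {"id": "b", "ref_embed": [1.0, 0.0], "edit_embed": [0.0, 1.0]},  # dissimilar
]
print([q["id"] for q in filter_quadruplets(data)])  # -> ['a']
```

In practice the embeddings would come from a CLIP image/video encoder, and the motion-consistency check would add a second filter of the same shape.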
  2. Model Architecture (Kiwi‑Edit)

    • Backbone: a video transformer encoder processes the source frames into spatio‑temporal tokens.
    • Instruction Encoder: a frozen language model (e.g., T5) generates learnable query embeddings that attend to the video tokens.
    • Reference Encoder: a CNN‑ViT hybrid extracts latent visual features from the reference image; these are injected as additional keys/values in the cross‑attention layers.
    • Decoder: a conditional diffusion model predicts the edited video frames, guided by both instruction queries and reference features.
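The key architectural idea — injecting reference features as extra keys/values in cross-attention — can be sketched with a single-head attention layer. This is a minimal NumPy illustration under assumed token shapes, not the paper's implementation (which operates on diffusion-model latents with multi-head attention):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, video_kv, ref_kv):
    """Single-head cross-attention where reference-image features are
    appended to the video tokens as additional keys/values, so each
    instruction query attends to both sources at once."""
    kv = np.concatenate([video_kv, ref_kv], axis=0)  # (T + R, d)
    d = queries.shape[-1]
    scores = queries @ kv.T / np.sqrt(d)             # (Q, T + R)
    weights = softmax(scores, axis=-1)
    return weights @ kv                              # (Q, d)

rng = np.random.default_rng(0)
q   = rng.normal(size=(4, 16))   # 4 instruction query tokens
vid = rng.normal(size=(32, 16))  # 32 spatio-temporal video tokens
ref = rng.normal(size=(8, 16))   # 8 reference-image tokens
out = cross_attention(q, vid, ref)
print(out.shape)  # (4, 16)
```

Because the reference tokens sit in the same key/value space as the video tokens, the model can trade off textual and visual guidance per query rather than fusing them in a fixed ratio.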
  3. Training Curriculum

    • Stage 1 – Instruction‑only: train on RefVIE without reference conditioning to learn basic edit semantics.
    • Stage 2 – Reference‑aware: fine‑tune with the full quadruplets, gradually increasing the weight of reference loss (CLIP‑based similarity between generated frames and reference).
    • Stage 3 – Multi‑modal refinement: jointly optimize perceptual video quality (VMAF) and temporal consistency (flow‑based loss).
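The three-stage curriculum amounts to changing which loss terms are active and how heavily the reference loss is weighted. The sketch below is illustrative only — the specific weights and the linear ramp are assumptions, not values from the paper:

```python
def total_loss(stage, instr_loss, ref_loss, vmaf_loss=0.0, flow_loss=0.0,
               ref_ramp=0.0):
    """Combine losses per the three-stage curriculum.
    `ref_ramp` in [0, 1] gradually raises the reference-loss weight
    during Stage 2; the 0.1 weights in Stage 3 are placeholders."""
    if stage == 1:  # instruction-only: no reference conditioning
        return instr_loss
    if stage == 2:  # reference-aware: ramp up the reference term
        return instr_loss + ref_ramp * ref_loss
    if stage == 3:  # multi-modal refinement: add quality/consistency terms
        return instr_loss + ref_loss + 0.1 * vmaf_loss + 0.1 * flow_loss
    raise ValueError(f"unknown stage {stage}")

print(total_loss(1, 1.0, 0.5))                # 1.0
print(total_loss(2, 1.0, 0.5, ref_ramp=0.5))  # 1.25
print(total_loss(3, 1.0, 0.5, 0.2, 0.3))      # 1.55
```

The ramp is what makes the curriculum "progressive": early Stage-2 batches behave almost like Stage 1, so the model never sees an abrupt objective change.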

Results & Findings

| Metric | Instruction‑Only Baseline | Kiwi‑Edit (full) |
| --- | --- | --- |
| CLIP‑Text↔Video similarity (↑) | 0.62 | 0.78 |
| CLIP‑Image↔Video similarity, reference fidelity (↑) | 0.48 | 0.71 |
| FVD (↓) | 210 | 112 |
| Human preference, pairwise (↑) | 32 % | 68 % |

  • Instruction adherence improves by ~25 % (CLIP‑Text score).
  • Reference fidelity jumps by ~45 % (CLIP‑Image score), meaning the edited video visually matches the supplied reference far better than prior methods.
  • Temporal coherence remains strong thanks to the flow‑aware loss, with no noticeable flickering.
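The flow-based consistency idea can be illustrated with a toy loss: warp each frame toward the next using the flow and penalize the residual. This sketch uses a constant integer horizontal shift as the "flow"; real flow-aware losses use dense per-pixel flow with bilinear warping.

```python
import numpy as np

def temporal_consistency_loss(frames, flow_dx):
    """Toy flow-based consistency loss: warp frame t toward frame t+1
    with a constant integer horizontal flow, then take the mean
    absolute error of the residual."""
    loss = 0.0
    for t in range(len(frames) - 1):
        warped = np.roll(frames[t], flow_dx, axis=1)  # shift along width
        loss += np.abs(frames[t + 1] - warped).mean()
    return loss / (len(frames) - 1)

# A pattern sliding right by 1 px per frame is perfectly explained
# by flow_dx=1, so the loss is zero.
frames = [np.roll(np.eye(8), t, axis=1) for t in range(4)]
print(temporal_consistency_loss(frames, flow_dx=1))  # 0.0
```

A non-zero residual indicates motion in the edited video that the flow cannot explain — i.e., flicker — which is what the loss suppresses.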

Ablation studies confirm that (1) the reference encoder contributes most to image similarity, and (2) the progressive curriculum outperforms end‑to‑end training by ~10 % on FVD.

Practical Implications

  • Content creation pipelines: Video editors can now specify what to change via text (e.g., “make the sky sunset‑orange”) and how it should look via a reference image, dramatically reducing manual key‑framing.
  • Rapid prototyping for AR/VR: Developers can generate scene variations on‑the‑fly by swapping reference assets, useful for game level design or virtual production.
  • Automated post‑production: Brands can enforce visual consistency across campaigns by providing a style reference; Kiwi‑Edit will adapt raw footage accordingly.
  • Open‑source ecosystem: With the code, dataset, and pretrained weights released, teams can fine‑tune the model on domain‑specific assets (e.g., medical imaging videos) without collecting massive paired data.

Limitations & Future Work

  • Reference quality dependence: The model assumes the reference image accurately captures the desired visual attribute; ambiguous or low‑resolution references degrade performance.
  • Computational cost: Training the diffusion decoder on full‑resolution video (1080p) remains memory‑intensive; inference currently runs at ~2 fps on a single A100.
  • Limited editing scope: While effective for color, texture, and object insertion, the system struggles with large‑scale geometric transformations (e.g., changing camera viewpoint).
  • Future directions suggested by the authors include:
    1. Integrating 3‑D reference cues (depth maps, point clouds) for spatially aware edits.
    2. Exploring lightweight transformer variants for real‑time deployment.
    3. Extending the pipeline to multi‑modal references (audio + visual).

Authors

  • Yiqi Lin
  • Guoqiang Liang
  • Ziyun Zeng
  • Zechen Bai
  • Yanzhe Chen
  • Mike Zheng Shou

Paper Information

  • arXiv ID: 2603.02175v1
  • Categories: cs.CV, cs.AI
  • Published: March 2, 2026