[Paper] OpenVE-3M: A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing
Source: arXiv - 2512.07826v1
Overview
The paper introduces OpenVE-3M, the first open‑source, large‑scale, high‑quality dataset specifically designed for instruction‑guided video editing. By covering a wide spectrum of edit types—from global style changes to precise object insertions—the dataset fills a critical gap that has limited progress on video‑editing AI models. The authors also release a benchmark (OpenVE‑Bench) and a 5‑billion‑parameter model (OpenVE‑Edit) that set new performance records on this benchmark.
Key Contributions
- OpenVE‑3M dataset: 3 million video‑edit pairs with human‑readable edit instructions, spanning 8 distinct edit categories (both spatially‑aligned and non‑aligned).
- Rigorous data pipeline: Automated generation, multi‑stage quality filtering, and human verification to ensure high visual fidelity and instruction relevance.
- OpenVE‑Bench: A curated benchmark of 431 video‑edit pairs with three evaluation metrics (temporal consistency, edit accuracy, and perceptual quality) that correlate strongly with human judgments.
- OpenVE‑Edit model: A 5 B‑parameter instruction‑guided video‑editing model trained on OpenVE‑3M, achieving state‑of‑the‑art results on OpenVE‑Bench and outperforming a 14 B baseline from prior open‑source work.
- Open‑source release: All data, code, and model weights are publicly available, encouraging reproducibility and community‑driven extensions.
Methodology
1. Data Generation
- Start with a pool of royalty‑free, high‑resolution video clips.
- Apply a suite of deterministic video manipulation operators (e.g., color grading, background replacement, object insertion/removal, subtitle editing).
- For each manipulation, automatically synthesize a natural‑language instruction describing the desired edit (a sketch of this loop follows below).
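The pair‑generation loop can be sketched roughly as follows. The operator interface, the stand‑in lambda, and the instruction templates are illustrative placeholders, not the authors' actual implementation.

```python
import random

# Illustrative instruction templates; the paper's real templates are not reproduced here.
INSTRUCTION_TEMPLATES = {
    "color_grade":   "Apply a {style} color grade to the whole video.",
    "replace_bg":    "Replace the background with {scene}.",
    "insert_object": "Insert a {obj} into the scene near the {location}.",
    "edit_subtitle": "Change the subtitle text to '{text}'.",
}

def generate_pair(source_clip, edit_type, operator, params):
    """Apply one deterministic edit operator and synthesize its instruction.

    `operator` is any callable(clip, **params) -> edited clip; the real pipeline
    presumably wraps tools for color grading, background replacement, etc.
    """
    edited_clip = operator(source_clip, **params)
    instruction = INSTRUCTION_TEMPLATES[edit_type].format(**params)
    return {"source": source_clip, "edited": edited_clip,
            "instruction": instruction, "category": edit_type}

# Toy usage with a stand-in operator (a real operator would return edited frames).
pair = generate_pair(
    source_clip="clip_000123.mp4",
    edit_type="color_grade",
    operator=lambda clip, style: f"{clip}::graded({style})",
    params={"style": random.choice(["teal-and-orange", "film noir", "vintage"])},
)
print(pair["instruction"])
```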
2. Quality Filtering
- Automated checks: Detect visual artifacts, temporal jitter, and mismatched audio‑visual sync using pretrained perception models.
- Human review: A small team validates a random sample for instruction‑edit alignment, discarding outliers (a minimal filtering sketch follows below).
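A minimal sketch of this two‑stage filter, assuming each automated check is a callable that wraps a pretrained perception model and returns a score in [0, 1]; the scorer names, thresholds, and review sampling rate are hypothetical.

```python
import random

def passes_automated_checks(pair, scorers, thresholds):
    """Keep a pair only if every automated quality score clears its threshold."""
    return all(scorers[name](pair) >= thresholds[name] for name in thresholds)

def filter_dataset(pairs, scorers, thresholds, human_sample_rate=0.01, rng=random):
    kept = [p for p in pairs if passes_automated_checks(p, scorers, thresholds)]
    # A small random sample of the surviving pairs goes to human reviewers
    # for instruction-edit alignment checks.
    n_review = max(1, int(len(kept) * human_sample_rate)) if kept else 0
    review_queue = rng.sample(kept, n_review) if n_review else []
    return kept, review_queue

# Toy usage with dummy scorers standing in for pretrained perception models.
scorers = {"artifact": lambda p: 0.9, "jitter": lambda p: 0.8, "av_sync": lambda p: 0.95}
thresholds = {"artifact": 0.7, "jitter": 0.7, "av_sync": 0.9}
kept, review_queue = filter_dataset([{"id": i} for i in range(100)], scorers, thresholds)
print(len(kept), len(review_queue))
```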
3. Benchmark Construction (OpenVE‑Bench)
- Sample a balanced subset covering all edit categories.
- Obtain three human‑rated scores per video: Temporal Consistency, Edit Accuracy, and Perceptual Quality.
- Derive composite automated metrics that align with these human scores (see the correlation sketch below).
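A small worked example of the alignment step: an automated metric is validated by how strongly it correlates with the human ratings it is meant to track. The paper's exact composite‑metric formulation is not given in this summary, and the scores below are made up for illustration.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between an automated metric and human ratings."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Hypothetical per-clip scores: the automated Temporal Consistency metric
# should track the human TC ratings closely to be usable for evaluation.
human_tc = [0.90, 0.72, 0.85, 0.60, 0.95]
auto_tc  = [0.88, 0.70, 0.83, 0.65, 0.93]
print(f"TC metric vs. human correlation: {pearson(auto_tc, human_tc):.3f}")
```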
4. Model Training (OpenVE‑Edit)
- Architecture: A diffusion‑based video generator conditioned on both the source video and the textual instruction.
- Training regime: 5 B parameters, mixed‑precision training on 64 A100 GPUs for ~2 weeks.
- Curriculum: Start with simpler global edits, then gradually introduce more complex local and non‑aligned edits (a curriculum sketch follows below).
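The curriculum can be pictured as a schedule that widens the pool of edit categories as training progresses. The category names and step boundaries below are assumptions for illustration, not the paper's actual schedule.

```python
import random

# (fraction of total steps completed, edit categories available from that point)
CURRICULUM = [
    (0.0, ["global_style", "color_grade"]),
    (0.3, ["global_style", "color_grade", "background_swap", "subtitle_edit"]),
    (0.6, ["global_style", "color_grade", "background_swap", "subtitle_edit",
           "object_insert", "object_remove", "non_aligned_motion", "camera_change"]),
]

def categories_for_step(step, total_steps):
    """Return the edit categories unlocked at this point in training."""
    progress = step / total_steps
    available = CURRICULUM[0][1]
    for threshold, cats in CURRICULUM:
        if progress >= threshold:
            available = cats
    return available

def sample_batch_categories(step, total_steps, batch_size, rng=random):
    cats = categories_for_step(step, total_steps)
    return [rng.choice(cats) for _ in range(batch_size)]

# Early in training only global edits are sampled; later batches mix in local
# and non-aligned edits.
print(sample_batch_categories(step=10_000, total_steps=100_000, batch_size=4))
```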
Results & Findings
| Metric (higher is better) | OpenVE‑Edit (5 B) | Prior Open‑Source 14 B | Human Upper Bound |
|---|---|---|---|
| Temporal Consistency (TC) | 0.84 | 0.78 | 0.92 |
| Edit Accuracy (EA) | 0.81 | 0.73 | 0.89 |
| Perceptual Quality (PQ) | 0.86 | 0.80 | 0.94 |
- OpenVE‑Edit outperforms the larger 14 B baseline across all three metrics, demonstrating that data quality and diversity can outweigh sheer model size.
- Human evaluation shows the model’s outputs are within roughly 10% of the human upper bound on every metric, a notable result for a 5 B model (a quick check against the table appears below).
- Ablation studies confirm that each edit category contributes uniquely; removing non‑spatially‑aligned edits drops overall performance by ~6 %.
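A quick arithmetic check of the "within roughly 10%" claim, using the scores from the table above:

```python
# Relative gap between OpenVE-Edit (5 B) and the human upper bound, per metric.
model = {"TC": 0.84, "EA": 0.81, "PQ": 0.86}
human = {"TC": 0.92, "EA": 0.89, "PQ": 0.94}

for metric in model:
    gap = 1 - model[metric] / human[metric]
    print(f"{metric}: {gap:.1%} below the human upper bound")
# Prints roughly TC: 8.7%, EA: 9.0%, PQ: 8.5% -- all within about 10%.
```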
Practical Implications
- Rapid prototyping of video effects: Developers can integrate OpenVE‑Edit into content‑creation pipelines to automatically apply style transfers, background swaps, or subtitle updates from plain‑text commands (a hypothetical integration sketch follows after this list).
- Scalable video personalization: Marketing platforms can generate thousands of customized video ads (e.g., brand‑specific color palettes) without manual editing.
- Enhanced video‑editing tools: Existing desktop or cloud‑based editors can expose a “natural‑language edit” button, lowering the barrier for non‑technical creators.
- Research acceleration: OpenVE‑Bench provides a standardized yardstick, enabling fair comparison of future instruction‑guided video models.
- Cost‑effective deployment: Since the state‑of‑the‑art performance is achieved with a 5 B model, inference can run on a single high‑end GPU or even on optimized inference hardware, making SaaS offerings more affordable.
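A hypothetical integration sketch for such a pipeline. The class name `OpenVEEditPipeline`, its `from_pretrained` loader, and the `edit`/`save` methods are placeholders, not the released codebase's actual API.

```python
from dataclasses import dataclass

@dataclass
class EditRequest:
    video_path: str
    instruction: str
    output_path: str

def run_batch(pipeline, requests):
    """Apply a plain-text edit instruction to each source video."""
    for req in requests:
        edited = pipeline.edit(video=req.video_path, instruction=req.instruction)
        edited.save(req.output_path)

# Example: generating brand-specific variants of one ad template at scale.
requests = [
    EditRequest("ad_template.mp4",
                "Recolor the product packaging to match a deep-blue brand palette.",
                "ad_brand_a.mp4"),
    EditRequest("ad_template.mp4",
                "Replace the background with a sunny beach scene.",
                "ad_brand_b.mp4"),
]
# run_batch(OpenVEEditPipeline.from_pretrained("OpenVE-Edit-5B"), requests)  # hypothetical loader
```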
Limitations & Future Work
- Domain bias: The source videos are primarily royalty‑free clips; performance on highly cinematic or user‑generated content (e.g., shaky phone footage) may degrade.
- Instruction length: While the dataset includes longer prompts, extremely complex multi‑step instructions are under‑represented.
- Audio handling: Current pipeline focuses on visual edits; synchronized audio transformations (e.g., voice‑over replacement) are not covered.
- Real‑time editing: Inference latency is still on the order of seconds per short clip; achieving true real‑time editing remains an open challenge.
Future work could expand the dataset to cover diverse filming conditions, incorporate multimodal (audio‑visual) edit instructions, and explore model distillation techniques to further shrink latency without sacrificing quality.
Authors
- Haoyang He
- Jie Wang
- Jiangning Zhang
- Zhucun Xue
- Xingyuan Bu
- Qiangpeng Yang
- Shilei Wen
- Lei Xie
Paper Information
- arXiv ID: 2512.07826v1
- Categories: cs.CV
- Published: December 8, 2025