[Paper] EasyV2V: A High-Quality Instruction-Based Video Editing Framework

Published: December 19, 2025, 3:59 AM GMT+9

7 min read

Source: arXiv - 2512.16920v1

Overview

The paper EasyV2V presents a simple framework for instruction-based video editing. By reusing existing image-editing experts, building on pretrained text-to-video models, and introducing a unified mask-based control scheme, the authors report high-quality, temporally consistent edits that outperform both academic baselines and commercial tools.

Key Contributions

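Data-centric recipe: Builds diverse video-editing pairs from image-editing experts, single-frame supervision, and pseudo-pairs with shared affine motion, and mines densely captioned clips to enrich the training data.
Lightweight model design: Shows that pretrained text-to-video diffusion models already encode editing knowledge, so conditional training needs only small LoRA (Low-Rank Adaptation) layers and simple sequence concatenation.
Unified spatiotemporal control: Introduces a single mask mechanism that handles spatial masks, temporal masks, and an optional reference image, enabling flexible input modes such as video + text, video + mask + text, and video + mask + reference + text.
Transition supervision: Trains the model to understand how an edit should unfold over time, improving smoothness and consistency across frames.
State-of-the-art performance: Outperforms concurrent work and leading commercial video-editing services on standard benchmarks while remaining computationally efficient.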

Methodology

Data Generation
- Expert composition: Combine off-the-shelf image editors (e.g., InstructPix2Pix) with fast inverse models to synthesize before/after image pairs.
- Lifting to video: Edit a single frame, then animate the original and edited frames with the same affine motion, creating pseudo video pairs without costly manual labeling.
- Dense caption mining: Crawl video datasets for clips that already have rich textual descriptions, turning them into natural instruction‑video pairs.
- Transition supervision: Add intermediate frames that gradually morph from source to target, teaching the network the temporal dynamics of edits.
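The "lifting to video" step can be illustrated with a minimal sketch. It assumes grayscale frames stored as nested lists, nearest-neighbour sampling, and a simple horizontal pan as the shared motion. The names `warp_affine`, `make_pseudo_pair`, and `pan_motion` are hypothetical and not taken from the paper; a real pipeline would use an array library and a learned or sampled motion model.

```python
from typing import List, Tuple

Frame = List[List[float]]  # H x W grayscale image
# Affine parameters (a, b, tx, c, d, ty): output pixel (x, y) samples the
# input at (a*x + b*y + tx, c*x + d*y + ty).
Affine = Tuple[float, float, float, float, float, float]


def warp_affine(img: Frame, m: Affine) -> Frame:
    """Inverse-warp an image with nearest-neighbour sampling and zero padding."""
    h, w = len(img), len(img[0])
    a, b, tx, c, d, ty = m
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx = round(a * x + b * y + tx)
            sy = round(c * x + d * y + ty)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = img[sy][sx]
    return out


def pan_motion(num_frames: int, dx: float) -> List[Affine]:
    """Hypothetical camera pan: frame t samples the image shifted by t*dx pixels."""
    return [(1.0, 0.0, t * dx, 0.0, 1.0, 0.0) for t in range(num_frames)]


def make_pseudo_pair(
    src: Frame, edited: Frame, motions: List[Affine]
) -> Tuple[List[Frame], List[Frame]]:
    """Animate the source image and its edited version with the SAME motion,
    so the two clips differ only by the edit."""
    src_clip = [warp_affine(src, m) for m in motions]
    tgt_clip = [warp_affine(edited, m) for m in motions]
    return src_clip, tgt_clip


if __name__ == "__main__":
    src = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    edited = [[v + 10.0 for v in row] for row in src]  # stand-in for an image editor's output
    src_clip, tgt_clip = make_pseudo_pair(src, edited, pan_motion(3, 1.0))
    print(len(src_clip), len(tgt_clip))
```

Because both clips share one motion path, any per-pixel difference between them comes from the edit alone, which is what makes the pair usable as supervision.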
Model Architecture
- Start from a pretrained text‑to‑video diffusion model (e.g., Stable Diffusion Video).
- Append a LoRA module (a few thousand trainable parameters) to adapt the model to the editing task.
- Condition the diffusion process by concatenating the source video frames, optional mask, optional reference image, and the instruction text into a single token sequence.
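The two ingredients above, a low-rank adapter on a frozen weight and conditioning by sequence concatenation, can be sketched in a few lines. This is a pure-Python illustration under stated assumptions, not the authors' implementation: `LoRALinear` and `build_condition_sequence` are hypothetical names, the tokens are placeholder strings, and the concatenation order simply follows the order listed in the text.

```python
import random
from typing import List, Optional, Sequence

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight: Matrix, rank: int, alpha: float = 1.0, seed: int = 0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]  # zero init: no change at start
        self.scale = alpha / rank

    def __call__(self, x: Vector) -> Vector:
        base = matvec(self.weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def num_trainable(self) -> int:
        """Only A and B are trained: r * (d_in + d_out) parameters."""
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)


def build_condition_sequence(
    source_tokens: Sequence[str],
    text_tokens: Sequence[str],
    mask_tokens: Optional[Sequence[str]] = None,
    reference_tokens: Optional[Sequence[str]] = None,
) -> List[str]:
    """Concatenate all conditioning inputs along the sequence axis (assumed order)."""
    seq = list(source_tokens)
    if mask_tokens is not None:
        seq += list(mask_tokens)
    if reference_tokens is not None:
        seq += list(reference_tokens)
    return seq + list(text_tokens)


if __name__ == "__main__":
    layer = LoRALinear([[1.0, 0.0], [0.0, 1.0]], rank=1)
    print(layer([2.0, 3.0]), layer.num_trainable())
    print(build_condition_sequence(["v0", "v1"], ["make it snowy"], mask_tokens=["m0"]))
```

With `B` initialized to zero, the adapted layer reproduces the pretrained layer exactly before training, so fine-tuning starts from the base model's behavior. The optional arguments mirror the video + text, video + mask + text, and video + mask + reference + text input modes.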
Control Mechanism
- A single binary mask indicates which pixels (and optionally which frames) should be altered.
- When a reference image is supplied, the mask also guides where the reference content should be injected.
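To make the mask semantics concrete, here is a small sketch of one binary mask video covering both the spatial and the temporal case. It assumes a rectangular region and a contiguous frame range; `make_mask` and `composite` are hypothetical helpers. In the framework the mask is a conditioning input to the model, so the compositing step below only illustrates what "1 = edit, 0 = keep" means and is not how the model applies it.

```python
from typing import List, Optional, Tuple

Video = List[List[List[float]]]  # T x H x W
Mask = List[List[List[int]]]     # T x H x W, values in {0, 1}


def make_mask(
    num_frames: int,
    height: int,
    width: int,
    box: Optional[Tuple[int, int, int, int]] = None,   # (top, left, bottom, right), half-open
    frame_range: Optional[Tuple[int, int]] = None,     # (start, stop), half-open
) -> Mask:
    """Build one binary mask video; None means 'no restriction' on that axis."""
    top, left, bottom, right = box if box is not None else (0, 0, height, width)
    start, stop = frame_range if frame_range is not None else (0, num_frames)
    return [
        [
            [
                1 if (start <= t < stop and top <= y < bottom and left <= x < right) else 0
                for x in range(width)
            ]
            for y in range(height)
        ]
        for t in range(num_frames)
    ]


def composite(source: Video, edited: Video, mask: Mask) -> Video:
    """Keep source pixels where mask == 0 and take edited pixels where mask == 1."""
    return [
        [
            [e if m else s for s, e, m in zip(s_row, e_row, m_row)]
            for s_row, e_row, m_row in zip(s_frame, e_frame, m_frame)
        ]
        for s_frame, e_frame, m_frame in zip(source, edited, mask)
    ]


if __name__ == "__main__":
    # Edit only the top-left pixel, and only in frames 1 and 2.
    mask = make_mask(3, 2, 2, box=(0, 0, 1, 1), frame_range=(1, 3))
    source = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(3)]
    edited = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(3)]
    print(composite(source, edited, mask))
```

Passing neither `box` nor `frame_range` yields an all-ones mask, which corresponds to the plain video + text mode where the whole clip may change.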
Training
- Use the constructed video pairs and transition frames.
- Optimize the LoRA parameters with a modest budget (≈ 1‑2 GPU days on a single A100).
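One way to picture the transition frames used in training is a per-frame blend between the source clip and the edited clip. The sketch below assumes a linear ramp and pixel-space blending, both of which are simplifications chosen for illustration; `transition_weights` and `make_transition_clip` are hypothetical names, and the summary does not specify how the authors construct these frames.

```python
from typing import List

Video = List[List[List[float]]]  # T x H x W


def transition_weights(num_frames: int, start: int, end: int) -> List[float]:
    """Blend weight per frame: 0 up to `start`, 1 from `end` on, linear in between."""
    if not start < end:
        raise ValueError("start must be smaller than end")
    return [min(1.0, max(0.0, (t - start) / (end - start))) for t in range(num_frames)]


def make_transition_clip(source: Video, target: Video, weights: List[float]) -> Video:
    """Frame t is (1 - w_t) * source_t + w_t * target_t."""
    if not (len(source) == len(target) == len(weights)):
        raise ValueError("clips and weights must have the same number of frames")
    return [
        [
            [(1.0 - w) * s + w * e for s, e in zip(s_row, e_row)]
            for s_row, e_row in zip(s_frame, e_frame)
        ]
        for s_frame, e_frame, w in zip(source, target, weights)
    ]


if __name__ == "__main__":
    weights = transition_weights(5, start=1, end=3)
    source = [[[0.0]] for _ in range(5)]
    target = [[[10.0]] for _ in range(5)]
    clip = make_transition_clip(source, target, weights)
    print(weights)
    print([frame[0][0] for frame in clip])
```

A clip built this way starts as the unedited source and ends as the fully edited target, giving the model an explicit example of an edit unfolding over time.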

Results and Findings

Metric (on standard video‑editing benchmarks)	EasyV2V	Prior SOTA	Commercial Tool
CLIP‑Score (semantic fidelity)	0.84	0.78	0.71
FVD (temporal consistency)	210	340	420
User preference (pairwise)	71 %	29 %	—

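High semantic alignment: Edited videos match the text instruction more closely than the baselines.
Improved temporal smoothness: The lower FVD indicates fewer flicker artifacts and more consistent motion.
Human study: More than 70 % of participants preferred EasyV2V outputs over competing methods.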
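The size of these gaps is easier to judge as relative changes. The short calculation below uses only the numbers from the table above; `relative_gain` is a hypothetical helper, and treating CLIP-Score as higher-is-better and FVD as lower-is-better is the standard reading of those metrics.

```python
def relative_gain(ours: float, baseline: float, higher_is_better: bool) -> float:
    """Improvement over the baseline, as a percentage of the baseline value."""
    diff = ours - baseline if higher_is_better else baseline - ours
    return 100.0 * diff / baseline


if __name__ == "__main__":
    rows = [
        ("CLIP-Score vs prior SOTA", 0.84, 0.78, True),
        ("CLIP-Score vs commercial tool", 0.84, 0.71, True),
        ("FVD vs prior SOTA", 210.0, 340.0, False),
        ("FVD vs commercial tool", 210.0, 420.0, False),
    ]
    for name, ours, baseline, higher in rows:
        print(f"{name}: {relative_gain(ours, baseline, higher):.1f} % better")
```

On the reported figures, the FVD reductions (roughly 38 % and 50 %) are much larger in relative terms than the CLIP-Score gains (roughly 8 % and 18 %).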

Qualitative examples (e.g., "turn a daytime street into night while keeping the moving car") show crisp object changes, consistent lighting shifts, and smooth transitions.

Practical Implications

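Content creation pipelines: Video editors can now script edits with natural language and optional masks, greatly reducing manual keyframing.
Rapid prototyping for AR/VR: Developers can generate scene variants (e.g., "add snow") in real time without re-rendering entire assets.
E-learning and marketing: Automated video personalization (brand colors, product overlays) becomes possible with just a few lines of instructions.
Low compute footprint: Because only the LoRA layers need fine-tuning, companies can adapt the model to domain-specific vocabulary (e.g., medical imaging) without large GPU clusters.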

Limitations and Future Work

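Editing scope: The framework excels at global style or object-level changes but struggles with very fine-grained geometric modifications (e.g., precise facial reenactment).
Mask granularity: A single mask works in many cases, but complex multi-object edits may require hierarchical masking, which is not yet supported.
Dataset bias: The training data is derived from existing image-editing models and is likely to inherit their biases and failure modes.
Future directions: The authors propose extending to 3D-aware video editing, incorporating depth cues to improve occlusion handling, and exploring interactive mask-refinement tools for end users.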

Authors

Jinjie Mai
Chaoyang Wang
Guocheng Gordon Qian
Willi Menapace
Sergey Tulyakov
Bernard Ghanem
Peter Wonka
Ashkan Mirzaei

Paper Information

arXiv ID: 2512.16920v1
Categories: cs.CV, cs.AI
Published: December 18, 2025
