[Paper] SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Published: March 19, 2026 at 01:59 PM EDT
2 min read
Source: arXiv - 2603.19228v1

Overview

Current instruction-guided video editing models struggle to balance precise semantic modification with faithful motion preservation. Existing approaches mitigate this by injecting explicit external priors (e.g., VLM features or structural conditions), but that reliance severely bottlenecks robustness and generalization.

To overcome this limitation, the authors present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, Semantic Anchoring establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), so the model internalizes temporal dynamics directly from raw videos.

SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Notably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.
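
The three motion-centric pretext tasks named above (cube inpainting, speed perturbation, tube shuffle) are easy to picture as corruption operators applied to a raw clip. Below is a minimal PyTorch sketch under assumed settings: the video is a tensor of shape (T, C, H, W), and the cube sizes, frame stride, and tube length are illustrative choices, not the paper's.

```python
import torch

def cube_inpainting(v, cube_t=4, cube_h=32, cube_w=32):
    """Zero out a random spatio-temporal cube; the model must restore it.
    Cube dimensions here are illustrative, not the paper's settings."""
    T, C, H, W = v.shape
    t0 = torch.randint(0, T - cube_t + 1, (1,)).item()
    h0 = torch.randint(0, H - cube_h + 1, (1,)).item()
    w0 = torch.randint(0, W - cube_w + 1, (1,)).item()
    corrupted = v.clone()
    corrupted[t0:t0 + cube_t, :, h0:h0 + cube_h, w0:w0 + cube_w] = 0.0
    return corrupted

def speed_perturbation(v, stride=2):
    """Subsample frames to simulate a speed change; the model must
    recover the original temporal rate."""
    return v[::stride]

def tube_shuffle(v, tube_t=4):
    """Split the clip into temporal tubes and shuffle their order;
    the model must restore the correct ordering."""
    T = v.shape[0]
    n = T // tube_t
    tubes = v[: n * tube_t].reshape(n, tube_t, *v.shape[1:])
    perm = torch.randperm(n)
    return tubes[perm].reshape(n * tube_t, *v.shape[1:])

clip = torch.randn(16, 3, 256, 256)  # toy 16-frame clip
x = tube_shuffle(cube_inpainting(clip))
```

A model trained to restore the original clip from such corruptions must reason about temporal ordering and dynamics, which is the intuition behind Motion Alignment.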

Key Contributions

Based on the abstract, the paper's main contributions are:

  • SAMA, a framework that factorizes instruction-guided video editing into semantic anchoring and motion modeling, removing the dependence on explicit external priors (e.g., VLM features or structural conditions).
  • Semantic Anchoring, which jointly predicts semantic tokens and video latents at sparse anchor frames for instruction-aware structural planning.
  • Motion Alignment, which pre-trains the same backbone on motion-centric restoration pretext tasks: cube inpainting, speed perturbation, and tube shuffle.
  • A two-stage pipeline in which factorized pre-training, requiring no paired video-instruction editing data, already yields strong zero-shot editing before supervised fine-tuning.
  • State-of-the-art performance among open-source models, competitive with commercial systems such as Kling-Omni.

Methodology

SAMA factorizes instruction-guided editing into two components that share one backbone:

  • Semantic Anchoring: jointly predicts semantic tokens and video latents at sparse anchor frames, establishing a reliable visual anchor for purely instruction-aware structural planning, without external priors such as VLM features or structural conditions.
  • Motion Alignment: pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), so temporal dynamics are internalized directly from raw videos.

Optimization is two-stage: factorized pre-training, which learns semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. A schematic sketch of this pipeline appears below; refer to the full paper for architectural details.
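
To make the two-stage pipeline concrete, here is a schematic sketch. Everything about the model interface is assumed for illustration: predict_anchors, restore, and edit_loss are hypothetical methods standing in for the paper's actual objectives, the four anchor frames and the 0.5 loss weight are arbitrary, and the corruption functions reuse the sketch in the Overview.

```python
import torch

def pretrain_step(model, clip, optimizer):
    """Stage 1, factorized pre-training: no paired instruction data.
    Combines a semantic-anchoring objective at sparse anchor frames with
    motion-centric restoration. All model methods here are hypothetical."""
    # Pick 4 evenly spaced anchor frames (count is an illustrative choice).
    anchor_idx = torch.linspace(0, clip.shape[0] - 1, steps=4).long()
    sem_loss = model.predict_anchors(clip, anchor_idx)   # semantic tokens + latents
    corrupted = tube_shuffle(cube_inpainting(clip))      # pretext corruption (see above)
    motion_loss = model.restore(corrupted, target=clip)  # motion alignment
    loss = sem_loss + 0.5 * motion_loss                  # weighting is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def sft_step(model, clip, instruction, edited_clip, optimizer):
    """Stage 2: supervised fine-tuning on paired (video, instruction, edit) data."""
    loss = model.edit_loss(clip, instruction, target=edited_clip)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```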

Practical Implications

By learning semantic structure and temporal dynamics directly from raw videos rather than from injected external priors, SAMA sidesteps the robustness and generalization bottlenecks of prior-dependent approaches. Its factorized pre-training needs no paired video-instruction editing data yet already delivers zero-shot editing, which lowers the data cost of building such systems. The authors state that code, models, and datasets will be released.

Authors

  • Xinyao Zhang
  • Wenkai Dong
  • Yuxin Song
  • Bo Fang
  • Qi Zhang
  • Jing Wang
  • Fan Chen
  • Hui Zhang
  • Haocheng Feng
  • Yu Lu
  • Hang Zhou
  • Chun Yuan
  • Jingdong Wang

Paper Information

  • arXiv ID: 2603.19228v1
  • Categories: cs.CV
  • Published: March 19, 2026
  • PDF: https://arxiv.org/pdf/2603.19228v1
