[Paper] FlexAM: Flexible Appearance-Motion Decomposition for Versatile Video Generation Control
Source: arXiv - 2602.13185v1
Overview
FlexAM tackles one of the toughest problems in generative AI: giving developers fine‑grained, yet intuitive, control over video synthesis. By cleanly separating appearance (what things look like) from motion (how they move), the framework lets you edit, remix, or generate videos with just a few high‑level signals—making video generation far more practical for real‑world products.
Key Contributions
- 3‑D control signal as a point cloud: Encodes the full spatio‑temporal dynamics of a video in a single, manipulable structure.
- Multi‑frequency positional encoding: Captures both coarse and subtle motion cues, enabling precise edits without sacrificing smoothness.
- Depth‑aware encoding: Incorporates scene geometry so that motion respects occlusions and perspective changes.
- Flexible precision‑quality trade‑off: A tunable control representation that lets users prioritize exact motion fidelity or higher visual quality on demand.
- Unified pipeline for diverse tasks: Handles image‑to‑video (I2V), video‑to‑video (V2V) editing, camera path control, and localized object manipulation within a single model.
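To make the point‑cloud control signal concrete, the sketch below samples one (x, y, depth, t) row per pixel per frame. All names, the sampling stride, and the array layout are illustrative assumptions; the paper's actual construction may differ:

```python
import numpy as np

def build_control_points(height, width, num_frames, depth_maps, stride=8):
    """Illustrative control point cloud: one (x, y, depth, t) row per
    sampled pixel per frame. `depth_maps` has shape (T, H, W).
    The stride and layout are our assumptions, not FlexAM's exact scheme."""
    ys, xs = np.mgrid[0:height:stride, 0:width:stride]
    ys, xs = ys.ravel(), xs.ravel()
    points = []
    for t in range(num_frames):
        depth = depth_maps[t, ys, xs]                      # per-point depth
        t_col = np.full_like(xs, t, dtype=np.float32)      # time coordinate
        points.append(np.stack([xs, ys, depth, t_col], axis=1))
    return np.concatenate(points, axis=0)                  # shape: (T * N, 4)
```

Because every frame contributes the same sampled grid, the whole clip collapses into one flat array that downstream layers can treat as a single manipulable structure.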
Methodology
FlexAM builds on a diffusion‑based video generator but replaces the usual 2‑D conditioning (e.g., optical flow or keyframes) with a 3‑D point‑cloud control signal:
- Control Point Cloud Creation – For each frame, the method samples points in 3‑D space that encode pixel positions, depth, and time.
- Positional Encoding Layers –
  - Multi‑frequency: Applies sinusoidal embeddings at several frequencies, allowing the network to differentiate fast, jittery motions from slow, sweeping gestures.
  - Depth‑aware: Adds depth‑scaled embeddings so that points farther away receive a different signal, preserving correct parallax and occlusion.
- Appearance‑Motion Decoder – The diffusion model receives two streams: (a) a static appearance embedding derived from a reference image or frame, and (b) the dynamic control point cloud. The decoder learns to re‑compose these streams into coherent video frames.
- Flexibility Mechanism – A scalar weight can be tuned at inference time to bias the model toward either stricter adherence to the control points (high precision) or smoother, higher‑fidelity textures (high quality).
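The two encoding layers can be sketched as follows. The inverse‑depth scaling rule in the second function is our illustrative assumption about how "depth‑scaled embeddings" might work, not the paper's exact formulation:

```python
import numpy as np

def multifreq_encoding(coords, num_freqs=6):
    """Sinusoidal embedding at frequencies 2^0 .. 2^(L-1), so slow sweeps
    and fast jitter land in different channels."""
    freqs = 2.0 ** np.arange(num_freqs)            # (L,)
    angles = coords[..., None] * freqs             # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*coords.shape[:-1], -1)     # (..., D * 2L)

def depth_aware_encoding(coords_xy, depth, num_freqs=6):
    """Assumed depth-aware variant: scale spatial coordinates by inverse
    depth before encoding, so distant points receive a compressed,
    parallax-consistent signal. The scaling rule is our illustration."""
    scaled = coords_xy / (depth[..., None] + 1e-6)
    return multifreq_encoding(scaled, num_freqs)
```

Low frequencies respond to coarse, sweeping displacements while high frequencies resolve fine jitter, which is what lets the model edit subtle motion without destroying smooth trajectories.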
All components are trained end‑to‑end on large video datasets, but the control signal itself is task‑agnostic, meaning the same model can be reused for many downstream editing scenarios.
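The flexibility mechanism reads like a guidance‑style blend. A minimal sketch, assuming a classifier‑free‑guidance‑like interpolation between control‑conditioned and unconditional denoiser outputs (the paper's actual mechanism may differ):

```python
def flexible_denoise(noise_pred_ctrl, noise_pred_uncond, precision_weight):
    """Tunable precision/quality trade-off, sketched as a guidance-style
    blend: a high weight biases toward strict adherence to the control
    points; a low weight favors smoother, texture-preserving output.
    This interpolation is our assumption, not FlexAM's exact rule."""
    return noise_pred_uncond + precision_weight * (noise_pred_ctrl - noise_pred_uncond)
```

Because the weight enters only at sampling time, a user can re-run inference at several settings and pick the precision/quality balance that suits the edit, without retraining.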
Results & Findings
| Task | Metric | FlexAM | vs. prior art |
|---|---|---|---|
| I2V synthesis | FID (lower is better) | 12.3 | 30 % improvement |
| V2V motion transfer | LPIPS (lower is better) | 0.18 | 22 % reduction |
| Camera path editing | PSNR (higher is better) | 28.7 dB | +3.5 dB |
| Local object edit | IoU (higher is better) | 0.71 | +0.09 |
- Consistent quality across tasks: FlexAM outperformed specialized baselines even when those baselines were tuned for a single task.
- User study: 85 % of participants preferred FlexAM‑generated edits for realism and controllability.
- Ablation: Removing depth‑aware encoding caused a 15 % drop in motion consistency; dropping multi‑frequency encoding degraded fine‑grained motion fidelity by ~20 %.
Practical Implications
- Content creation pipelines: Video editors can now replace manual keyframing with a single point‑cloud sketch, dramatically speeding up motion retargeting and style transfer.
- AR/VR experiences: Developers can generate immersive video backdrops that react to user‑controlled camera rigs without re‑training per scene.
- Automated video personalization: Brands can inject product appearances into existing footage while preserving original motion, enabling mass‑customized ads.
- Game asset generation: Procedural animation pipelines can use FlexAM to synthesize realistic character motions from simple pose clouds, reducing reliance on motion‑capture data.
Limitations & Future Work
- Data‑intensive training: The model still requires large, diverse video corpora to learn robust appearance‑motion disentanglement.
- Control granularity: Extremely high‑frequency motions (e.g., fast‑moving particles) can be under‑represented if the point cloud density is low.
- Real‑time inference: Current diffusion sampling is not yet optimized for low‑latency applications; the authors suggest exploring accelerated samplers or distillation techniques.
Future research directions include extending the control signal to incorporate semantic cues (e.g., object labels), improving efficiency for on‑device deployment, and exploring cross‑modal conditioning such as audio‑driven motion control.
Authors
- Mingzhi Sheng
- Zekai Gu
- Peng Li
- Cheng Lin
- Hao‑Xiang Guo
- Ying‑Cong Chen
- Yuan Liu
Paper Information
- arXiv ID: 2602.13185v1
- Categories: cs.CV, cs.GR
- Published: February 13, 2026