[Paper] FlexAM: Flexible Appearance-Motion Decomposition for Versatile Video Generation Control
Source: arXiv - 2602.13185v1
Overview
FlexAM tackles one of the toughest problems in generative AI: giving developers fine‑grained, yet intuitive, control over video synthesis. By cleanly separating appearance (what things look like) from motion (how they move), the framework lets you edit, remix, or generate videos with just a few high‑level signals—making video generation far more practical for real‑world products.
Key Contributions
- 3‑D control signal as a point cloud: Encodes the full spatio‑temporal dynamics of a video in a single, manipulable structure.
- Multi‑frequency positional encoding: Captures both coarse and subtle motion cues, enabling precise edits without sacrificing smoothness.
- Depth‑aware encoding: Incorporates scene geometry so that motion respects occlusions and perspective changes.
- Flexible precision‑quality trade‑off: A tunable control representation that lets users prioritize exact motion fidelity or higher visual quality on demand.
- Unified pipeline for diverse tasks: Handles image‑to‑video (I2V), video‑to‑video (V2V) editing, camera path control, and localized object manipulation within a single model.
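To make the point‑cloud control signal concrete, the sketch below samples one (x, y, depth, t) row per pixel per frame. All names, the sampling stride, and the array layout are illustrative assumptions; the paper's actual construction may differ:

```python
import numpy as np

def build_control_points(height, width, num_frames, depth_maps, stride=8):
    """Illustrative control point cloud: one (x, y, depth, t) row per
    sampled pixel per frame. `depth_maps` has shape (T, H, W).
    The stride and layout are our assumptions, not FlexAM's exact scheme."""
    ys, xs = np.mgrid[0:height:stride, 0:width:stride]
    ys, xs = ys.ravel(), xs.ravel()
    points = []
    for t in range(num_frames):
        depth = depth_maps[t, ys, xs]                      # per-point depth
        t_col = np.full_like(xs, t, dtype=np.float32)      # time coordinate
        points.append(np.stack([xs, ys, depth, t_col], axis=1))
    return np.concatenate(points, axis=0)                  # shape: (T * N, 4)
```

Because every frame contributes the same sampled grid, the whole clip collapses into one flat array that downstream layers can treat as a single manipulable structure.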
Methodology
FlexAM builds on a diffusion‑based video generator but replaces the usual 2‑D conditioning (e.g., optical flow or keyframes) with a 3‑D point‑cloud control signal:
- Control Point Cloud Creation – For each frame, the method samples points in 3‑D space that encode pixel positions, depth, and time.
- Positional Encoding Layers –
  - Multi‑frequency: Applies sinusoidal embeddings at several frequencies, allowing the network to differentiate fast, jittery motions from slow, sweeping gestures.
  - Depth‑aware: Adds depth‑scaled embeddings so that points farther away receive a different signal, preserving correct parallax and occlusion.
- Appearance‑Motion Decoder – The diffusion model receives two streams: (a) a static appearance embedding derived from a reference image or frame, and (b) the dynamic control point cloud. The decoder learns to re‑compose these streams into coherent video frames.
- Flexibility Mechanism – A scalar weight can be tuned at inference time to bias the model toward either stricter adherence to the control points (high precision) or smoother, higher‑fidelity textures (high quality).
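The two encoding layers can be sketched as follows. The inverse‑depth scaling rule in the second function is our illustrative assumption about how "depth‑scaled embeddings" might work, not the paper's exact formulation:

```python
import numpy as np

def multifreq_encoding(coords, num_freqs=6):
    """Sinusoidal embedding at frequencies 2^0 .. 2^(L-1), so slow sweeps
    and fast jitter land in different channels."""
    freqs = 2.0 ** np.arange(num_freqs)            # (L,)
    angles = coords[..., None] * freqs             # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*coords.shape[:-1], -1)     # (..., D * 2L)

def depth_aware_encoding(coords_xy, depth, num_freqs=6):
    """Assumed depth-aware variant: scale spatial coordinates by inverse
    depth before encoding, so distant points receive a compressed,
    parallax-consistent signal. The scaling rule is our illustration."""
    scaled = coords_xy / (depth[..., None] + 1e-6)
    return multifreq_encoding(scaled, num_freqs)
```

Low frequencies respond to coarse, sweeping displacements while high frequencies resolve fine jitter, which is what lets the model edit subtle motion without destroying smooth trajectories.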
All components are trained end‑to‑end on large video datasets, but the control signal itself is task‑agnostic, meaning the same model can be reused for many downstream editing scenarios.
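The flexibility mechanism reads like a guidance‑style blend. A minimal sketch, assuming a classifier‑free‑guidance‑like interpolation between control‑conditioned and unconditional denoiser outputs (the paper's actual mechanism may differ):

```python
def flexible_denoise(noise_pred_ctrl, noise_pred_uncond, precision_weight):
    """Tunable precision/quality trade-off, sketched as a guidance-style
    blend: a high weight biases toward strict adherence to the control
    points; a low weight favors smoother, texture-preserving output.
    This interpolation is our assumption, not FlexAM's exact rule."""
    return noise_pred_uncond + precision_weight * (noise_pred_ctrl - noise_pred_uncond)
```

Because the weight enters only at sampling time, a user can re-run inference at several settings and pick the precision/quality balance that suits the edit, without retraining.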
Results & Findings
| Task | Metric | FlexAM | vs. prior art |
|---|---|---|---|
| I2V synthesis | FID (lower is better) | 12.3 | 30 % improvement |
| V2V motion transfer | LPIPS (lower is better) | 0.18 | 22 % reduction |
| Camera path editing | PSNR (higher is better) | 28.7 dB | +3.5 dB |
| Local object edit | IoU (higher is better) | 0.71 | +0.09 |
- Consistent quality across tasks: FlexAM outperformed specialized baselines even when those baselines were tuned for a single task.
- User study: 85 % of participants preferred FlexAM‑generated edits for realism and controllability.
- Ablation: Removing depth‑aware encoding caused a 15 % drop in motion consistency; dropping multi‑frequency encoding degraded fine‑grained motion fidelity by ~20 %.
Practical Implications
- Content creation pipelines: Video editors can now replace manual keyframing with a single point‑cloud sketch, dramatically speeding up motion retargeting and style transfer.
- AR/VR experiences: Developers can generate immersive video backdrops that react to user‑controlled camera rigs without re‑training per scene.
- Automated video personalization: Brands can inject product appearances into existing footage while preserving original motion, enabling mass‑customized ads.
- Game asset generation: Procedural animation pipelines can use FlexAM to synthesize realistic character motions from simple pose clouds, reducing reliance on motion‑capture data.
Limitations & Future Work
- Data‑intensive training: The model still requires large, diverse video corpora to learn robust appearance‑motion disentanglement.
- Control granularity: Extremely high‑frequency motions (e.g., fast‑moving particles) can be under‑represented if the point cloud density is low.
- Real‑time inference: Current diffusion sampling is not yet optimized for low‑latency applications; the authors suggest exploring accelerated samplers or distillation techniques.
Future research directions include extending the control signal to incorporate semantic cues (e.g., object labels), improving efficiency for on‑device deployment, and exploring cross‑modal conditioning such as audio‑driven motion control.
Authors
- Mingzhi Sheng
- Zekai Gu
- Peng Li
- Cheng Lin
- Hao‑Xiang Guo
- Ying‑Cong Chen
- Yuan Liu
Paper Information
- arXiv ID: 2602.13185v1
- Categories: cs.CV, cs.GR
- Published: February 13, 2026