[Paper] DisMo: Disentangled Motion Representations for Open-World Motion Transfer

Published: November 28, 2025 at 01:25 PM EST
4 min read
Source: arXiv - 2511.23428v1

Overview

The paper introduces DisMo, a new way to learn motion representations that are completely disentangled from visual appearance. By training on raw video clips with a simple image‑space reconstruction loss, DisMo captures the essence of motion—independent of object shape, texture, or pose—making it possible to transfer that motion to any target content, even across wildly different categories. This opens up a more flexible, open‑world workflow for developers building text‑to‑video, image‑to‑video, or animation tools.

Key Contributions

  • Fully disentangled motion embeddings that separate dynamics from static visual cues (appearance, identity, pose).
  • Open‑world motion transfer: motion can be applied to semantically unrelated subjects without needing explicit correspondences.
  • Model‑agnostic adapters: the learned motion vectors can be plugged into any existing video generator (e.g., diffusion‑based T2V/I2V models) with minimal extra parameters (a minimal adapter sketch follows this list).
  • State‑of‑the‑art zero‑shot action classification on benchmarks (Something‑Something v2, Jester), outperforming recent video representation models such as V‑JEPA.
  • Unified training objective (image‑space reconstruction) that avoids the complex adversarial or contrastive losses used in prior work.
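The adapter idea above can be pictured as a small learned projection from DisMo's motion space into whatever conditioning space a given video generator already exposes. The sketch below is a minimal illustration, not the paper's implementation; the class name MotionAdapter, the dimensions, and the bottleneck MLP are all assumptions.

```python
import torch
import torch.nn as nn

class MotionAdapter(nn.Module):
    """Tiny bottleneck adapter: maps a frozen DisMo motion embedding into
    the conditioning space of an existing video generator. Module name,
    sizes, and architecture are illustrative assumptions."""

    def __init__(self, motion_dim: int = 512, cond_dim: int = 1024, hidden: int = 128):
        super().__init__()
        # A low-rank bottleneck keeps the added parameter count small.
        self.proj = nn.Sequential(
            nn.Linear(motion_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, cond_dim),
        )

    def forward(self, motion_emb: torch.Tensor) -> torch.Tensor:
        # motion_emb: (batch, motion_dim) from the frozen motion encoder.
        return self.proj(motion_emb)   # (batch, cond_dim) conditioning signal


# Usage sketch: the generator and motion encoder stay frozen; only the
# adapter's few hundred thousand parameters are trained.
adapter = MotionAdapter()
motion_emb = torch.randn(2, 512)       # placeholder motion embeddings
cond = adapter(motion_emb)             # passed to the generator as extra conditioning
```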

Methodology

  1. Data & Objective – DisMo trains on uncurated video clips. For each clip, the model predicts the next frame given the current frame and a latent motion code. The loss is simply the pixel‑wise reconstruction error, encouraging the latent to capture everything needed to predict motion (a minimal training sketch follows this list).
  2. Encoder‑Decoder Architecture
    • Motion Encoder: extracts a compact motion vector from a short frame sequence.
    • Content Encoder: separately encodes static appearance from a single reference frame.
    • Decoder: combines motion and content codes to reconstruct future frames.
  3. Disentanglement by Design – The content encoder’s parameters are frozen when training the motion encoder, forcing the motion branch to explain all temporal changes.
  4. Adapter Modules – Tiny neural adapters map DisMo’s motion vectors into the latent space of any downstream video generator (e.g., a diffusion model). This makes the approach plug‑and‑play: upgrade the video generator later, keep the same motion embeddings.
  5. Zero‑Shot Evaluation – The motion embeddings are directly fed to a linear classifier to test how well they capture action semantics without any fine‑tuning.
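Steps 1–3 amount to a standard encoder-decoder reconstruction loop. The sketch below is a minimal illustration under our own assumptions (module interfaces, tensor shapes, and the plain L2 loss are not taken from the paper's code): a frozen content encoder sees one reference frame, the motion encoder sees a short frame window, and the decoder must reconstruct a future frame in image space.

```python
import torch
import torch.nn.functional as F

def train_step(motion_enc, content_enc, decoder, clip, optimizer):
    """One reconstruction step. Assumes clip has shape (B, T, C, H, W).

    The content encoder is kept frozen so that every temporal change must
    be explained by the motion code (the disentanglement-by-design idea).
    """
    ref_frame    = clip[:, 0]      # single reference frame -> appearance
    frame_window = clip[:, :-1]    # short frame sequence   -> motion
    target       = clip[:, -1]     # future frame to reconstruct

    with torch.no_grad():          # content branch does not receive gradients
        content_code = content_enc(ref_frame)

    motion_code = motion_enc(frame_window)
    prediction  = decoder(content_code, motion_code)

    # Plain image-space reconstruction error; no adversarial or
    # contrastive terms are involved.
    loss = F.mse_loss(prediction, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```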

Results & Findings

  • Motion Transfer Quality – Qualitative demos show realistic transfer of actions like “dog jumping” onto a car, “human dancing” onto a cartoon character, and “object shaking” onto a completely different object class. The transferred videos retain the target’s appearance while faithfully reproducing the source motion.
  • Quantitative Metrics – Compared to prior motion‑transfer baselines, DisMo improves video‑FID scores by ~15 % and reduces motion drift (measured by optical‑flow consistency) by ~20 %.
  • Zero‑Shot Classification – On Something‑Something v2, DisMo’s motion embeddings achieve 68.3 % top‑1 accuracy, beating V‑JEPA’s 64.7 %. Similar gains are observed on the Jester dataset (a minimal linear‑probe sketch follows this list).
  • Adapter Efficiency – Adding adapters to a state‑of‑the‑art text‑to‑video diffusion model adds <0.5 M parameters (≈0.2 % of the base model) while preserving the model’s original generation quality.
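Methodology step 5 and the classification numbers above correspond to a linear probe trained on frozen motion embeddings. Below is a minimal scikit-learn sketch, assuming the embeddings have already been extracted and saved; the .npy file names are hypothetical, not part of DisMo.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical pre-extracted DisMo motion embeddings and action labels.
# The frozen motion encoder produces one embedding per clip; the encoder
# itself is never fine-tuned for this evaluation.
train_emb = np.load("train_motion_embeddings.npy")   # (N_train, D)
train_lbl = np.load("train_labels.npy")              # (N_train,)
val_emb   = np.load("val_motion_embeddings.npy")     # (N_val, D)
val_lbl   = np.load("val_labels.npy")                # (N_val,)

# A single linear classifier on top of the frozen features.
probe = LogisticRegression(max_iter=1000)
probe.fit(train_emb, train_lbl)

top1 = probe.score(val_emb, val_lbl)
print(f"Linear-probe top-1 accuracy: {top1:.3f}")
```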

Practical Implications

  • Content Creation Pipelines – Video editors can now extract a motion “style” from any clip and apply it to new assets (e.g., animating a 3D model with a real‑world dance without manual rigging).
  • Game Development – Procedural animation systems can reuse a library of motion embeddings to drive characters, props, or UI elements, reducing the need for hand‑crafted keyframes.
  • Augmented Reality & VFX – Real‑time motion transfer enables on‑the‑fly retargeting of live‑camera footage onto virtual avatars or objects, expanding interactive AR experiences.
  • Future‑Proof Integration – Because DisMo works via lightweight adapters, any improvements in underlying video generators (e.g., faster diffusion samplers, higher‑resolution models) can be leveraged instantly without retraining the motion encoder.
  • Action Understanding APIs – The motion embeddings can serve as compact descriptors for video search, recommendation, or automated moderation tools, offering a more semantically meaningful alternative to raw pixel or optical‑flow features (a small retrieval sketch follows this list).
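As a concrete illustration of the last bullet, motion embeddings can be compared directly for nearest-neighbour video search. The snippet below is a generic cosine-similarity lookup over pre-computed embeddings; the file names and the flat in-memory index are assumptions made for illustration, not a DisMo API.

```python
import numpy as np

def cosine_search(query_emb: np.ndarray, index_embs: np.ndarray, top_k: int = 5):
    """Return the indices and similarity scores of the top_k clips whose
    motion embedding is closest to the query.

    query_emb: (D,), index_embs: (N, D).
    """
    q = query_emb / np.linalg.norm(query_emb)
    idx = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    scores = idx @ q                       # cosine similarity per indexed clip
    order = np.argsort(-scores)[:top_k]    # best matches first
    return order, scores[order]


# Hypothetical usage: embeddings pre-computed with the frozen motion encoder.
library = np.load("clip_motion_embeddings.npy")    # (N, D) motion descriptors
query   = np.load("query_motion_embedding.npy")    # (D,) e.g. a "jumping" clip
top_idx, top_scores = cosine_search(query, library)
print(top_idx, top_scores)
```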

Limitations & Future Work

  • Temporal Horizon – The current reconstruction loss focuses on short‑term prediction (next few frames). Long‑range dependencies (e.g., complex choreography) may degrade without additional temporal modeling.
  • Domain Gaps – While DisMo works across diverse categories, extreme visual domain shifts (e.g., medical imaging to cartoon) can still cause subtle artifacts, suggesting a need for domain‑adaptive fine‑tuning.
  • Real‑Time Constraints – The motion encoder is lightweight, but the downstream video generator (especially diffusion‑based) remains computationally heavy for real‑time applications.
  • Future Directions – The authors propose extending the framework to multi‑modal conditioning (audio‑driven motion), incorporating hierarchical motion codes for longer sequences, and exploring self‑supervised pre‑training on massive web video corpora to further improve zero‑shot understanding.

Authors

  • Thomas Ressler-Antal
  • Frank Fundel
  • Malek Ben Alaya
  • Stefan Andreas Baumann
  • Felix Krause
  • Ming Gui
  • Björn Ommer

Paper Information

  • arXiv ID: 2511.23428v1
  • Categories: cs.CV
  • Published: November 28, 2025
