[Paper] Choreographing a World of Dynamic Objects

Published: January 7, 2026 at 01:59 PM EST
4 min read

Source: arXiv - 2601.04194v1

Overview

The paper introduces CHORD, a universal generative pipeline that can “choreograph” the motion of dynamic 3‑D objects over time—think of a system that can automatically synthesize realistic 4‑D (3‑D + time) scenes such as deforming cloth, colliding rigid bodies, or articulated robots. By leveraging recent video‑generation models and a novel distillation step, CHORD extracts the underlying physics‑style motion (Lagrangian) from ordinary 2‑D video footage, making it possible to generate diverse, category‑agnostic dynamics without hand‑crafted rules or massive labeled 3‑D datasets.

Key Contributions

  • Universal motion synthesis: A single framework that works across object categories (rigid, deformable, articulated) without needing per‑class heuristics.
  • Distillation from Eulerian to Lagrangian: Converts pixel‑level (Eulerian) video representations into object‑centric (Lagrangian) trajectories, preserving rich motion cues.
  • Category‑agnostic pipeline: No reliance on large, annotated 3‑D datasets; the system can be trained on readily available 2‑D video collections.
  • Demonstrated versatility: Generates multi‑body interactions, complex deformations, and even robot manipulation policies from the same backbone.
  • Open‑source release: Code, pretrained models, and a project page are provided for reproducibility and community extension.

Methodology

  1. Video‑generative backbone – CHORD starts with a state‑of‑the‑art 2‑D video diffusion model that learns to produce realistic pixel sequences from text or latent prompts.
  2. Eulerian‑to‑Lagrangian distillation – A secondary network is trained to map the generated video frames to a set of object‑centric trajectories (positions, orientations, deformation parameters). This step extracts the “motion script” hidden in the pixel data (an illustrative trajectory record is sketched after this list).
  3. Scene assembly – The distilled trajectories are fed into a lightweight physics‑inspired renderer that reconstructs the 3‑D geometry over time, allowing the system to output full 4‑D meshes or point clouds.
  4. Control knobs – Users can steer the generation via textual prompts, latent vectors, or explicit constraints (e.g., “make the ball bounce twice”). The same pipeline can be repurposed for downstream tasks such as generating robot action sequences.
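
The distilled “motion script” from step 2 can be pictured as one record per object. The sketch below is only an illustration of such an object‑centric (Lagrangian) trajectory; the class name, fields, and array shapes are assumptions for exposition, not CHORD’s actual data format.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class LagrangianTrajectory:
    """Illustrative per-object motion record; the paper's parameterization may differ."""
    positions: np.ndarray     # (T, 3) object translation per frame
    orientations: np.ndarray  # (T, 4) unit quaternion per frame
    deformation: np.ndarray   # (T, D) low-dimensional deformation code per frame

    def num_frames(self) -> int:
        # All fields share the same temporal length T.
        return self.positions.shape[0]
```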

The overall design keeps the heavy lifting (learning visual dynamics) in the 2‑D domain—where data is abundant—while the distillation step bridges the gap to 3‑D physics‑style representations.
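
To make that data flow concrete, here is a minimal sketch of the four stages as plain Python functions. Every name and signature below (`generate_video`, `distill_to_trajectories`, `assemble_scene`, `choreograph`) is a hypothetical stand‑in for illustration, not CHORD’s published interface.

```python
import numpy as np


def generate_video(prompt: str, num_frames: int = 48) -> np.ndarray:
    """Stage 1 (stand-in): a 2-D video diffusion backbone renders a
    (num_frames, H, W, 3) pixel sequence conditioned on the prompt."""
    raise NotImplementedError("placeholder for the pretrained video model")


def distill_to_trajectories(video: np.ndarray) -> list:
    """Stage 2 (stand-in): map Eulerian pixels to a list of per-object
    Lagrangian trajectories (poses plus deformation parameters)."""
    raise NotImplementedError("placeholder for the distillation network")


def assemble_scene(trajectories: list):
    """Stage 3 (stand-in): drive a lightweight physics-inspired renderer
    with the trajectories to produce time-varying meshes or point clouds."""
    raise NotImplementedError("placeholder for the scene assembler")


def choreograph(prompt: str):
    """Stage 4 (stand-in): end-to-end control from a textual prompt,
    chaining 2-D generation, distillation, and 4-D scene assembly."""
    video = generate_video(prompt)
    trajectories = distill_to_trajectories(video)
    return assemble_scene(trajectories)
```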

Results & Findings

  • Diverse dynamics – CHORD successfully synthesizes realistic motions for rigid bodies (bouncing cubes), deformable objects (cloth draping, soft toys squishing), and articulated agents (humanoid walking).
  • Quantitative edge – Compared against prior rule‑based graphics pipelines and learning‑based 3‑D generators, CHORD achieves higher fidelity scores (e.g., lower Chamfer distance to ground‑truth meshes; a minimal version of this metric is sketched after this list) while using 10× less labeled 3‑D data.
  • Robotics demo – By feeding the distilled trajectories into a simple motion‑planning module, the authors generate feasible manipulation policies for a simulated robot arm, showing that the motion scripts are physically plausible.
  • User study – Non‑expert participants rated CHORD‑generated videos as more “natural” and “coherent” than those from baseline methods, confirming the perceptual quality of the synthesized dynamics.
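
For reference, the Chamfer distance mentioned above measures how closely two point clouds agree. Below is a minimal NumPy sketch of the standard symmetric variant; the exact formulation used in the paper (squared vs. unsquared distances, normalization) may differ.

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Each point is matched to its nearest neighbor in the other set, and the
    two directed average distances are summed. Brute-force O(N*M), which is
    fine for small evaluation clouds.
    """
    diffs = a[:, None, :] - b[None, :, :]          # (N, M, 3) pairwise offsets
    dists = np.linalg.norm(diffs, axis=-1)         # (N, M) Euclidean distances
    a_to_b = dists.min(axis=1).mean()              # each point in a -> closest in b
    b_to_a = dists.min(axis=0).mean()              # each point in b -> closest in a
    return float(a_to_b + b_to_a)


# Sanity check: identical clouds have zero Chamfer distance.
pts = np.random.rand(100, 3)
assert chamfer_distance(pts, pts) == 0.0
```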

Practical Implications

  • Rapid prototyping for VFX & games – Artists can generate complex object interactions (e.g., collapsing structures, flowing fabrics) with a few textual prompts, cutting down on manual rigging and simulation setup.
  • Data augmentation for robotics – Simulated training data that includes realistic object dynamics can be produced on‑the‑fly (a hypothetical prompt‑randomization loop is sketched after this list), improving policy learning for manipulation and navigation tasks.
  • Cross‑domain content creation – Since the pipeline works with any 2‑D video source, developers can repurpose existing footage (e.g., sports clips) to create new 3‑D experiences for AR/VR.
  • Research tool – Researchers studying physical reasoning or embodied AI can use CHORD to generate controlled, diverse dynamic scenes without building bespoke simulators for each object type.
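
As a hypothetical example of the robotics data‑augmentation use above, one could wrap the generation pipeline in a prompt‑randomization loop. The prompt fragments and the `choreograph` callable below are purely illustrative assumptions, not part of the paper.

```python
import random

# Illustrative prompt fragments; any scene/motion vocabulary would do.
OBJECTS = ["cloth sheet", "rigid cube", "articulated lamp"]
MOTIONS = ["drops onto a table", "bounces twice", "is pushed to the left"]


def sample_training_scenes(n: int, choreograph) -> list:
    """Generate n randomized dynamic scenes for downstream policy learning.

    `choreograph` is any callable mapping a text prompt to a 4-D scene,
    standing in for the generation pipeline."""
    scenes = []
    for _ in range(n):
        prompt = f"a {random.choice(OBJECTS)} {random.choice(MOTIONS)}"
        scenes.append(choreograph(prompt))
    return scenes
```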

Limitations & Future Work

  • Physics fidelity – While motions look plausible, the underlying dynamics are not guaranteed to obey exact physical laws (e.g., conservation of momentum), limiting use in high‑precision engineering simulations.
  • Resolution & detail – The quality of fine‑grained deformation (e.g., cloth wrinkles) depends on the video backbone’s resolution; scaling up may require more compute.
  • Generalization to unseen physics – Extreme phenomena (explosions, fluid‑particle interactions) were not evaluated and may need additional conditioning.
  • Future directions – The authors plan to integrate explicit physics constraints during distillation, explore higher‑resolution video models, and extend the framework to multi‑modal inputs (audio, tactile cues) for richer scene synthesis.

Authors

  • Yanzhe Lyu
  • Chen Geng
  • Karthik Dharmarajan
  • Yunzhi Zhang
  • Hadi Alzayer
  • Shangzhe Wu
  • Jiajun Wu

Paper Information

  • arXiv ID: 2601.04194v1
  • Categories: cs.CV, cs.GR, cs.RO
  • Published: January 7, 2026