[Paper] Controllable Dynamic 3D Shape Generation via 3D Trajectories and Text

Published: June 3, 2026 at 01:58 PM EDT

Source: arXiv - 2606.05162v1

Overview

The paper presents T2Mo, a feed‑forward neural framework that synthesizes dynamic 3D shapes (e.g., animated meshes) from two intuitive inputs: a text description and a 3D trajectory that pins down how specific points on the object should move. By combining semantic guidance (the text) with precise spatial control (the trajectory), T2Mo generates motions that are both expressive and faithful to user‑specified paths, something that pure text‑to‑animation models struggle to achieve.

Key Contributions

Dual‑modal conditioning: Introduces a novel way to combine natural‑language prompts with explicit 3D trajectory constraints for controllable motion synthesis.
Shape‑grounded trajectory embedding: A robust encoder that turns arbitrarily sparse or dense trajectory sets into a shape‑aware token sequence covering the whole mesh, enabling the model to handle any trajectory configuration.
End‑to‑end feed‑forward architecture: Generates dynamic meshes directly, avoiding costly iterative optimization or separate video‑generation pipelines.
Comprehensive evaluation: Provides quantitative metrics, qualitative visualizations, and user studies that show superior adherence to trajectories and higher semantic fidelity compared with text‑only and cascaded video‑based baselines.

Methodology

Input Representation
- Text Prompt: Encoded into tokens with a pretrained text encoder (e.g., CLIP‑text).
- 3D Trajectory: A set of point‑wise paths {p_i(t)} defined in object space, where p_i(t) is the position of tracked point i at time t; the set can be sparse (few control points) or dense (full surface trajectories).
Shape‑Grounded Trajectory Encoder
- Projects each trajectory point onto the underlying static mesh.
- Uses a graph‑based network (e.g., a graph neural network) to diffuse the sparse trajectory information across the whole surface, producing a trajectory token sequence that is aware of the object’s geometry.
Fusion Module
- Concatenates text tokens and trajectory tokens.
- Passes them through a transformer decoder that predicts per‑vertex displacement fields over a sequence of time steps, effectively “animating” the static mesh.
Mesh Decoder
- Applies the predicted displacements to the canonical mesh, yielding a series of animated meshes (dynamic 3D shape).
- The whole pipeline is fully differentiable and runs in a single forward pass, making it fast enough for interactive use.
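
The data flow of the last three steps can be illustrated with a small, non-learned sketch. It is not the T2Mo architecture: plain neighbor averaging over the mesh graph stands in for the learned trajectory encoder and transformer decoder, and the function names (build_adjacency, diffuse_displacements, animate) are hypothetical. The sketch assumes each trajectory is attached to a vertex index and lists one target position per frame.

```python
from typing import Dict, List, Set, Tuple

Vec3 = Tuple[float, float, float]


def build_adjacency(num_vertices: int, faces: List[Tuple[int, int, int]]) -> List[Set[int]]:
    """Collect each vertex's neighbors from the triangle list."""
    adj: List[Set[int]] = [set() for _ in range(num_vertices)]
    for a, b, c in faces:
        adj[a].update((b, c))
        adj[b].update((a, c))
        adj[c].update((a, b))
    return adj


def diffuse_displacements(adj: List[Set[int]], pinned: Dict[int, Vec3],
                          iterations: int = 50) -> List[Vec3]:
    """Spread sparse displacements over the surface by repeated neighbor averaging.

    Assumption: this hand-written smoothing replaces the learned graph network.
    Pinned vertices keep their prescribed displacement on every iteration.
    """
    n = len(adj)
    disp: List[Vec3] = [pinned.get(i, (0.0, 0.0, 0.0)) for i in range(n)]
    for _ in range(iterations):
        nxt: List[Vec3] = []
        for i in range(n):
            if i in pinned or not adj[i]:
                nxt.append(disp[i])
                continue
            k = len(adj[i])
            nxt.append(tuple(sum(disp[j][d] for j in adj[i]) / k for d in range(3)))
        disp = nxt
    return disp


def animate(vertices: List[Vec3], adj: List[Set[int]],
            trajectories: Dict[int, List[Vec3]]) -> List[List[Vec3]]:
    """Return one deformed copy of the canonical mesh per frame.

    trajectories maps a vertex index to its target position in each frame.
    """
    num_frames = len(next(iter(trajectories.values())))
    frames: List[List[Vec3]] = []
    for t in range(num_frames):
        # Displacement of each controlled vertex relative to the canonical mesh.
        pinned = {
            i: tuple(path[t][d] - vertices[i][d] for d in range(3))
            for i, path in trajectories.items()
        }
        disp = diffuse_displacements(adj, pinned)
        # Mesh decoding step: canonical position plus predicted displacement.
        frames.append([tuple(v[d] + u[d] for d in range(3)) for v, u in zip(vertices, disp)])
    return frames


# Toy example: a two-triangle quad; vertex 0 stays fixed, vertex 3 lifts by 0.5.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
faces = [(0, 1, 2), (1, 2, 3)]
adjacency = build_adjacency(len(vertices), faces)
trajectories = {
    0: [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    3: [(1.0, 1.0, 0.0), (1.0, 1.0, 0.5)],
}
frames = animate(vertices, adjacency, trajectories)
```

In this toy case the two uncontrolled vertices settle halfway between the fixed corner and the lifted corner, which shows how a sparse set of paths can still determine a displacement for every vertex. In the paper this propagation is learned and also conditioned on the text prompt.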

Results & Findings

Trajectory Fidelity: On a custom metric measuring average point‑wise deviation from the supplied trajectories, T2Mo outperforms text‑only baselines by ≈30 % and cascaded video pipelines by ≈18 %.
Semantic Alignment: Human evaluators rated T2Mo’s motions as matching the textual description 4.2/5 on average, compared to 3.5 for the strongest baseline.
Expressiveness: The model can generate complex motions (e.g., “a bird flapping its wings while spiraling upward”) that respect both the high‑level intent and low‑level path constraints.
Speed: Inference takes ≈120 ms for a 30‑frame animation on an RTX 3090 (about 4 ms per frame), enabling near‑real‑time prototyping.
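
The trajectory-fidelity number can be made concrete with a short sketch. The summary does not give the paper's exact definition, so this assumes the metric is the mean Euclidean distance between each controlled vertex and its target position, averaged over all controlled points and frames, and that the reported percentages are relative reductions in that error. Both function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


def mean_trajectory_deviation(frames: List[List[Vec3]],
                              trajectories: Dict[int, List[Vec3]]) -> float:
    """Average distance between generated and target positions.

    frames[t][i] is the generated position of vertex i in frame t;
    trajectories[i][t] is the target position of vertex i in frame t.
    """
    total = 0.0
    count = 0
    for idx, path in trajectories.items():
        for t, target in enumerate(path):
            total += math.dist(frames[t][idx], target)
            count += 1
    return total / count if count else 0.0


def relative_improvement(baseline_error: float, model_error: float) -> float:
    """Fraction by which the model's error is lower than the baseline's (assumed reading)."""
    return (baseline_error - model_error) / baseline_error


# Toy example: one controlled vertex, two frames.
generated = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]]
targets = {0: [(3.0, 4.0, 0.0), (1.0, 0.0, 0.0)]}
deviation = mean_trajectory_deviation(generated, targets)
```

Here the first frame misses its target by 5 units and the second frame hits it exactly, so the mean deviation is 2.5. Lower is better.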

Practical Implications

Game Development & VR/AR: Designers can quickly prototype character or object animations by writing a short description and sketching a few control paths, dramatically reducing the iteration cycle compared with hand‑keyframing.
Robotics Simulation: Engineers can define desired end‑effector trajectories and high‑level task semantics (e.g., “pick up the cup gently”) to generate realistic object motions for training simulators.
Content Creation Platforms: 3D asset marketplaces could offer “text‑plus‑trajectory” generation tools, allowing creators to produce custom animated assets on demand without deep animation expertise.
Data Augmentation: Synthetic dynamic meshes generated by T2Mo can enrich training sets for downstream tasks such as 3D action recognition or motion prediction.

Limitations & Future Work

Static Mesh Dependency: T2Mo assumes a pre‑existing canonical mesh; generating both geometry and motion jointly remains an open challenge.
Trajectory Ambiguity: Extremely sparse or contradictory trajectories can lead to unrealistic deformations; the authors suggest integrating physics‑based regularizers.
Scalability to High‑Resolution Meshes: While the current implementation works well on meshes up to ~10k vertices, handling millions of vertices will require hierarchical or point‑cloud‑based extensions.
Generalization to Unseen Object Categories: The model performs best on categories seen during training; future work could explore domain‑adaptive fine‑tuning or meta‑learning to broaden applicability.
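
As one illustration of what a physics-based regularizer could look like, the sketch below penalizes changes in edge length between the canonical mesh and a deformed frame. This is a common geometric prior chosen here as an example; the summary does not say which regularizers the authors have in mind, and edge_length_penalty is a hypothetical name.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def edge_length_penalty(rest: List[Vec3], deformed: List[Vec3],
                        edges: List[Tuple[int, int]]) -> float:
    """Mean squared change in edge length between the rest and deformed meshes.

    Rigid motions leave every edge length unchanged and score zero;
    stretching or compressing the surface increases the penalty.
    """
    if not edges:
        return 0.0
    total = 0.0
    for i, j in edges:
        rest_len = math.dist(rest[i], rest[j])
        new_len = math.dist(deformed[i], deformed[j])
        total += (new_len - rest_len) ** 2
    return total / len(edges)


# Toy example: a single edge of length 1.
rest = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
edges = [(0, 1)]
translated = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]  # rigid shift, no stretch
stretched = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]   # edge doubled in length
penalty_translated = edge_length_penalty(rest, translated, edges)
penalty_stretched = edge_length_penalty(rest, stretched, edges)
```

A term like this, added to a training loss or used as a post-hoc check, would flag the unrealistic stretching that sparse or contradictory trajectories can cause.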

Authors

Jaeyeong Kim
Ines Kim
Jahyeok Koo
Seungryong Kim

Paper Information

arXiv ID: 2606.05162v1
Categories: cs.CV
Published: June 3, 2026
