[Paper] The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

Published: December 18, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.16924v1

Overview

The paper introduces WorldCanvas, a new framework that lets users “paint” dynamic video scenes by mixing three intuitive inputs: natural‑language prompts, motion trajectories, and reference images. By fusing these modalities, the system can generate coherent, controllable video events—think multi‑agent interactions, objects that appear or disappear on cue, or even physically implausible actions—while preserving object identity and scene consistency throughout the clip.

Key Contributions

  • Multimodal Prompting Engine – Combines text, 2‑D/3‑D trajectories, and reference images into a single, unified control signal for video synthesis.
  • Trajectory‑Driven Motion Encoding – Introduces a compact representation that captures where, when, and how objects move, including visibility flags for entry/exit.
  • Reference‑Guided Appearance – Uses exemplar images to lock down the visual style and identity of generated objects, enabling fine‑grained control over look and texture.
  • Consistent World Modeling – Demonstrates emergent temporal consistency: objects retain their identity and scene layout even after temporary occlusions or “magical” disappearances.
  • Open‑source Demo & Dataset – Provides a project page with code, pretrained models, and a curated set of prompt‑trajectory‑image triples for reproducibility.

Methodology

WorldCanvas builds on a diffusion‑based video generator but augments it with two novel conditioning streams (a minimal sketch follows the list):

  1. Trajectory Conditioning – Each moving entity is described by a sequence of (x, y) coordinates plus a visibility flag per frame. This trajectory is embedded via a small transformer that injects motion cues directly into the diffusion latent space.
  2. Reference Image Conditioning – A single image of the target object is passed through a pretrained vision encoder (e.g., CLIP‑ViT). Its embedding is fused with the text embedding, ensuring the generated object matches the supplied visual style.
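
To make the two streams concrete, the sketch below shows one way they could be implemented in PyTorch, assuming a small transformer over per‑frame (x, y, visibility) tokens and precomputed CLIP‑style image and text embeddings; the class names, layer sizes, and fusion scheme are illustrative, not the authors' exact architecture.

```python
# Minimal sketch of the two conditioning streams (assumed details, not the
# authors' exact architecture). A small transformer embeds per-frame
# (x, y, visibility) tokens; the pooled motion embedding is fused with
# CLIP-style image and text embeddings into a single control vector.
import torch
import torch.nn as nn


class TrajectoryEncoder(nn.Module):
    """Embed a sequence of (x, y, visibility) tokens for one moving entity."""

    def __init__(self, d_model: int = 256, num_layers: int = 2, num_frames: int = 48):
        super().__init__()
        self.input_proj = nn.Linear(3, d_model)  # (x, y, visible) -> d_model
        self.pos_emb = nn.Parameter(torch.zeros(1, num_frames, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        # traj: (batch, frames, 3) with normalized x, y and a 0/1 visibility flag
        h = self.input_proj(traj) + self.pos_emb[:, : traj.shape[1]]
        return self.encoder(h).mean(dim=1)  # pooled motion embedding


class ConditionFuser(nn.Module):
    """Fuse motion, reference-image, and text embeddings into one control signal."""

    def __init__(self, d_motion: int = 256, d_clip: int = 512, d_out: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_motion + 2 * d_clip, d_out), nn.GELU(), nn.Linear(d_out, d_out)
        )

    def forward(self, motion, image_emb, text_emb):
        return self.mlp(torch.cat([motion, image_emb, text_emb], dim=-1))


# Toy usage: one entity over 48 frames; random tensors stand in for CLIP outputs.
traj = torch.rand(1, 48, 3)      # normalized x, y plus visibility flag
image_emb = torch.randn(1, 512)  # reference-image embedding (e.g. CLIP-ViT)
text_emb = torch.randn(1, 512)   # caption embedding
control = ConditionFuser()(TrajectoryEncoder()(traj), image_emb, text_emb)
print(control.shape)             # torch.Size([1, 768])
```

In the full system, the fused control signal would then condition the diffusion backbone, for example through cross‑attention at each denoising step.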

During training, the model sees paired data: a short video clip, the corresponding textual description, the ground‑truth trajectories (extracted via off‑the‑shelf trackers), and a reference frame sampled from the clip. The loss combines the standard diffusion denoising objective with auxiliary alignment terms that penalize drift from the provided trajectories and reference appearance.
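
A rough sketch of such a composite objective is given below; the specific weights and distance functions are assumptions rather than the paper's exact formulation.

```python
# Illustrative composite training objective (assumed form and weights, not the
# paper's exact losses): diffusion denoising plus auxiliary penalties for
# drifting off the provided trajectory or reference appearance.
import torch
import torch.nn.functional as F


def worldcanvas_loss(eps_pred, eps_true, traj_pred, traj_gt, feat_pred, feat_ref,
                     w_traj: float = 0.1, w_app: float = 0.1) -> torch.Tensor:
    denoise = F.mse_loss(eps_pred, eps_true)     # standard diffusion denoising term
    traj_align = F.l1_loss(traj_pred, traj_gt)   # keep objects on the supplied path
    app_align = F.mse_loss(feat_pred, feat_ref)  # match reference appearance features
    return denoise + w_traj * traj_align + w_app * app_align


# Toy shapes: predicted/true noise, per-frame object positions, appearance features.
loss = worldcanvas_loss(torch.randn(2, 4, 16, 32, 32), torch.randn(2, 4, 16, 32, 32),
                        torch.rand(2, 48, 2), torch.rand(2, 48, 2),
                        torch.randn(2, 512), torch.randn(2, 512))
print(loss.item())
```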

At inference time, users can supply any combination of the three input modalities, and the model synthesizes a video that respects every supplied constraint.
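
As a usage illustration, a prompt specification with every field optional could look like the following; the names are hypothetical placeholders, not the released API.

```python
# Hypothetical prompt specification showing that any subset of the three
# inputs can be supplied; all names here are placeholders, not the released API.
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class EventPrompt:
    text: Optional[str] = None                                        # caption
    trajectory: Optional[Sequence[Tuple[float, float, bool]]] = None  # (x, y, visible) per frame
    reference_image: Optional[str] = None                             # path to an exemplar image


# Text + trajectory only; the object should vanish when visibility turns False.
prompt = EventPrompt(
    text="a paper plane glides across the room and vanishes",
    trajectory=[(0.1, 0.8, True), (0.5, 0.5, True), (0.9, 0.3, False)],
)
print(prompt.reference_image is None)  # True: appearance left to the text prompt
```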

Results & Findings

  • Qualitative: The generated videos exhibit smooth motion that follows the supplied paths, accurate object textures matching the reference images, and consistent scene layout even when objects temporarily vanish.
  • Quantitative: On a held‑out benchmark, WorldCanvas improves trajectory adherence (measured by average endpoint error, sketched after this list) by ≈30 % over text‑only baselines, and boosts appearance fidelity (measured by LPIPS against reference frames) by ≈22 %.
  • User Study: In a 30‑participant evaluation, 78 % of users rated WorldCanvas outputs as “more controllable” than existing text‑to‑video tools, and 65 % found the multimodal prompts “intuitive for creative prototyping.”
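
For context, average endpoint error is simply the mean Euclidean distance between the generated object's positions and the user‑supplied trajectory; a minimal sketch, assuming 2‑D coordinates:

```python
# Minimal sketch of average endpoint error (AEE): mean Euclidean distance
# between generated object positions and the user-supplied trajectory
# (assuming 2-D coordinates; the paper's exact protocol may differ).
import numpy as np


def average_endpoint_error(pred: np.ndarray, target: np.ndarray) -> float:
    # pred, target: (frames, 2) arrays of (x, y) positions
    return float(np.linalg.norm(pred - target, axis=-1).mean())


print(average_endpoint_error(np.array([[0.0, 0.0], [3.0, 4.0]]),
                             np.array([[0.0, 0.0], [0.0, 0.0]])))  # 2.5
```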

Practical Implications

  • Rapid Prototyping for Games & AR/VR – Designers can script character motions, object spawns, and visual styles without writing code or hand‑animating assets.
  • Automated Content Generation – Marketing teams could generate short product demos by feeding a product photo (reference) and a simple storyboard (trajectory + caption).
  • Simulation & Training – Robotics researchers can create synthetic video scenarios with precise motion patterns and visual cues for domain‑randomized training.
  • Creative Tools – Artists can experiment with “impossible” physics (e.g., objects moving against gravity) by simply adjusting trajectory timing, opening new avenues for visual storytelling.

Limitations & Future Work

  • Scalability of Trajectories – The current implementation handles only a modest number of agents (≈5) before inference time grows noticeably; scaling to crowded scenes remains an open challenge.
  • Resolution & Duration – Generated videos are limited to 256 × 256 px and ~3 seconds; higher‑resolution, longer clips will require more efficient diffusion backbones.
  • Generalization to Unseen Objects – While reference images guide appearance, the model sometimes struggles with objects that differ drastically from training data (e.g., exotic wildlife).
  • Future Directions – The authors plan to integrate hierarchical scene graphs for better multi‑object coordination, explore latent‑space upscaling for HD output, and open a community benchmark for multimodal video synthesis.

Authors

  • Hanlin Wang
  • Hao Ouyang
  • Qiuyu Wang
  • Yue Yu
  • Yihao Meng
  • Wen Wang
  • Ka Leong Cheng
  • Shuailei Ma
  • Qingyan Bai
  • Yixuan Li
  • Cheng Chen
  • Yanhong Zeng
  • Xing Zhu
  • Yujun Shen
  • Qifeng Chen

Paper Information

  • arXiv ID: 2512.16924v1
  • Categories: cs.CV
  • Published: December 18, 2025