[Paper] Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation
Source: arXiv - 2603.02190v1
Overview
Sketch2Colab is a new framework that turns simple 2‑D storyboard sketches into realistic, object‑aware 3‑D multi‑human animations. By combining a sketch‑driven diffusion prior with a fast “rectified‑flow” student model, it lets developers generate coordinated human‑object interactions while keeping tight control over agents, joints, timing, and contacts.
Key Contributions
- Sketch‑conditioned motion generation – Directly maps storyboard‑style sketches to 3‑D multi‑human motion without needing large amounts of paired motion data.
- Two‑stage diffusion‑to‑flow distillation – Learns a diffusion prior, then distills it into a latent‑space rectified‑flow student for rapid, stable sampling.
- Differentiable constraint energies – Integrates keyframe, trajectory, and physics‑based losses that steer the flow field to satisfy rich interaction constraints.
- CTMC‑based event planner – Introduces a continuous‑time Markov chain that schedules discrete events (touches, grasps, handoffs) to produce crisp, well‑phased collaborations.
- State‑of‑the‑art adherence & speed – Demonstrates higher constraint satisfaction and perceptual quality than diffusion‑only baselines while cutting inference time by an order of magnitude.
Methodology
- Sketch‑driven diffusion prior – A conditional diffusion model is trained to predict latent motion representations from 2‑D sketches. The model learns the distribution of plausible multi‑human motions that respect the sketch’s spatial layout.
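As a rough illustration of this step (not the authors' code), the standard epsilon‑prediction objective for a sketch‑conditioned diffusion prior can be sketched in NumPy. The `denoiser` and `sketch_emb` below are placeholder stand‑ins for the paper's learned networks and sketch encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear noise schedule over T steps (illustrative values).
T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def add_noise(x0, t):
    """Forward diffusion: corrupt a clean latent motion x0 to step t."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps

# Hypothetical stand-in for the sketch-conditioned denoiser: a real model
# would be a neural network taking (xt, t, sketch_embedding).
def denoiser(xt, t, sketch_emb):
    return np.zeros_like(xt) + sketch_emb.mean()  # placeholder eps prediction

x0 = rng.standard_normal((4, 16))     # latent motion for 4 agents (assumed shape)
sketch_emb = rng.standard_normal(32)  # encoded 2-D sketch (assumed)

t = 50
xt, eps = add_noise(x0, t)
# Epsilon-prediction MSE, the usual DDPM-style training loss.
loss = np.mean((denoiser(xt, t, sketch_emb) - eps) ** 2)
```

Training such a prior on sketch/motion pairs is what lets the model capture the distribution of layouts described in the storyboard.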
- Rectified‑flow distillation – The diffusion prior is distilled into a deterministic flow model that operates in the same latent space. This “student” model learns a transport map that can generate samples in a single forward pass, dramatically speeding up inference.
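The property the distillation exploits is that rectified flows follow (near‑)straight transport paths, so a single Euler step matches what many small steps would produce. A toy NumPy sketch under an assumed linear teacher; all names (`student_velocity`, the affine map `A`, `b`) are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "teacher" pairs: data latents are an affine transform of noise.
A, b = 2.0, 0.5
x0 = rng.standard_normal((1000, 8))  # noise samples
x1 = A * x0 + b                      # paired "data" latents

# Along the straight path xt = (1-t)*x0 + t*x1, the rectified-flow velocity
# for a pair is constant: v = x1 - x0. In this linear toy the optimal
# velocity field is recoverable in closed form.
def student_velocity(x, t):
    # Invert xt = (1-t)*x0 + t*(A*x0 + b)  =>  x0 = (x - t*b) / (1 - t + t*A)
    x0_hat = (x - t * b) / (1.0 - t + t * A)
    return (A - 1.0) * x0_hat + b

# One-step sampling: noise -> data latent in a single forward pass.
z = rng.standard_normal((5, 8))
one_step = z + student_velocity(z, 0.0) * 1.0

# Multi-step Euler integration of the same field lands at the same point,
# because the trajectories are straight.
x = z.copy()
n = 50
for i in range(n):
    t = i / n
    x = x + student_velocity(x, t) / n
```

With a neural student this equivalence holds only approximately, which is why the paper reports an order-of-magnitude speedup rather than a free lunch.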
- Energy‑guided transport – Differentiable energy terms encode:
  - Keyframe constraints (specific joint positions at given times)
  - Trajectory constraints (desired paths for hands or objects)
  - Physics constraints (collision avoidance, ground contact)
  These energies are back‑propagated into the flow field, nudging generated motions toward satisfying the storyboard.
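The guidance mechanism can be mimicked in a toy NumPy sketch: an assumed base flow field is corrected by the analytic gradient of a keyframe energy, pulling the trajectory toward a target position. The field, energy, and weight below are invented for illustration and are not the paper's actual terms:

```python
import numpy as np

# Illustrative base flow field pushing the latent toward the origin.
def base_velocity(x, t):
    return -x

# Keyframe energy: squared distance to a target joint position, with its
# analytic gradient grad ||x - target||^2 = 2*(x - target).
target = np.array([1.0, 2.0, 0.5])
def keyframe_energy_grad(x):
    return 2.0 * (x - target)

def guided_sample(x, steps=200, dt=0.04, weight=10.0):
    """Euler integration of the flow, nudged downhill on the energy."""
    for i in range(steps):
        t = i / steps
        v = base_velocity(x, t) - weight * keyframe_energy_grad(x)
        x = x + dt * v
    return x

x_init = np.array([5.0, -3.0, 2.0])
x_final = guided_sample(x_init.copy())
```

Because the guidance weight trades off against the base field, the constraint is satisfied approximately (here the fixed point sits close to, not exactly at, the target); in practice trajectory and physics energies enter the sum the same way.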
- CTMC event planner – A continuous‑time Markov chain models discrete interaction events. The planner samples a sequence of event times and types (e.g., “hand‑over at t=1.2 s”), which modulate the flow dynamics, ensuring that multi‑agent actions are temporally aligned and physically plausible.
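A minimal Gillespie-style sampler illustrates how a CTMC can schedule discrete interaction events: hold in the current state for an exponentially distributed time, then jump according to the rate matrix. The event set and rates below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative interaction events and CTMC rate matrix Q: off-diagonal
# entries are transition rates; diagonals make each row sum to zero.
events = ["idle", "reach", "grasp", "handoff", "release"]
Q = np.array([
    [-1.0, 1.0, 0.0, 0.0, 0.0],
    [ 0.0,-2.0, 2.0, 0.0, 0.0],
    [ 0.0, 0.0,-1.5, 1.5, 0.0],
    [ 0.0, 0.0, 0.0,-1.0, 1.0],
    [ 0.5, 0.0, 0.0, 0.0,-0.5],
])

def sample_event_schedule(horizon=5.0):
    """Sample (time, event) pairs up to `horizon` seconds."""
    t, state, schedule = 0.0, 0, []
    while True:
        rate = -Q[state, state]
        t += rng.exponential(1.0 / rate)       # holding time in current state
        if t >= horizon:
            break
        probs = Q[state].clip(min=0.0) / rate  # jump distribution over states
        state = rng.choice(len(events), p=probs)
        schedule.append((t, events[state]))
    return schedule

schedule = sample_event_schedule()
```

In the paper's pipeline, a sampled schedule of this kind would then modulate the flow dynamics so that, e.g., a handoff is phased crisply at its sampled time.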
- Latent‑space decoding – The final latent motion is decoded into full 3‑D skeletal trajectories and object poses, ready for rendering or downstream simulation.
Results & Findings
- Constraint adherence: On the CORE4D and InterHuman benchmarks, Sketch2Colab reduced keyframe error by ~35% and increased contact accuracy (e.g., hand‑object touches) by ~28% compared to diffusion‑only baselines.
- Perceptual quality: Human evaluators rated the generated animations 1.2× higher for realism on a 5‑point Likert scale than the baselines’ output.
- Inference speed: The rectified‑flow student generates a 5‑second multi‑human clip in ~120 ms on a single RTX 3090, versus ~1.5 s for the diffusion baseline.
- Robustness to multi‑entity conditioning: Even with 4 interacting agents and multiple objects, the system maintains stable sampling without mode collapse, a common failure mode in pure diffusion models.
Practical Implications
- Rapid prototyping for games and VR/AR: Designers can sketch a quick storyboard and instantly obtain a physically plausible multi‑character animation, cutting iteration cycles dramatically.
- Automated content generation pipelines: Studios can feed large libraries of 2‑D concept art into Sketch2Colab to bootstrap motion capture data, reducing reliance on expensive mocap sessions.
- Interactive robotics simulation: The CTMC planner’s event‑level control can be repurposed for simulating collaborative robot‑human tasks, where precise timing of handovers and grasps matters.
- AI‑assisted animation tools: Integration into existing tools (e.g., Blender, Unity) as a plug‑in would let artists refine sketches, adjust constraint weights, and instantly preview 3‑D motion.
Limitations & Future Work
- Sketch quality dependence: Extremely abstract or ambiguous sketches yield uncertain motion hypotheses; the system currently assumes reasonably clear spatial cues.
- Fixed body topology: The model is trained on standard human skeletons; extending to non‑human avatars or highly stylized rigs would require additional data.
- Physics realism: While basic contact and collision constraints are enforced, fine‑grained dynamics (e.g., cloth simulation, soft‑body deformation) are not modeled.
- Future directions: The authors plan to incorporate learned physics simulators for richer dynamics, explore multimodal conditioning (e.g., audio cues), and open‑source a lightweight SDK for easy integration into production pipelines.
Authors
- Divyanshu Daiya
- Aniket Bera
Paper Information
- arXiv ID: 2603.02190v1
- Categories: cs.CV, cs.AI, cs.GR, cs.HC, cs.LG
- Published: March 2, 2026