[Paper] Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation
Source: arXiv - 2603.02190v1
Overview
Sketch2Colab is a new framework that turns simple 2‑D storyboard sketches into realistic, object‑aware 3‑D multi‑human animations. By combining a sketch‑driven diffusion prior with a fast “rectified‑flow” student model, it lets developers generate coordinated human‑object interactions while keeping tight control over agents, joints, timing, and contacts.
Key Contributions
- Sketch‑conditioned motion generation – Directly maps storyboard‑style sketches to 3‑D multi‑human motion without needing large amounts of paired motion data.
- Two‑stage diffusion‑to‑flow distillation – Learns a diffusion prior, then distills it into a latent‑space rectified‑flow student for rapid, stable sampling.
- Differentiable constraint energies – Integrates keyframe, trajectory, and physics‑based losses that steer the flow field to satisfy rich interaction constraints.
- CTMC‑based event planner – Introduces a continuous‑time Markov chain that schedules discrete events (touches, grasps, handoffs) to produce crisp, well‑phased collaborations.
- State‑of‑the‑art adherence & speed – Demonstrates higher constraint satisfaction and perceptual quality than diffusion‑only baselines while cutting inference time by an order of magnitude.
Methodology
- Sketch‑driven diffusion prior – A conditional diffusion model is trained to predict latent motion representations from 2‑D sketches. The model learns the distribution of plausible multi‑human motions that respect the sketch’s spatial layout.
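As a rough illustration of this step (not the authors' code), the standard epsilon‑prediction objective for a sketch‑conditioned diffusion prior can be sketched in NumPy. The `denoiser` and `sketch_emb` below are placeholder stand‑ins for the paper's learned networks and sketch encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear noise schedule over T steps (illustrative values).
T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def add_noise(x0, t):
    """Forward diffusion: corrupt a clean latent motion x0 to step t."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps

# Hypothetical stand-in for the sketch-conditioned denoiser: a real model
# would be a neural network taking (xt, t, sketch_embedding).
def denoiser(xt, t, sketch_emb):
    return np.zeros_like(xt) + sketch_emb.mean()  # placeholder eps prediction

x0 = rng.standard_normal((4, 16))     # latent motion for 4 agents (assumed shape)
sketch_emb = rng.standard_normal(32)  # encoded 2-D sketch (assumed)

t = 50
xt, eps = add_noise(x0, t)
# Epsilon-prediction MSE, the usual DDPM-style training loss.
loss = np.mean((denoiser(xt, t, sketch_emb) - eps) ** 2)
```

Training such a prior on sketch/motion pairs is what lets the model capture the distribution of layouts described in the storyboard.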
- Rectified‑flow distillation – The diffusion prior is distilled into a deterministic flow model that operates in the same latent space. This “student” model learns a transport map that can generate samples in a single forward pass, dramatically speeding up inference.
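The property the distillation exploits is that rectified flows follow (near‑)straight transport paths, so a single Euler step matches what many small steps would produce. A toy NumPy sketch under an assumed linear teacher; all names (`student_velocity`, the affine map `A`, `b`) are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "teacher" pairs: data latents are an affine transform of noise.
A, b = 2.0, 0.5
x0 = rng.standard_normal((1000, 8))  # noise samples
x1 = A * x0 + b                      # paired "data" latents

# Along the straight path xt = (1-t)*x0 + t*x1, the rectified-flow velocity
# for a pair is constant: v = x1 - x0. In this linear toy the optimal
# velocity field is recoverable in closed form.
def student_velocity(x, t):
    # Invert xt = (1-t)*x0 + t*(A*x0 + b)  =>  x0 = (x - t*b) / (1 - t + t*A)
    x0_hat = (x - t * b) / (1.0 - t + t * A)
    return (A - 1.0) * x0_hat + b

# One-step sampling: noise -> data latent in a single forward pass.
z = rng.standard_normal((5, 8))
one_step = z + student_velocity(z, 0.0) * 1.0

# Multi-step Euler integration of the same field lands at the same point,
# because the trajectories are straight.
x = z.copy()
n = 50
for i in range(n):
    t = i / n
    x = x + student_velocity(x, t) / n
```

With a neural student this equivalence holds only approximately, which is why the paper reports an order-of-magnitude speedup rather than a free lunch.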
- Energy‑guided transport – Differentiable energy terms encode:
  - Keyframe constraints (specific joint positions at given times)
  - Trajectory constraints (desired paths for hands or objects)
  - Physics constraints (collision avoidance, ground contact)
  These energies are back‑propagated into the flow field, nudging generated motions toward satisfying the storyboard.
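The guidance mechanism can be mimicked in a toy NumPy sketch: an assumed base flow field is corrected by the analytic gradient of a keyframe energy, pulling the trajectory toward a target position. The field, energy, and weight below are invented for illustration and are not the paper's actual terms:

```python
import numpy as np

# Illustrative base flow field pushing the latent toward the origin.
def base_velocity(x, t):
    return -x

# Keyframe energy: squared distance to a target joint position, with its
# analytic gradient grad ||x - target||^2 = 2*(x - target).
target = np.array([1.0, 2.0, 0.5])
def keyframe_energy_grad(x):
    return 2.0 * (x - target)

def guided_sample(x, steps=200, dt=0.04, weight=10.0):
    """Euler integration of the flow, nudged downhill on the energy."""
    for i in range(steps):
        t = i / steps
        v = base_velocity(x, t) - weight * keyframe_energy_grad(x)
        x = x + dt * v
    return x

x_init = np.array([5.0, -3.0, 2.0])
x_final = guided_sample(x_init.copy())
```

Because the guidance weight trades off against the base field, the constraint is satisfied approximately (here the fixed point sits close to, not exactly at, the target); in practice trajectory and physics energies enter the sum the same way.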
- CTMC event planner – A continuous‑time Markov chain models discrete interaction events. The planner samples a sequence of event times and types (e.g., “hand‑over at t=1.2 s”), which modulate the flow dynamics, ensuring that multi‑agent actions are temporally aligned and physically plausible.
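A minimal Gillespie-style sampler illustrates how a CTMC can schedule discrete interaction events: hold in the current state for an exponentially distributed time, then jump according to the rate matrix. The event set and rates below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative interaction events and CTMC rate matrix Q: off-diagonal
# entries are transition rates; diagonals make each row sum to zero.
events = ["idle", "reach", "grasp", "handoff", "release"]
Q = np.array([
    [-1.0, 1.0, 0.0, 0.0, 0.0],
    [ 0.0,-2.0, 2.0, 0.0, 0.0],
    [ 0.0, 0.0,-1.5, 1.5, 0.0],
    [ 0.0, 0.0, 0.0,-1.0, 1.0],
    [ 0.5, 0.0, 0.0, 0.0,-0.5],
])

def sample_event_schedule(horizon=5.0):
    """Sample (time, event) pairs up to `horizon` seconds."""
    t, state, schedule = 0.0, 0, []
    while True:
        rate = -Q[state, state]
        t += rng.exponential(1.0 / rate)       # holding time in current state
        if t >= horizon:
            break
        probs = Q[state].clip(min=0.0) / rate  # jump distribution over states
        state = rng.choice(len(events), p=probs)
        schedule.append((t, events[state]))
    return schedule

schedule = sample_event_schedule()
```

In the paper's pipeline, a sampled schedule of this kind would then modulate the flow dynamics so that, e.g., a handoff is phased crisply at its sampled time.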
- Latent‑space decoding – The final latent motion is decoded into full 3‑D skeletal trajectories and object poses, ready for rendering or downstream simulation.
Results & Findings
- Constraint adherence: On the CORE4D and InterHuman benchmarks, Sketch2Colab reduced keyframe error by ~35% and increased contact accuracy (e.g., hand‑object touches) by ~28% compared to diffusion‑only baselines.
- Perceptual quality: Human evaluators rated the generated animations 1.2× higher for realism on a 5‑point Likert scale than the baselines’ output.
- Inference speed: The rectified‑flow student generates a 5‑second multi‑human clip in ~120 ms on a single RTX 3090, versus ~1.5 s for the diffusion baseline.
- Robustness to multi‑entity conditioning: Even with 4 interacting agents and multiple objects, the system maintains stable sampling without mode collapse, a common failure mode in pure diffusion models.
Practical Implications
- Rapid prototyping for games and VR/AR: Designers can sketch a quick storyboard and instantly obtain a physically plausible multi‑character animation, cutting iteration cycles dramatically.
- Automated content generation pipelines: Studios can feed large libraries of 2‑D concept art into Sketch2Colab to bootstrap motion capture data, reducing reliance on expensive mocap sessions.
- Interactive robotics simulation: The CTMC planner’s event‑level control can be repurposed for simulating collaborative robot‑human tasks, where precise timing of handovers and grasps matters.
- AI‑assisted animation tools: Integration into existing tools (e.g., Blender, Unity) as a plug‑in would let artists refine sketches, adjust constraint weights, and instantly preview 3‑D motion.
Limitations & Future Work
- Sketch quality dependence: Extremely abstract or ambiguous sketches yield uncertain motion hypotheses; the system currently assumes reasonably clear spatial cues.
- Fixed body topology: The model is trained on standard human skeletons; extending to non‑human avatars or highly stylized rigs would require additional data.
- Physics realism: While basic contact and collision constraints are enforced, fine‑grained dynamics (e.g., cloth simulation, soft‑body deformation) are not modeled.
- Future directions: The authors plan to incorporate learned physics simulators for richer dynamics, explore multimodal conditioning (e.g., audio cues), and open‑source a lightweight SDK for easy integration into production pipelines.
Authors
- Divyanshu Daiya
- Aniket Bera
Paper Information
- arXiv ID: 2603.02190v1
- Categories: cs.CV, cs.AI, cs.GR, cs.HC, cs.LG
- Published: March 2, 2026