[Paper] Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation

Published: March 2, 2026 at 01:52 PM EST
4 min read
Source: arXiv


Overview

Sketch2Colab is a new framework that turns simple 2‑D storyboard sketches into realistic, object‑aware 3‑D multi‑human animations. By combining a sketch‑driven diffusion prior with a fast “rectified‑flow” student model, it lets developers generate coordinated human‑object interactions while keeping tight control over agents, joints, timing, and contacts.

Key Contributions

  • Sketch‑conditioned motion generation – Directly maps storyboard‑style sketches to 3‑D multi‑human motion without needing large amounts of paired motion data.
  • Two‑stage diffusion‑to‑flow distillation – Learns a diffusion prior, then distills it into a latent‑space rectified‑flow student for rapid, stable sampling.
  • Differentiable constraint energies – Integrates keyframe, trajectory, and physics‑based losses that steer the flow field to satisfy rich interaction constraints.
  • CTMC‑based event planner – Introduces a continuous‑time Markov chain that schedules discrete events (touches, grasps, handoffs) to produce crisp, well‑phased collaborations.
  • State‑of‑the‑art adherence & speed – Demonstrates higher constraint satisfaction and perceptual quality than diffusion‑only baselines while cutting inference time by an order of magnitude.
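To make the CTMC event-planner idea concrete, a continuous-time Markov chain over discrete interaction events can be simulated Gillespie-style: each event state has an exponentially distributed holding time, followed by a jump to a successor event. The event names and rate matrix below are hand-written toys for illustration; the paper's planner learns its CTMC parameters.

```python
import numpy as np

# Hand-written toy event set and rate matrix; illustrative only.
EVENTS = ["approach", "grasp", "handoff", "release"]

# Generator matrix Q: off-diagonal q[i, j] >= 0 is the rate of jumping
# from event i to event j; each row sums to zero.
Q = np.array([
    [-1.0,  0.8,  0.1, 0.1],
    [ 0.0, -0.5,  0.4, 0.1],
    [ 0.0,  0.0, -0.6, 0.6],
    [ 0.0,  0.0,  0.0, 0.0],   # "release" is absorbing
])

def sample_event_schedule(q, horizon, start=0, seed=0):
    """Gillespie-style CTMC simulation: the holding time in state i is
    Exponential(-q[i, i]); the successor event is drawn from the jump
    probabilities q[i, j] / (-q[i, i])."""
    rng = np.random.default_rng(seed)
    t, state = 0.0, start
    schedule = [(t, EVENTS[state])]
    while True:
        rate = -q[state, state]
        if rate <= 0:          # absorbing state: no further events
            break
        t += rng.exponential(1.0 / rate)
        if t >= horizon:       # discard events past the clip horizon
            break
        jump_probs = np.clip(q[state], 0.0, None) / rate
        state = rng.choice(len(EVENTS), p=jump_probs)
        schedule.append((t, EVENTS[state]))
    return schedule

# Prints the sampled (time, event) schedule, starting with (0.0, 'approach')
print(sample_event_schedule(Q, horizon=5.0))
```

Such a sampled schedule ("hand-over at t=1.2 s", etc.) is what then modulates the flow dynamics in the method's pipeline.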

Methodology

  1. Sketch‑driven diffusion prior – A conditional diffusion model is trained to predict latent motion representations from 2‑D sketches. The model learns the distribution of plausible multi‑human motions that respect the sketch’s spatial layout.

  2. Rectified‑flow distillation – The diffusion prior is distilled into a deterministic flow model that operates in the same latent space. This “student” model learns a transport map that can generate samples in a single forward pass, dramatically speeding up inference.

  3. Energy‑guided transport – Differentiable energy terms encode:

    • Keyframe constraints (specific joint positions at given times)
    • Trajectory constraints (desired paths for hands or objects)
    • Physics constraints (collision avoidance, ground contact)
      These energies are back‑propagated into the flow field, nudging generated motions toward satisfying the storyboard.

  4. CTMC event planner – A continuous‑time Markov chain models discrete interaction events. The planner samples a sequence of event times and types (e.g., “hand‑over at t=1.2 s”), which modulate the flow dynamics, ensuring that multi‑agent actions are temporally aligned and physically plausible.

  5. Latent‑space decoding – The final latent motion is decoded into full 3‑D skeletal trajectories and object poses, ready for rendering or downstream simulation.
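The core of steps 2–3 can be sketched in a few lines: Euler-integrate the learned flow velocity from noise (t=0) to a motion latent (t=1), subtracting the gradient of a differentiable constraint energy at each step. The linear `velocity` stand-in, the quadratic keyframe energy, the latent dimension, and the guidance weight below are all illustrative assumptions, not the paper's actual implementation (which backpropagates through learned networks and richer energies).

```python
import numpy as np

LATENT_DIM = 8
rng = np.random.default_rng(0)
W = 0.05 * rng.standard_normal((LATENT_DIM, LATENT_DIM))  # toy weights

def velocity(x, t):
    """Stand-in for the distilled rectified-flow velocity v_theta(x, t)."""
    return x @ W.T - t * x

def keyframe_energy_grad(x, target, mask):
    """Gradient of E(x) = 0.5 * ||mask * (x - target)||^2; the paper
    obtains such gradients by autodiff through richer keyframe,
    trajectory, and physics energies."""
    return mask * (x - target)

def sample(x0, target, mask, steps=8, guidance=0.5):
    """Euler-integrate dx/dt = v(x, t) - guidance * grad E(x) from
    t = 0 (noise) to t = 1 (motion latent)."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        drift = velocity(x, t) - guidance * keyframe_energy_grad(x, target, mask)
        x = x + dt * drift
    return x

# Constrain the first three latent coordinates toward zero.
x0 = np.ones(LATENT_DIM)
target = np.zeros(LATENT_DIM)
mask = np.zeros(LATENT_DIM)
mask[:3] = 1.0
latent = sample(x0, target, mask, guidance=5.0)
```

With few integration steps, this kind of deterministic transport is what yields the order-of-magnitude speedup over iterative diffusion sampling, while the energy gradient keeps the sample near the constraint set.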

Results & Findings

  • Constraint adherence: On the CORE4D and InterHuman benchmarks, Sketch2Colab reduced keyframe error by ~35 % and increased contact‑accuracy (e.g., hand‑object touches) by ~28 % compared to diffusion‑only baselines.
  • Perceptual quality: Human evaluators rated the generated animations roughly 1.2× higher in realism than baseline outputs on a 5‑point Likert scale.
  • Inference speed: The rectified‑flow student generates a 5‑second multi‑human clip in ~120 ms on a single RTX 3090, versus ~1.5 s for the diffusion baseline.
  • Robustness to multi‑entity conditioning: Even with 4 interacting agents and multiple objects, the system maintains stable sampling without mode collapse, a common failure mode in pure diffusion models.

Practical Implications

  • Rapid prototyping for games and VR/AR: Designers can sketch a quick storyboard and instantly obtain a physically plausible multi‑character animation, cutting iteration cycles dramatically.
  • Automated content generation pipelines: Studios can feed large libraries of 2‑D concept art into Sketch2Colab to bootstrap motion capture data, reducing reliance on expensive mocap sessions.
  • Interactive robotics simulation: The CTMC planner’s event‑level control can be repurposed for simulating collaborative robot‑human tasks, where precise timing of handovers and grasps matters.
  • AI‑assisted animation tools: Integration into existing tools (e.g., Blender, Unity) as a plug‑in would let artists refine sketches, adjust constraint weights, and instantly preview 3‑D motion.

Limitations & Future Work

  • Sketch quality dependence: Extremely abstract or ambiguous sketches can yield under‑constrained motion hypotheses; the system currently assumes reasonably clear spatial cues.
  • Fixed body topology: The model is trained on standard human skeletons; extending to non‑human avatars or highly stylized rigs would require additional data.
  • Physics realism: While basic contact and collision constraints are enforced, fine‑grained dynamics (e.g., cloth simulation, soft‑body deformation) are not modeled.
  • Future directions: The authors plan to incorporate learned physics simulators for richer dynamics, explore multimodal conditioning (e.g., audio cues), and open‑source a lightweight SDK for easy integration into production pipelines.

Authors

  • Divyanshu Daiya
  • Aniket Bera

Paper Information

  • arXiv ID: 2603.02190v1
  • Categories: cs.CV, cs.AI, cs.GR, cs.HC, cs.LG
  • Published: March 2, 2026