[Paper] VideoSketcher: Video Models Prior Enable Versatile Sequential Sketch Generation

Published: February 17, 2026, 01:55 PM EST
4 min read
Source: arXiv


Overview

The paper VideoSketcher introduces a novel way to generate sketches as sequences of strokes rather than static images. By repurposing pretrained text‑to‑video diffusion models, the authors can synthesize realistic drawing processes that follow user‑specified stroke orders, opening the door to more interactive and controllable sketch‑generation tools.

Key Contributions

  • Sequential sketch generation using video diffusion models, treating a sketch as a short video where each frame adds new strokes.
  • Two‑stage fine‑tuning that first learns stroke ordering from synthetic shape compositions, then learns visual appearance from only seven human‑drawn sketch videos.
  • LLM‑driven semantic planning: large language models provide natural‑language instructions that dictate the order of strokes.
  • Extensible control mechanisms, including brush‑style conditioning and autoregressive generation for collaborative drawing scenarios.
  • Data‑efficiency: achieves high‑quality results with a fraction of the data normally required for video generation models.
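
The first bullet's frame‑wise view of a sketch can be made concrete with a toy example. The snippet below is an illustrative sketch, not the authors' code: it assumes strokes arrive as ordered lists of pixel coordinates and renders them cumulatively onto a white canvas, so frame t shows strokes 0 through t.

```python
import numpy as np

def strokes_to_frames(strokes, canvas_size=(64, 64)):
    """Render an ordered list of strokes as a short 'video':
    frame t shows the blank canvas plus strokes 0..t."""
    h, w = canvas_size
    canvas = np.ones((h, w), dtype=np.float32)  # white canvas
    frames = [canvas.copy()]                    # frame 0: blank
    for stroke in strokes:                      # stroke: list of (row, col) points
        for r, c in stroke:
            canvas[r, c] = 0.0                  # draw the point in black
        frames.append(canvas.copy())            # snapshot after each stroke
    return np.stack(frames)                     # shape (num_strokes + 1, H, W)

# Two toy strokes: a horizontal line, then a vertical line
strokes = [[(32, c) for c in range(10, 54)],
           [(r, 32) for r in range(10, 54)]]
video = strokes_to_frames(strokes)
print(video.shape)  # (3, 64, 64)
```

A real pipeline would rasterize vector strokes with anti‑aliasing and feed the frames to the diffusion model, but the cumulative‑frame structure is the point here.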

Methodology

  1. Representing sketches as videos – Each sketch is encoded as a sequence of frames, starting from a blank canvas and progressively adding strokes.
  2. Leveraging pretrained models – A text‑to‑video diffusion model (trained on large video corpora) serves as a powerful renderer that can produce temporally coherent frames.
  3. Two‑stage fine‑tuning
    • Stage 1 (ordering): Synthetic datasets of simple geometric shapes are created with known stroke orders. The model learns to map textual ordering cues (e.g., “draw the circle first, then the square”) to the correct temporal progression.
    • Stage 2 (appearance): Just seven real sketching videos are used to teach the model the visual style of hand‑drawn strokes, including line thickness, shading, and subtle jitter.
  4. LLM integration – An LLM parses user prompts and produces an ordered list of drawing instructions, which are fed to the diffusion model as conditioning tokens.
  5. Extensions – Brush‑style tokens and an autoregressive loop allow the system to change pen attributes on the fly or to let a second agent continue a partially drawn sketch.
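
Steps 4 and 5 can be pictured with a small example. Everything below is hypothetical: the plan format, the `plan_to_prompts` helper, and the bracketed brush token are stand‑ins for whatever interface the paper actually uses. The idea it illustrates is that each autoregressive segment is conditioned on a brush token plus the cumulative instruction history, so later segments stay consistent with what is already on the canvas.

```python
# Hypothetical ordered plan, as the LLM planning stage might emit.
plan = [
    {"step": 2, "instruction": "add the ears"},
    {"step": 1, "instruction": "draw the outline of the cat's head"},
    {"step": 3, "instruction": "fill in the shading"},
]

def plan_to_prompts(plan, brush="pencil"):
    """Turn an ordered plan into one conditioning string per segment.
    Each segment's prompt carries the full instruction history so far."""
    prompts, history = [], []
    for item in sorted(plan, key=lambda x: x["step"]):
        history.append(item["instruction"])
        prompts.append(f"[brush: {brush}] " + "; then ".join(history))
    return prompts

prompts = plan_to_prompts(plan)
print(prompts[1])
# [brush: pencil] draw the outline of the cat's head; then add the ears
```

Swapping the `brush` argument between segments mimics the paper's mid‑generation brush‑style switching; feeding a partially drawn canvas plus the remaining prompts mimics the collaborative continuation scenario.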

Results & Findings

  • High‑fidelity stroke sequences: Generated videos show smooth, temporally consistent stroke addition that matches the prescribed order in over 90% of test prompts.
  • Visual realism: Despite training on only a few human sketches, the output captures the nuanced texture of hand‑drawn lines (e.g., pressure variation, slight wobble).
  • Robustness to diverse prompts: The system handles complex instructions like “first sketch the outline, then fill in the shading” and respects the hierarchy of components.
  • Control flexibility: Users can switch brush styles mid‑generation or ask the model to continue a partially completed drawing, demonstrating interactive potential.

Practical Implications

  • Design prototyping tools – UI/UX designers could generate step‑by‑step sketch drafts from textual concepts, speeding up ideation.
  • Educational software – Interactive tutorials that reveal drawing order for calligraphy, technical illustration, or art classes.
  • Creative AI assistants – Artists can issue high‑level commands (“draw a cat, start with the head”) and receive a live sketching process they can edit or augment.
  • Game development – Procedurally generate hand‑drawn assets (e.g., storyboards, concept art) that evolve over time, adding a dynamic visual flair.
  • Collaborative drawing platforms – Multiple users can contribute to a shared sketch, with the model ensuring smooth temporal integration of each participant’s strokes.

Limitations & Future Work

  • Data scarcity – While impressive, the model’s visual style is tied to the limited human sketch videos used for fine‑tuning; broader style diversity may require more annotated data.
  • Complex scenes – Current experiments focus on relatively simple compositions; scaling to intricate, multi‑object scenes could challenge the ordering module.
  • Real‑time performance – Diffusion models are computationally heavy; achieving low‑latency interactive drawing remains an engineering hurdle.
  • User intent ambiguity – The LLM’s translation of natural language to precise stroke order can sometimes misinterpret vague prompts; future work may incorporate clarification dialogs.

VideoSketcher demonstrates that marrying large‑scale video diffusion models with language‑driven planning can unlock a new class of generative tools that respect the temporal nature of drawing—an exciting step toward more expressive, controllable AI‑assisted creativity.

Authors

  • Hui Ren
  • Yuval Alaluf
  • Omer Bar Tal
  • Alexander Schwing
  • Antonio Torralba
  • Yael Vinker

Paper Information

  • arXiv ID: 2602.15819v1
  • Categories: cs.CV
  • Published: February 17, 2026