[Paper] Canvas-to-Image: Compositional Image Generation with Multimodal Controls

Published: November 26, 2025 at 01:59 PM EST
1 min read
Source: arXiv - 2511.21691v1

Overview

While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent.

Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual‑spatial reasoning. We further curate a suite of multi‑task datasets and propose a Multi‑Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text‑to‑image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task‑specific heuristics, and it generalizes well to multi‑control scenarios during inference.
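
To make the canvas idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of how subject reference crops, layout boxes, and pose keypoints might be rasterized into a single composite canvas image using Pillow. The function name `build_canvas`, its arguments, and the color conventions are assumptions for illustration only.

```python
# Illustrative sketch (assumed interface, not the paper's implementation):
# composite heterogeneous controls -- subject reference crops, bounding-box
# layout, and pose keypoints -- into one "canvas" image that a diffusion
# model could consume as a single conditioning input.
from PIL import Image, ImageDraw

def build_canvas(size, subjects, layout_boxes, poses):
    """Compose one RGB canvas from multimodal controls.

    size         -- (width, height) of the target canvas
    subjects     -- list of PIL.Image reference crops, one per subject
    layout_boxes -- list of (x0, y0, x1, y1) boxes, aligned with `subjects`
    poses        -- list of keypoint lists [(x, y), ...], one per subject
    """
    canvas = Image.new("RGB", size, color=(255, 255, 255))
    draw = ImageDraw.Draw(canvas)

    for subject, box, keypoints in zip(subjects, layout_boxes, poses):
        x0, y0, x1, y1 = box
        # Place the subject reference inside its layout box.
        crop = subject.resize((x1 - x0, y1 - y0))
        canvas.paste(crop, (x0, y0))
        # Draw the layout box so the spatial constraint is visible to the model.
        draw.rectangle(box, outline=(255, 0, 0), width=3)
        # Overlay pose keypoints as simple markers.
        for kx, ky in keypoints:
            draw.ellipse((kx - 4, ky - 4, kx + 4, ky + 4), fill=(0, 0, 255))

    # The composite canvas is then paired with the text prompt as conditioning.
    return canvas
```

Rendering all controls into one image keeps the conditioning interface uniform, which is the property the multi-task training strategy exploits: the model learns to read the canvas jointly rather than handling each control type through a separate pathway.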

Extensive experiments show that Canvas-to-Image significantly outperforms state‑of‑the‑art methods in identity preservation and control adherence across challenging benchmarks, including multi‑person composition, pose‑controlled composition, layout‑constrained generation, and multi‑control generation.

Authors

  • Yusuf Dalva
  • Guocheng Gordon Qian
  • Maya Goldenberg
  • Tsai‑Shien Chen
  • Kfir Aberman
  • Sergey Tulyakov
  • Pinar Yanardag
  • Kuan‑Chieh Jackson Wang

Paper Information

  • arXiv ID: 2511.21691v1
  • Categories: cs.CV
  • Published: November 27, 2025
  • PDF: https://arxiv.org/pdf/2511.21691v1