[Paper] Canvas-to-Image: Compositional Image Generation with Multimodal Controls
Source: arXiv - 2511.21691v1
Overview
While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent.
Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference.
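The composite-canvas encoding is only described at a high level in the abstract, so the following is a minimal sketch of the idea, assuming the heterogeneous controls (subject reference crops, layout boxes with labels, pose keypoints) are rasterized onto one RGB image that conditions the diffusion model alongside the text prompt. The names `CanvasControl` and `compose_canvas`, and the Pillow-based rendering, are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch: render subject references, layout boxes, and pose
# keypoints onto a single composite canvas image. All names here are
# illustrative and not the paper's API.

from dataclasses import dataclass
from typing import Sequence
from PIL import Image, ImageDraw


@dataclass
class CanvasControl:
    """One control element to be placed on the canvas."""
    bbox: tuple[int, int, int, int]                            # (x0, y0, x1, y1) target region
    reference: Image.Image | None = None                       # subject reference crop, if any
    pose_keypoints: Sequence[tuple[int, int]] | None = None    # 2D joint positions, if any
    label: str | None = None                                   # optional layout annotation


def compose_canvas(size: tuple[int, int], controls: Sequence[CanvasControl]) -> Image.Image:
    """Rasterize all heterogeneous controls into one composite canvas image."""
    canvas = Image.new("RGB", size, color=(127, 127, 127))  # neutral background
    draw = ImageDraw.Draw(canvas)
    for c in controls:
        x0, y0, x1, y1 = c.bbox
        if c.reference is not None:
            # Paste the subject reference into its target region.
            canvas.paste(c.reference.resize((x1 - x0, y1 - y0)), (x0, y0))
        # Draw the layout box and its optional label.
        draw.rectangle(c.bbox, outline=(255, 0, 0), width=3)
        if c.label:
            draw.text((x0 + 4, y0 + 4), c.label, fill=(255, 255, 0))
        if c.pose_keypoints:
            # Mark pose joints as small circles inside the region.
            for px, py in c.pose_keypoints:
                draw.ellipse((px - 3, py - 3, px + 3, py + 3), fill=(0, 255, 0))
    return canvas


# Usage (hypothetical): the composite canvas would be passed to the diffusion
# model together with the text prompt, e.g. pipeline(prompt, canvas_image=canvas).
```

The point of the sketch is only that all control modalities end up in a single image the model can "read", which is what lets one network be trained jointly over many control types instead of attaching a separate branch per modality.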
Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.
Authors
- Yusuf Dalva
- Guocheng Gordon Qian
- Maya Goldenberg
- Tsai‑Shien Chen
- Kfir Aberman
- Sergey Tulyakov
- Pinar Yanardag
- Kuan‑Chieh Jackson Wang
Paper Information
- arXiv ID: 2511.21691v1
- Categories: cs.CV
- Published: November 27, 2025
- PDF: https://arxiv.org/pdf/2511.21691v1