[Paper] CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation

Published: March 9, 2026 at 01:31 PM EDT
5 min read
Source: arXiv

Overview

The paper introduces CoCo (Code-as-CoT), a novel framework that treats the reasoning step in text‑to‑image (T2I) generation as executable code rather than a free‑form natural‑language plan. By first producing a deterministic “draft” image from the generated code and then refining it, CoCo dramatically improves the fidelity of complex scenes, structured layouts, and long textual descriptions—areas where existing chain‑of‑thought (CoT) approaches struggle.

Key Contributions

  • Code‑driven reasoning: Transforms the CoT planning stage into a program that can be run in a sandbox, yielding a concrete visual draft.
  • Two‑stage generation pipeline: (1) Draft creation from code, (2) Fine‑grained image editing to reach high‑quality final output.
  • CoCo‑10K dataset: 10,000 curated pairs of structured draft images and their refined counterparts, enabling supervised learning of both drafting and correction.
  • Strong empirical gains: Achieves +68.8 % on StructT2IBench, +54.8 % on OneIG‑Bench, and +41.2 % on LongText‑Bench compared with direct generation, and outperforms other CoT‑augmented methods.
  • Open‑source release: Code, model checkpoints, and the dataset are publicly available, encouraging reproducibility and downstream extensions.

Methodology

  1. Prompt → Code Generation

    • A large multimodal model receives the natural‑language prompt and outputs a short script (e.g., in a domain‑specific language that describes object positions, sizes, colors, and relationships).
    • The script is deliberately deterministic: running it always yields the same layout, removing ambiguity inherent in pure text plans.
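The paper's actual DSL is not reproduced in this summary; as a minimal sketch, a deterministic layout script of the kind described might look like the following, where `place` and the `scene` record format are hypothetical names invented for illustration:

```python
# Hypothetical layout-DSL sketch. Each call records one object with a fixed
# normalized position, size, and color, so re-running the script always
# yields the same layout -- the determinism the draft stage relies on.

scene = []

def place(obj, x, y, w, h, color):
    """Record an object placement in normalized [0, 1] coordinates."""
    scene.append({"obj": obj, "x": x, "y": y, "w": w, "h": h, "color": color})

# Prompt: "A red cube on a blue table, with a green mug to its right."
place("table", x=0.1, y=0.6,  w=0.8, h=0.3,  color="blue")
place("cube",  x=0.3, y=0.4,  w=0.2, h=0.2,  color="red")
place("mug",   x=0.6, y=0.45, w=0.1, h=0.15, color="green")
```

Because the plan is ordinary code rather than free-form text, object counts and spatial relations are explicit and machine-checkable before any pixels are generated.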
  2. Sandbox Execution → Draft Image

    • The generated script is executed in an isolated environment that renders a low‑resolution, structurally accurate draft.
    • Because the code is executable, developers can inspect, debug, or even manually edit the plan before rendering.
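The rendering step can be sketched with a toy rasterizer, assuming the hypothetical one-dict-per-object scene format above; the paper's renderer is not specified here:

```python
# Minimal sketch of rendering a deterministic low-resolution draft from a
# layout script. Each object becomes a colored block on a small grid,
# tagged by the first letter of its color.

def render_draft(scene, width=32, height=32):
    """Rasterize normalized-coordinate objects onto a width x height grid."""
    grid = [["." for _ in range(width)] for _ in range(height)]
    for item in scene:
        x0 = int(item["x"] * width)
        y0 = int(item["y"] * height)
        x1 = min(width, x0 + max(1, int(item["w"] * width)))
        y1 = min(height, y0 + max(1, int(item["h"] * height)))
        for r in range(y0, y1):
            for c in range(x0, x1):
                grid[r][c] = item["color"][0]
    return grid

scene = [
    {"obj": "table", "x": 0.1, "y": 0.6, "w": 0.8, "h": 0.3, "color": "blue"},
    {"obj": "cube",  "x": 0.3, "y": 0.4, "w": 0.2, "h": 0.2, "color": "red"},
]
draft = render_draft(scene)
```

Running the same script twice produces an identical grid, which is the property that makes the draft inspectable and debuggable before refinement.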
  3. Draft → Refined Image

    • A second model (or a diffusion‑based editor) takes the draft and the original prompt to perform fine‑grained edits: adding textures, lighting, details, and correcting any mismatches.
    • This stage is trained on the CoCo‑10K pairs, teaching the system how to transform a rough layout into a photorealistic result.
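The diffusion-based editor itself is beyond a short sketch, but the mismatch check that feeds it can be illustrated, again assuming the hypothetical scene format; `plan_edits` is an invented helper, not an API from the paper:

```python
# Hedged sketch of the correction idea: compare the draft's object list
# against what the prompt requires, and emit add/remove instructions for
# the downstream editor to act on.

def plan_edits(draft_objects, required):
    """Return which objects the refiner must add or remove."""
    have = {o["obj"] for o in draft_objects}
    need = set(required)
    return {"add": sorted(need - have), "remove": sorted(have - need)}

draft_objects = [{"obj": "table"}, {"obj": "cube"}]
edits = plan_edits(draft_objects, required=["table", "cube", "mug"])
```

Separating "what is wrong" (a symbolic diff) from "how to fix it" (the learned editor) is what lets the refinement stage focus on textures, lighting, and detail rather than global structure.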
  4. Training Regime

    • The pipeline is trained end‑to‑end with supervised losses on both the code generation (teacher‑forced from ground‑truth scripts) and the image refinement (pixel‑wise and perceptual losses).
    • Curriculum learning is used: early epochs focus on simple scenes, later epochs on complex, long‑form prompts.
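The curriculum schedule can be sketched as follows, assuming prompt length as a proxy for scene complexity (the paper's exact difficulty criterion is not given in this summary):

```python
# Minimal curriculum-ordering sketch: rank examples by a complexity proxy
# (here, prompt word count) and grow the training pool each epoch, so early
# epochs see only simple scenes and later epochs see everything.

def curriculum_batches(examples, epochs=3):
    """Yield (epoch, pool) pairs with a pool that expands over epochs."""
    ranked = sorted(examples, key=lambda ex: len(ex["prompt"].split()))
    for epoch in range(1, epochs + 1):
        cutoff = len(ranked) * epoch // epochs  # admit more each epoch
        yield epoch, ranked[:cutoff]

data = [
    {"prompt": "a cat"},
    {"prompt": "a red cube on a blue table beside a green mug"},
    {"prompt": "two dogs in a park"},
]
schedule = list(curriculum_batches(data))
```

By the final epoch the pool contains every example, so the model is eventually trained on the full distribution of complex, long-form prompts.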

Results & Findings

| Benchmark | Metric | Direct Generation | CoCo (this work) | Relative Gain |
| --- | --- | --- | --- | --- |
| StructT2IBench | Layout‑F1 ↑ | 0.42 | 0.71 | +68.8 % |
| OneIG‑Bench | Image Quality (FID ↓) | 45.3 | 20.5 | +54.8 % |
| LongText‑Bench | Text‑Image Alignment (CLIP‑Score ↑) | 0.31 | 0.44 | +41.2 % |
  • Precision: The draft stage already captures object counts and spatial relations with >90 % accuracy.
  • Robustness: When prompts contain rare or novel concepts, the code‑based plan prevents “hallucinations” that plague pure diffusion models.
  • Speed: Generating the draft is lightweight (≈0.2 s on a single GPU), and the refinement adds only a modest overhead compared with a single‑pass diffusion run.

Practical Implications

  • Design Tools: UI/UX or game‑level designers can script high‑level layouts in natural language, get an instant draft, and then iteratively refine—much faster than hand‑drawing or tweaking diffusion parameters.
  • Content Generation for Marketing: Brands needing precise placement of logos, product shots, or text overlays can rely on the deterministic draft to guarantee compliance before polishing.
  • Assistive Coding: Developers building multimodal assistants can expose the intermediate code to users, enabling “debug‑by‑example” where a user edits the generated script to correct a mis‑placed object.
  • Rare Concept Synthesis: Researchers and artists working with obscure entities (e.g., extinct species, custom inventions) gain a reliable pipeline that respects the exact semantics of the prompt.
  • Compliance & Auditing: Because the reasoning is represented as executable code, organizations can audit the generation process for bias or policy violations, a step forward for responsible AI deployment.
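The "debug‑by‑example" idea can be made concrete with the same hypothetical scene format used earlier; `move` is an invented helper showing how a user-level edit to the generated script propagates to the next render:

```python
# "Debug-by-example" sketch: because the plan is code, a user can edit it
# directly instead of re-prompting. Here a misplaced logo is nudged before
# the draft is re-rendered.

scene = [
    {"obj": "logo", "x": 0.7, "y": 0.1, "w": 0.2, "h": 0.1, "color": "black"},
]

def move(scene, obj, dx=0.0, dy=0.0):
    """Shift the named object; re-running the script yields the fixed draft."""
    for item in scene:
        if item["obj"] == obj:
            item["x"] += dx
            item["y"] += dy
    return scene

move(scene, "logo", dx=-0.6)  # user drags the logo toward the top-left
```

This tight edit-and-rerender loop is what distinguishes a code plan from a natural-language one: the correction is a one-line change rather than a fresh sampling run.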

Limitations & Future Work

  • Domain‑Specific Language (DSL) Overhead: The current code format is tailored to the training data; extending it to new visual primitives (e.g., 3‑D depth cues) requires additional DSL design.
  • Scalability to Ultra‑High Resolutions: The refinement stage still relies on diffusion models that become costly at >1024 px resolutions.
  • Generalization to Unseen Styles: While CoCo handles layout well, stylistic nuances (e.g., impressionist brushwork) are less controlled by the code and depend on the editor model.
  • Future Directions: The authors suggest integrating symbolic reasoning (e.g., scene graphs) into the code, exploring hierarchical drafting (coarse → fine), and coupling the pipeline with interactive GUIs for real‑time user edits.

Authors

  • Haodong Li
  • Chunmei Qing
  • Huanyu Zhang
  • Dongzhi Jiang
  • Yihang Zou
  • Hongbo Peng
  • Dingming Li
  • Yuhong Dai
  • ZePeng Lin
  • Juanxi Tian
  • Yi Zhou
  • Siqi Dai
  • Jingwei Wu

Paper Information

  • arXiv ID: 2603.08652v1
  • Categories: cs.AI
  • Published: March 9, 2026
  • PDF: Download PDF