[Paper] CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation

Published: March 9, 2026 at 01:31 PM EDT
5 min read
Source: arXiv

Overview

The paper introduces CoCo (Code-as-CoT), a novel framework that treats the reasoning step in text‑to‑image (T2I) generation as executable code rather than a free‑form natural‑language plan. By first producing a deterministic “draft” image from the generated code and then refining it, CoCo dramatically improves the fidelity of complex scenes, structured layouts, and long textual descriptions—areas where existing chain‑of‑thought (CoT) approaches struggle.

Key Contributions

  • Code‑driven reasoning: Transforms the CoT planning stage into a program that can be run in a sandbox, yielding a concrete visual draft.
  • Two‑stage generation pipeline: (1) Draft creation from code, (2) Fine‑grained image editing to reach high‑quality final output.
  • CoCo‑10K dataset: 10,000 curated pairs of structured draft images and their refined counterparts, enabling supervised learning of both drafting and correction.
  • Strong empirical gains: Achieves +68.8 % on StructT2IBench, +54.8 % on OneIG‑Bench, and +41.2 % on LongText‑Bench compared with direct generation, and outperforms other CoT‑augmented methods.
  • Open‑source release: Code, model checkpoints, and the dataset are publicly available, encouraging reproducibility and downstream extensions.

Methodology

  1. Prompt → Code Generation

    • A large multimodal model receives the natural‑language prompt and outputs a short script (e.g., in a domain‑specific language that describes object positions, sizes, colors, and relationships).
    • The script is deliberately deterministic: running it always yields the same layout, removing ambiguity inherent in pure text plans.
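The paper's actual DSL is not reproduced in this summary; as a minimal sketch, a deterministic layout script of the kind described might look like the following, where `place` and the `scene` record format are hypothetical names invented for illustration:

```python
# Hypothetical layout-DSL sketch. Each call records one object with a fixed
# normalized position, size, and color, so re-running the script always
# yields the same layout -- the determinism the draft stage relies on.

scene = []

def place(obj, x, y, w, h, color):
    """Record an object placement in normalized [0, 1] coordinates."""
    scene.append({"obj": obj, "x": x, "y": y, "w": w, "h": h, "color": color})

# Prompt: "A red cube on a blue table, with a green mug to its right."
place("table", x=0.1, y=0.6,  w=0.8, h=0.3,  color="blue")
place("cube",  x=0.3, y=0.4,  w=0.2, h=0.2,  color="red")
place("mug",   x=0.6, y=0.45, w=0.1, h=0.15, color="green")
```

Because the plan is ordinary code rather than free-form text, object counts and spatial relations are explicit and machine-checkable before any pixels are generated.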
  2. Sandbox Execution → Draft Image

    • The generated script is executed in an isolated environment that renders a low‑resolution, structurally accurate draft.
    • Because the code is executable, developers can inspect, debug, or even manually edit the plan before rendering.
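The rendering step can be sketched with a toy rasterizer, assuming the hypothetical one-dict-per-object scene format above; the paper's renderer is not specified here:

```python
# Minimal sketch of rendering a deterministic low-resolution draft from a
# layout script. Each object becomes a colored block on a small grid,
# tagged by the first letter of its color.

def render_draft(scene, width=32, height=32):
    """Rasterize normalized-coordinate objects onto a width x height grid."""
    grid = [["." for _ in range(width)] for _ in range(height)]
    for item in scene:
        x0 = int(item["x"] * width)
        y0 = int(item["y"] * height)
        x1 = min(width, x0 + max(1, int(item["w"] * width)))
        y1 = min(height, y0 + max(1, int(item["h"] * height)))
        for r in range(y0, y1):
            for c in range(x0, x1):
                grid[r][c] = item["color"][0]
    return grid

scene = [
    {"obj": "table", "x": 0.1, "y": 0.6, "w": 0.8, "h": 0.3, "color": "blue"},
    {"obj": "cube",  "x": 0.3, "y": 0.4, "w": 0.2, "h": 0.2, "color": "red"},
]
draft = render_draft(scene)
```

Running the same script twice produces an identical grid, which is the property that makes the draft inspectable and debuggable before refinement.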
  3. Draft → Refined Image

    • A second model (or a diffusion‑based editor) takes the draft and the original prompt to perform fine‑grained edits: adding textures, lighting, details, and correcting any mismatches.
    • This stage is trained on the CoCo‑10K pairs, teaching the system how to transform a rough layout into a photorealistic result.
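The diffusion-based editor itself is beyond a short sketch, but the mismatch check that feeds it can be illustrated, again assuming the hypothetical scene format; `plan_edits` is an invented helper, not an API from the paper:

```python
# Hedged sketch of the correction idea: compare the draft's object list
# against what the prompt requires, and emit add/remove instructions for
# the downstream editor to act on.

def plan_edits(draft_objects, required):
    """Return which objects the refiner must add or remove."""
    have = {o["obj"] for o in draft_objects}
    need = set(required)
    return {"add": sorted(need - have), "remove": sorted(have - need)}

draft_objects = [{"obj": "table"}, {"obj": "cube"}]
edits = plan_edits(draft_objects, required=["table", "cube", "mug"])
```

Separating "what is wrong" (a symbolic diff) from "how to fix it" (the learned editor) is what lets the refinement stage focus on textures, lighting, and detail rather than global structure.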
  4. Training Regime

    • The pipeline is trained end‑to‑end with supervised losses on both the code generation (teacher‑forced from ground‑truth scripts) and the image refinement (pixel‑wise and perceptual losses).
    • Curriculum learning is used: early epochs focus on simple scenes, later epochs on complex, long‑form prompts.
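The curriculum schedule can be sketched as follows, assuming prompt length as a proxy for scene complexity (the paper's exact difficulty criterion is not given in this summary):

```python
# Minimal curriculum-ordering sketch: rank examples by a complexity proxy
# (here, prompt word count) and grow the training pool each epoch, so early
# epochs see only simple scenes and later epochs see everything.

def curriculum_batches(examples, epochs=3):
    """Yield (epoch, pool) pairs with a pool that expands over epochs."""
    ranked = sorted(examples, key=lambda ex: len(ex["prompt"].split()))
    for epoch in range(1, epochs + 1):
        cutoff = len(ranked) * epoch // epochs  # admit more each epoch
        yield epoch, ranked[:cutoff]

data = [
    {"prompt": "a cat"},
    {"prompt": "a red cube on a blue table beside a green mug"},
    {"prompt": "two dogs in a park"},
]
schedule = list(curriculum_batches(data))
```

By the final epoch the pool contains every example, so the model is eventually trained on the full distribution of complex, long-form prompts.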

Results & Findings

| Benchmark | Metric | Direct Generation | CoCo (this work) | Relative Gain |
| --- | --- | --- | --- | --- |
| StructT2IBench | Layout‑F1 ↑ | 0.42 | 0.71 | +68.8 % |
| OneIG‑Bench | Image Quality (FID ↓) | 45.3 | 20.5 | +54.8 % |
| LongText‑Bench | Text‑Image Alignment (CLIP‑Score ↑) | 0.31 | 0.44 | +41.2 % |
  • Precision: The draft stage already captures object counts and spatial relations with >90 % accuracy.
  • Robustness: When prompts contain rare or novel concepts, the code‑based plan prevents “hallucinations” that plague pure diffusion models.
  • Speed: Generating the draft is lightweight (≈0.2 s on a single GPU), and the refinement adds only a modest overhead compared with a single‑pass diffusion run.

Practical Implications

  • Design Tools: UI/UX or game‑level designers can script high‑level layouts in natural language, get an instant draft, and then iteratively refine—much faster than hand‑drawing or tweaking diffusion parameters.
  • Content Generation for Marketing: Brands needing precise placement of logos, product shots, or text overlays can rely on the deterministic draft to guarantee compliance before polishing.
  • Assistive Coding: Developers building multimodal assistants can expose the intermediate code to users, enabling “debug‑by‑example” where a user edits the generated script to correct a mis‑placed object.
  • Rare Concept Synthesis: Researchers and artists working with obscure entities (e.g., extinct species, custom inventions) gain a reliable pipeline that respects the exact semantics of the prompt.
  • Compliance & Auditing: Because the reasoning is represented as executable code, organizations can audit the generation process for bias or policy violations, a step forward for responsible AI deployment.
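The "debug‑by‑example" idea can be made concrete with the same hypothetical scene format used earlier; `move` is an invented helper showing how a user-level edit to the generated script propagates to the next render:

```python
# "Debug-by-example" sketch: because the plan is code, a user can edit it
# directly instead of re-prompting. Here a misplaced logo is nudged before
# the draft is re-rendered.

scene = [
    {"obj": "logo", "x": 0.7, "y": 0.1, "w": 0.2, "h": 0.1, "color": "black"},
]

def move(scene, obj, dx=0.0, dy=0.0):
    """Shift the named object; re-running the script yields the fixed draft."""
    for item in scene:
        if item["obj"] == obj:
            item["x"] += dx
            item["y"] += dy
    return scene

move(scene, "logo", dx=-0.6)  # user drags the logo toward the top-left
```

This tight edit-and-rerender loop is what distinguishes a code plan from a natural-language one: the correction is a one-line change rather than a fresh sampling run.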

Limitations & Future Work

  • Domain‑Specific Language (DSL) Overhead: The current code format is tailored to the training data; extending it to new visual primitives (e.g., 3‑D depth cues) requires additional DSL design.
  • Scalability to Ultra‑High Resolutions: The refinement stage still relies on diffusion models that become costly at >1024 px resolutions.
  • Generalization to Unseen Styles: While CoCo handles layout well, stylistic nuances (e.g., impressionist brushwork) are less controlled by the code and depend on the editor model.
  • Future Directions: The authors suggest integrating symbolic reasoning (e.g., scene graphs) into the code, exploring hierarchical drafting (coarse → fine), and coupling the pipeline with interactive GUIs for real‑time user edits.

Authors

  • Haodong Li
  • Chunmei Qing
  • Huanyu Zhang
  • Dongzhi Jiang
  • Yihang Zou
  • Hongbo Peng
  • Dingming Li
  • Yuhong Dai
  • ZePeng Lin
  • Juanxi Tian
  • Yi Zhou
  • Siqi Dai
  • Jingwei Wu

Paper Information

  • arXiv ID: 2603.08652v1
  • Categories: cs.AI
  • Published: March 9, 2026
  • PDF: Download PDF