[Paper] InterleaveThinker: Reinforcing Agentic Interleaved Generation

Published: June 11, 2026 at 01:59 PM EDT

Source: arXiv - 2606.13679v1

Overview

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequences), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard.

In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator’s outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration.

To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k datasets to perform a format cold-start. Then we develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose an accuracy reward and a step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory.

The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.
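The planner, generator, and critic loop described in the abstract can be pictured roughly as follows. This is a minimal sketch, not the authors' implementation: the names `Step`, `Verdict`, `run_interleaved`, the `max_retries` cap, and the toy stand-ins for the three components are all assumptions made for illustration, and the real agents are learned models rather than string functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    kind: str         # "text" or "image"
    instruction: str  # the text itself, or the instruction for the image generator


@dataclass
class Verdict:
    accepted: bool
    refined_instruction: str  # used only when the output is rejected


def run_interleaved(
    prompt: str,
    plan: Callable[[str], List[Step]],
    generate: Callable[[str], str],
    critique: Callable[[str, str], Verdict],
    max_retries: int = 2,  # hypothetical cap; the abstract does not state one
) -> Tuple[List[Tuple[str, str]], int]:
    """Run plan -> generate -> critique -> regenerate; return outputs and generator call count."""
    outputs: List[Tuple[str, str]] = []
    generator_calls = 0
    for step in plan(prompt):
        if step.kind == "text":
            outputs.append(("text", step.instruction))
            continue
        instruction = step.instruction
        image = generate(instruction)
        generator_calls += 1
        for _ in range(max_retries):
            verdict = critique(instruction, image)
            if verdict.accepted:
                break
            # The critic rewrites the instruction and the generator tries again.
            instruction = verdict.refined_instruction
            image = generate(instruction)
            generator_calls += 1
        outputs.append(("image", image))
    return outputs, generator_calls


# Toy stand-ins (assumptions) so the loop can be exercised without any model.
def toy_plan(prompt: str) -> List[Step]:
    return [
        Step("text", f"Step 1 of: {prompt}"),
        Step("image", "draw a cup"),
        Step("text", "Done."),
    ]


def toy_generate(instruction: str) -> str:
    return f"<image:{instruction}>"


def toy_critique(instruction: str, image: str) -> Verdict:
    ok = "red" in instruction
    return Verdict(ok, instruction if ok else instruction + ", red")
```

Calling `run_interleaved("tea", toy_plan, toy_generate, toy_critique)` with these stand-ins triggers one rejection and one regeneration for the image step. The call counter illustrates why a long trajectory with retries can reach many generator calls.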

Key Contributions

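As stated in the abstract, the paper contributes InterleaveThinker, a multi-agent pipeline that adds interleaved text-image generation to an existing image generator through a planner agent and a critic agent. It also contributes the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k sets for a format cold-start, Interleave-Critic-RL-13k for GRPO training of the critic, and an accuracy reward and a step-wise reward that let single-step RL guide a full generation trajectory.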

Methodology

Please refer to the full paper for detailed methodology.
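For orientation, the abstract says the critic is trained with GRPO using an accuracy reward and a step-wise reward at the single-step level. The sketch below shows only the generic group-relative advantage used in GRPO-style training: each sampled response is scored, then normalized against the other samples for the same input. How the paper defines or combines its two rewards is not given in this summary, so `step_reward`, its `weight` parameter, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def step_reward(accuracy: float, step_wise: float, weight: float = 1.0) -> float:
    """Hypothetical combination of the two rewards; the paper's actual form may differ."""
    return accuracy + weight * step_wise


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and spread of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations vary on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled critic responses for one generation step (made-up scores).
rewards = [step_reward(1.0, 0.0), step_reward(0.0, 0.0),
           step_reward(1.0, 0.0), step_reward(0.0, 0.0)]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean get a positive advantage and those below get a negative one. When every sample in a group receives the same reward, all advantages are zero and that group contributes no learning signal.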

Practical Implications

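Because the pipeline wraps an existing image generator rather than replacing it, it offers a route to interleaved text-image generation for the applications the abstract names: visual narratives, guidance, and embodied manipulation.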

Authors

Dian Zheng
Harry Lee
Manyuan Zhang
Kaituo Feng
Zoey Guo
Ray Zhang
Hongsheng Li

Paper Information

arXiv ID: 2606.13679v1
Categories: cs.CV
Published: June 11, 2026
