[Paper] DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

Published: February 12, 2026 at 12:44 PM EST
5 min read
Source: arXiv - 2602.12205v1

Overview

DeepGen 1.0 is a 5 billion‑parameter unified multimodal model that can both generate new images from text and edit existing images with fine‑grained control. By introducing a novel alignment architecture and a three‑stage, data‑centric training pipeline, the authors show that a relatively small model can match or beat much larger (10‑100 B‑parameter) systems on a wide range of generation, editing, and reasoning benchmarks.

Key Contributions

  • Stacked Channel Bridging (SCB) – a deep alignment module that fuses hierarchical vision‑language features with learnable “think tokens”, giving the generative backbone structured, reasoning‑rich guidance.
  • Three‑stage training strategy:
    1. Alignment pre‑training on massive image‑text pairs and editing triplets to sync a Vision‑Language Model (VLM) with a Diffusion Transformer (DiT).
    2. Joint supervised fine‑tuning on a curated mix of generation, editing, and visual‑reasoning tasks.
    3. Reinforcement learning with MR‑GRPO (Mixture‑of‑Rewards Guided Reinforcement Policy Optimization) that blends multiple reward signals and supervision to improve quality and human‑preference alignment.
  • Data efficiency – achieves state‑of‑the‑art results while training on only ~50 M samples, an order of magnitude fewer than competing models.
  • Open‑source release of code, pretrained weights, and the curated datasets, lowering the barrier for research and product development.
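The summary above describes SCB only at a high level, so the following is a minimal illustrative sketch rather than the paper's actual implementation: features from several VLM layers are stacked along the channel axis, fused through a learned bridge into the DiT's width, and prefixed with learnable "think tokens". All module and parameter names here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class StackedChannelBridge(nn.Module):
    """Illustrative sketch of an SCB-style alignment module (not the paper's
    exact architecture): stack hierarchical VLM features channel-wise,
    project them to the DiT width, and prepend learnable think tokens."""

    def __init__(self, vlm_dim: int, num_layers: int, dit_dim: int,
                 num_think_tokens: int = 8):
        super().__init__()
        # Channel-wise bridge: fuses the stacked feature hierarchy
        # into the dimensionality expected by the diffusion backbone.
        self.bridge = nn.Linear(vlm_dim * num_layers, dit_dim)
        # Learnable "think tokens" that carry high-level semantic intent.
        self.think_tokens = nn.Parameter(
            torch.randn(num_think_tokens, dit_dim) * 0.02)

    def forward(self, layer_feats: list[torch.Tensor]) -> torch.Tensor:
        # layer_feats: one (batch, seq, vlm_dim) tensor per tapped VLM layer.
        stacked = torch.cat(layer_feats, dim=-1)   # (batch, seq, vlm_dim * L)
        bridged = self.bridge(stacked)             # (batch, seq, dit_dim)
        think = self.think_tokens.expand(bridged.size(0), -1, -1)
        # Output carries both reasoning tokens and fused visual features.
        return torch.cat([think, bridged], dim=1)  # (batch, think + seq, dit_dim)
```

In this sketch the DiT would cross-attend to the returned sequence, receiving coarse intent (think tokens) alongside fine-grained multi-layer features in a single conditioning stream.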

Methodology

  1. Backbone – DeepGen's generative core is a Diffusion Transformer (DiT), considerably lighter than the backbones used by typical large-scale diffusion models.
  2. SCB Alignment Layer – Features from several VLM layers (e.g., CLIP‑style encoders) are stacked and passed through channel‑wise bridges. Learnable “think tokens” are inserted to carry high‑level semantic intent, allowing the diffusion model to receive both coarse concepts and fine details.
  3. Training Pipeline
    • Stage 1: The VLM and DiT are jointly trained on image‑text pairs (for caption grounding) and editing triplets (source image + instruction → target image) to align their latent spaces.
    • Stage 2: A mixed‑task dataset (text‑to‑image, in‑painting, style transfer, visual reasoning) is used for supervised fine‑tuning, encouraging the model to handle any multimodal request.
    • Stage 3: MR‑GRPO applies reinforcement learning where multiple reward models (e.g., CLIP‑based similarity, aesthetic scores, user‑preference logits) guide policy updates. The algorithm balances exploration with stability, preventing common diffusion artifacts.
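Stage 3's reward mixing and group-relative updates can be sketched as follows. This is a hedged approximation of the general GRPO recipe, not MR-GRPO itself (whose exact reward weighting and update rule are not given in this summary); the reward functions and weights below are placeholders.

```python
import numpy as np

def mixture_reward(samples: np.ndarray, reward_fns: list, weights: np.ndarray) -> np.ndarray:
    """Blend several reward signals (e.g., similarity, aesthetics,
    preference logits) into one scalar reward per sample."""
    # (num_rewards, num_samples): each row is one reward model's scores.
    rewards = np.stack([fn(samples) for fn in reward_fns])
    return weights @ rewards  # weighted sum -> (num_samples,)

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages: normalize each sample's reward against the
    mean and std of its sampling group, avoiding a learned value critic."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

A policy update would then weight each sample's log-probability gradient by its group-relative advantage; samples scoring above the group mean are reinforced, those below are suppressed.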

Results & Findings

| Benchmark | Compared Model | DeepGen 1.0 Result | Relative Gain |
| --- | --- | --- | --- |
| WISE (image generation) | HunyuanImage (80 B) | +28 % over baseline | Beats a model 16× larger |
| UniREditBench (image editing) | Qwen-Image-Edit (27 B) | +37 % over baseline | Outperforms a ~5× larger model |
| COCO-Text2Img | StableDiffusion (2 B) | Comparable FID | Matches quality despite the 2.5× parameter difference |
| Visual Reasoning (VQA-style) | Flamingo (80 B) | Within 5 % of its accuracy | Strong reasoning despite the size gap |

Key take‑aways:

  • Parameter efficiency – 5 B parameters suffice for high‑fidelity generation and precise editing.
  • Training data economy – ~50 M samples achieve or exceed results of models trained on hundreds of millions.
  • Robustness – MR‑GRPO reduces visual glitches (e.g., checkerboard artifacts) that often appear in RL‑fine‑tuned diffusion models.

Practical Implications

  • Product teams can embed a single 5 B model to power both text‑to‑image generation and interactive image editing (e.g., background replacement, style transfer) without maintaining separate pipelines.
  • On‑device or single‑GPU deployment becomes feasible, and cloud inference runs with a lower memory footprint, cutting operational costs.
  • Creative tools (design software, game asset pipelines, marketing platforms) gain a unified AI assistant that understands nuanced prompts and can iteratively refine outputs based on user feedback.
  • Research democratization – Open‑sourced training scripts and datasets enable smaller labs or startups to experiment with unified multimodal models without the massive compute budgets traditionally required.

Limitations & Future Work

  • Domain coverage – The training data, while diverse, still under‑represents niche domains (medical imaging, scientific visualization), limiting out‑of‑distribution performance.
  • Real‑time editing latency – Although lighter than 80 B models, inference at high resolutions (>1024 px) can still be several seconds on a single GPU; further optimization (e.g., distillation, quantization) is needed for truly interactive use.
  • Reward design – MR‑GRPO relies on handcrafted reward models; exploring more automated preference learning could improve alignment with end‑user aesthetics.
  • Explainability – The “think tokens” provide a useful abstraction, but their internal semantics are not yet interpretable; future work could probe how the model reasons about complex instructions.

DeepGen 1.0 demonstrates that thoughtful architecture and training design can dramatically shrink the size‑to‑performance gap in unified multimodal AI, opening the door for broader adoption across industry and research.

Authors

  • Dianyi Wang
  • Ruihang Li
  • Feng Han
  • Chaofan Ma
  • Wei Song
  • Siyuan Wang
  • Yibin Wang
  • Yi Xin
  • Hongjian Liu
  • Zhixiong Zhang
  • Shengyuan Ding
  • Tianhang Wang
  • Zhenglin Cheng
  • Tao Lin
  • Cheng Jin
  • Kaicheng Yu
  • Jingjing Chen
  • Wenjie Wang
  • Zhongyu Wei
  • Jiaqi Wang

Paper Information

  • arXiv ID: 2602.12205v1
  • Categories: cs.CV, cs.AI
  • Published: February 12, 2026