[Paper] CoLoGen: Progressive Learning of Concept-Localization Duality for Unified Image Generation

Published: February 25, 2026
Source: arXiv (2602.22150v1)

Overview

CoLoGen tackles a long‑standing roadblock in conditional image generation: the clash between conceptual understanding (what to draw) and localization precision (where to draw it). By introducing a progressive learning scheme that lets the model first master each ability separately and then weave them together, the authors deliver a unified diffusion model that works across editing, controllable synthesis, and custom generation tasks.

Key Contributions

  • Concept‑Localization Duality Framework – formalizes the representational conflict between semantic (concept) and spatial (localization) cues in a single diffusion model.
  • Progressive Representation Weaving (PRW) – a dynamic routing module that dispatches features to dedicated “concept” and “localization” experts and merges their outputs in a stable, stage‑wise manner.
  • Curriculum‑Style Training Pipeline – three‑stage curriculum (core skill acquisition → condition adaptation → synergy refinement) that progressively aligns the dual representations.
  • Unified Diffusion Architecture – a single set of weights that can handle diverse conditional generation tasks (text‑to‑image, mask‑guided editing, style‑customization) without task‑specific heads.
  • Strong Empirical Performance – competitive or superior results on benchmark suites for image editing, controllable generation, and personalized generation, with fewer parameters than multi‑model baselines.
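The curriculum-style pipeline listed above can be pictured as a simple stage schedule that toggles which conditioning streams each training sample uses. Everything in this sketch (the stage boundaries, the flag names, the `stage_for_step` helper) is a hypothetical illustration of the described three-stage idea, not the authors' training code:

```python
# Hypothetical sketch of the three-stage curriculum: which conditioning
# streams are active at each point in training. Boundaries are invented.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    use_concept: bool       # text/style conditioning active
    use_localization: bool  # mask/keypoint conditioning active
    mixed_cues: bool        # both streams combined within a single sample

CURRICULUM = [
    # Stage 1: experts trained on pure-concept or pure-localization samples.
    Stage("core_skill_acquisition", True, True, mixed_cues=False),
    # Stage 2: mixed cues so the gating module learns to balance the streams.
    Stage("condition_adaptation", True, True, mixed_cues=True),
    # Stage 3: complex instruction-driven data to polish the joint features.
    Stage("synergy_refinement", True, True, mixed_cues=True),
]

def stage_for_step(step: int, boundaries=(10_000, 30_000)) -> Stage:
    """Pick the active curriculum stage from the global training step."""
    for stage, end in zip(CURRICULUM, boundaries):
        if step < end:
            return stage
    return CURRICULUM[-1]
```

A data loader could call `stage_for_step` once per batch to decide whether to drop one of the conditioning streams, which is all the stage transitions amount to in this simplified view.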

Methodology

  1. Base Diffusion Backbone – starts from a standard latent diffusion model (LDM) that predicts noise in a compressed latent space.
  2. Expert Modules
    • Concept Expert: processes high‑level conditioning (e.g., text prompts, style tokens) to capture semantic intent.
    • Localization Expert: ingests spatial cues (masks, keypoints, bounding boxes) to preserve precise geometry.
  3. Progressive Representation Weaving (PRW)
    • At each diffusion timestep, PRW evaluates a gating network that decides how much of the concept vs. localization feature map should influence the denoising step.
    • The gated features are fused via a lightweight attention mixer, ensuring gradients flow smoothly across stages.
  4. Three‑Stage Curriculum
    • Stage 1 – Core Skill Building: train experts on pure concept (text‑only) and pure localization (mask‑only) tasks separately.
    • Stage 2 – Condition Adaptation: expose the model to mixed cues (e.g., “a red car inside a blue box”) so PRW learns to balance the two streams.
    • Stage 3 – Synergy Refinement: fine‑tune on complex instruction‑driven datasets (multi‑object scenes, style‑plus‑layout) to polish the joint representation.
  5. Training Details – uses classifier‑free guidance, a cosine noise schedule, and a modest batch size (≈64) on 8‑GPU nodes; total training time ≈ 2 days on a V100 cluster.
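The PRW step (item 3) can be sketched numerically. The scalar sigmoid gate and the plain convex blend below are simplifying assumptions for illustration; the paper describes a learned gating network and a lightweight attention mixer rather than this toy fusion:

```python
# Minimal NumPy sketch of timestep-conditioned gating between the two
# expert feature maps. Shapes and the affine-sigmoid gate are assumptions.
import numpy as np

def prw_fuse(concept_feat, local_feat, t, gate_w, gate_b):
    """Gate and fuse the expert feature maps at diffusion timestep t.

    concept_feat, local_feat: (C, H, W) feature maps from the two experts.
    t: scalar timestep in [0, 1]; gate_w, gate_b: gating parameters.
    """
    # Gating "network": an affine map on the timestep followed by a sigmoid,
    # giving the fraction of concept influence alpha in (0, 1).
    alpha = 1.0 / (1.0 + np.exp(-(gate_w * t + gate_b)))
    # A convex combination keeps the fused map on the same scale as the
    # inputs, so gradients flow smoothly across curriculum stages.
    return alpha * concept_feat + (1.0 - alpha) * local_feat

concept = np.ones((4, 8, 8))
local = np.zeros((4, 8, 8))
fused = prw_fuse(concept, local, t=0.5, gate_w=0.0, gate_b=0.0)  # alpha = 0.5
```

With zero gate parameters the sigmoid returns 0.5, so the fused map is an even blend; a trained gate would shift this balance per timestep, emphasizing layout early in denoising and semantics later (or vice versa).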

Results & Findings

| Task (metric) | CoLoGen | Prior Unified Model | Specialized Baseline |
|---|---|---|---|
| Text-to-Image (FID, lower is better) | 7.2 | 8.5 | 6.9 (task-specific) |
| Mask-Guided Editing (LPIPS, lower is better) | 0.18 | 0.24 | 0.17 |
| Personalized Generation (CLIP-Score, higher is better) | 0.86 | 0.80 | 0.88 |
  • Conceptual fidelity improves by ~10 % over the previous unified diffusion baseline, thanks to the dedicated concept expert.
  • Spatial accuracy (measured by mask alignment and LPIPS) gains ~15 % thanks to the localization expert and PRW gating.
  • The model matches or exceeds task‑specific specialists while using a single set of weights, confirming the effectiveness of the progressive curriculum.

Practical Implications

  • One‑Model Deployment – Companies can ship a single diffusion service that handles text‑to‑image, in‑painting, and style‑customization without maintaining separate pipelines.
  • Developer‑Friendly API – The PRW gating logic can be exposed as a lightweight “mode” flag (concept vs. localization emphasis), enabling fine‑grained control for UI designers.
  • Reduced Training Costs – By sharing parameters across tasks, organizations save GPU hours compared to training separate models for each conditional generation scenario.
  • Better User Experience – The progressive learning approach yields more reliable placement of objects in generated scenes, a common pain point in creative tools (e.g., graphic design SaaS, game asset generation).
  • Extensibility – New conditioning modalities (depth maps, sketches) can be hooked into the existing PRW framework as additional experts, accelerating feature roll‑outs.
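As a concrete picture of the "mode flag" idea above, here is a hypothetical service wrapper exposing the concept-vs-localization emphasis as a single knob. The class, method, and parameter names are invented for illustration and are not part of any released CoLoGen API:

```python
# Hypothetical API wrapper: one "concept_bias" knob in front of the PRW gate.
class CoLoGenService:
    def __init__(self, model=None):
        self.model = model  # the unified diffusion model (stubbed here)

    def generate(self, prompt, mask=None, concept_bias=0.5):
        """concept_bias=1.0 favors semantics; 0.0 favors spatial layout."""
        if not 0.0 <= concept_bias <= 1.0:
            raise ValueError("concept_bias must lie in [0, 1]")
        request = {"prompt": prompt, "mask": mask, "concept_bias": concept_bias}
        # A real deployment would feed `request` into the gated sampler;
        # here we return the resolved request so callers can inspect it.
        return request

svc = CoLoGenService()
req = svc.generate("a red car", mask="car_region.png", concept_bias=0.8)
```

Validating and clamping the knob at the API boundary keeps UI sliders safe to wire directly to the gate, which is the kind of fine-grained control the mode flag is meant to enable.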

Limitations & Future Work

  • Scalability of Experts – Adding many specialized experts may inflate memory usage; the current design balances two experts only.
  • Curriculum Sensitivity – The three‑stage schedule requires careful hyper‑parameter tuning; automatic curriculum learning could simplify this.
  • Generalization to High‑Resolution – experiments were limited to 512 × 512 outputs; scaling to ultra‑high‑resolution generation may require additional up‑sampling tricks.
  • User‑Controlled Trade‑off – While PRW internally balances concept vs. localization, exposing a smooth “bias” knob to end‑users remains an open UI challenge.

Future research directions include hierarchical expert trees for richer conditioning types, meta‑learning the curriculum schedule, and integrating diffusion‑based video generation under the same duality framework.

Authors

  • YuXin Song
  • Yu Lu
  • Haoyuan Sun
  • Huanjin Yao
  • Fanglong Liu
  • Yifan Sun
  • Haocheng Feng
  • Hang Zhou
  • Jingdong Wang

Paper Information

  • arXiv ID: 2602.22150v1
  • Categories: cs.CV
  • Published: February 25, 2026
