[Paper] Visual Generation Tuning

Published: November 28, 2025 at 01:57 PM EST
3 min read
Source: arXiv - 2511.23469v1

Overview

The paper introduces Visual Generation Tuning (VGT), a lightweight fine‑tuning recipe that unlocks image‑generation abilities in large Vision‑Language Models (VLMs) originally trained for multimodal understanding. By reusing the rich semantic encoders learned during pre‑training, VGT sidesteps costly pixel‑level autoencoders, converges roughly 20× faster than diffusion‑based transformers that rely on separate VAEs, and delivers image synthesis quality that leads comparable autoregressive models.

Key Contributions

  • Unified generation pipeline: Shows that pretrained VLMs can be repurposed for visual generation without redesigning the whole architecture.
  • VGT‑AE design: Replaces the usual VAE‑style latent space with a semantic‑aligned latent representation obtained by matching VLM encoders to a lightweight pixel decoder.
  • Efficiency boost: Achieves a 20–28× speedup in training convergence compared with diffusion‑based transformers that rely on separate pixel‑level VAEs.
  • Strong empirical results:
    • Image reconstruction: 26.67 PSNR and 0.50 rFID at a 28× compression ratio, beating dedicated VAEs.
    • Autoregressive generation: 0.77 GenEval and 78.73 DPG‑Bench, the best among comparable AR models.
  • Scalability & versatility: Demonstrates that VGT can be applied to a variety of existing VLMs, opening a path toward truly unified multimodal foundation models.

Methodology

  1. Start from a pretrained VLM (e.g., CLIP‑style models) that already maps images and text into a shared semantic space.
  2. Introduce a lightweight pixel decoder (a shallow CNN that maps latent vectors back to RGB images).
  3. Align the VLM’s semantic encoder with the decoder’s latent space via a simple reconstruction loss, turning the frozen encoder plus the new decoder into a semantically aligned autoencoder (the VGT‑AE).
  4. Train an autoregressive transformer on top of these aligned latents to model the distribution of image tokens in continuous space (see the second sketch below).
  5. Fine‑tune only the new components (decoder + transformer) while freezing most of the original VLM, drastically reducing compute and data requirements (a sketch of this setup follows the list).
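Read together, steps 2, 3, and 5 amount to training a small decoder against a frozen encoder with a plain reconstruction objective. The snippet below is a minimal PyTorch‑style sketch under that reading; `PixelDecoder`, `vgt_ae_alignment_step`, the L1 loss, and every dimension are hypothetical stand‑ins rather than the paper’s actual architecture or loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelDecoder(nn.Module):
    """Hypothetical shallow CNN decoder that maps the VLM's semantic latents
    back to RGB pixels (a stand-in for the paper's lightweight decoder)."""
    def __init__(self, latent_dim=1024, width=256):
        super().__init__()
        self.stem = nn.Conv2d(latent_dim, width, kernel_size=1)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(width, width // 2, 4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose2d(width // 2, width // 4, 4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose2d(width // 4, 3, 4, stride=2, padding=1),
        )

    def forward(self, latents):                      # latents: (B, N, D), N = h * w tokens
        B, N, D = latents.shape
        h = w = int(N ** 0.5)                        # assume a square token grid
        x = latents.transpose(1, 2).reshape(B, D, h, w)
        return self.up(self.stem(x))                 # (B, 3, 8h, 8w)

def vgt_ae_alignment_step(vision_encoder, decoder, images, optimizer):
    """One VGT-AE alignment step: the VLM encoder stays frozen (step 5) and only
    the decoder learns to reconstruct pixels from the semantic latents (step 3)."""
    with torch.no_grad():
        latents = vision_encoder(images)             # (B, N, D) semantic tokens
    recon = decoder(latents)
    recon = F.interpolate(recon, size=images.shape[-2:], mode="bilinear",
                          align_corners=False)       # match the target resolution
    loss = F.l1_loss(recon, images)                  # simple reconstruction loss; the paper
                                                     # may add perceptual or adversarial terms
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this reading, the optimizer holds only `decoder.parameters()`, which is what keeps the recipe lightweight.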

The key insight is that the semantic knowledge embedded in the VLM’s encoder already captures high‑level visual structure; aligning it with a modest decoder is enough to recover pixel‑level detail for generation tasks.
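Once the latents are aligned, step 4 reduces to sequence modeling in continuous space. The sketch below shows one plausible shape of that stage: a causal transformer regressing the next latent token under teacher forcing. This is an illustrative assumption only; the paper’s text conditioning and its exact continuous‑space objective may differ, and `LatentARHead` / `ar_training_step` are hypothetical names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentARHead(nn.Module):
    """Hypothetical causal transformer over the aligned continuous latents (step 4)."""
    def __init__(self, latent_dim=1024, n_layers=12, n_heads=16, max_tokens=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(latent_dim, n_heads, 4 * latent_dim,
                                           batch_first=True, norm_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.pos = nn.Parameter(torch.zeros(1, max_tokens, latent_dim))
        self.start = nn.Parameter(torch.zeros(1, 1, latent_dim))   # learned <BOS> latent
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, latents):                       # latents: (B, N, D)
        B, N, _ = latents.shape
        # shift right so token i is predicted only from tokens < i
        x = torch.cat([self.start.expand(B, -1, -1), latents[:, :-1]], dim=1)
        x = x + self.pos[:, :N]
        causal = torch.triu(torch.full((N, N), float("-inf"), device=x.device), diagonal=1)
        return self.out(self.backbone(x, mask=causal))

def ar_training_step(ar_head, latents, optimizer):
    """Teacher-forced step: regress each predicted latent toward the ground truth."""
    loss = F.mse_loss(ar_head(latents), latents)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At sampling time, such a head would be rolled out token by token and the resulting latent grid handed to the aligned decoder from the previous sketch.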

Results & Findings

  • Image reconstruction at a 28× compression ratio: PSNR 26.67 (vs. ~24–25 for prior art) and rFID 0.50 (vs. >0.7); see the PSNR sketch after this list.
  • Autoregressive image synthesis: GenEval 0.77 (vs. 0.68–0.73 for prior AR models) and DPG‑Bench 78.73 (vs. 70–75).
  • Training speed: Convergence reached in ~1/20 of the steps required by diffusion‑based transformers that rely on separate VAEs.
  • Quality vs. compression: Even at high compression ratios, VGT retains fine details, indicating that the semantic encoder preserves more information than conventional VAEs.
  • Scalability: Experiments scaling the VLM size (from 300 M to 1 B parameters) show consistent improvements, suggesting the approach benefits from larger foundation models.
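For reference, the reconstruction figures above use PSNR (per‑pixel fidelity on a log scale, higher is better) and rFID (Fréchet Inception Distance between reconstructed and original images, lower is better). A minimal sketch of the PSNR computation, assuming images scaled to [0, 1]:

```python
import torch

def psnr(reconstruction: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means a closer reconstruction."""
    mse = torch.mean((reconstruction - target) ** 2)
    return (10.0 * torch.log10(max_val ** 2 / mse)).item()
```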

Practical Implications

  • Rapid prototyping of generative features: Companies can add image‑generation capabilities to existing multimodal services (e.g., captioning, visual search) without retraining massive diffusion models from scratch.
  • Reduced infrastructure costs: The 20× faster convergence translates to lower GPU hours and energy consumption, making generative AI more accessible for startups and edge deployments.
  • Unified APIs: A single VLM can now serve both understanding (classification, retrieval) and creation (synthesis, editing) tasks, simplifying product pipelines and reducing model‑management overhead.
  • Potential for downstream tools: Text‑to‑image assistants, design mock‑up generators, and data‑augmentation pipelines can leverage VGT‑enhanced VLMs for higher fidelity outputs with fewer resources.

Limitations & Future Work

  • Pixel decoder simplicity: The current decoder is intentionally lightweight; more sophisticated decoders could further improve fidelity but may erode the efficiency gains.
  • Dependence on pretrained VLM quality: If the base VLM has weak visual semantics, VGT’s generation quality suffers, highlighting the need for strong foundation models.
  • Evaluation scope: Benchmarks focus on reconstruction and generic image synthesis; applying VGT to domain‑specific generation (e.g., medical imaging, 3‑D assets) remains an open question.
  • Future directions: The authors suggest exploring tighter integration of VGT with diffusion processes, extending the paradigm to video generation, and investigating multimodal prompting (text + sketch) to further enrich the unified model’s capabilities.

Authors

  • Jiahao Guo
  • Sinan Du
  • Jingfeng Yao
  • Wenyu Liu
  • Bo Li
  • Haoxiang Cao
  • Kun Gai
  • Chun Yuan
  • Kai Wu
  • Xinggang Wang

Paper Information

  • arXiv ID: 2511.23469v1
  • Categories: cs.CV
  • Published: November 28, 2025
  • PDF: https://arxiv.org/pdf/2511.23469v1