[Paper] Visual Generation Tuning

Published: November 28, 2025 at 01:57 PM EST
3 min read
Source: arXiv - 2511.23469v1

Overview

The paper introduces Visual Generation Tuning (VGT), a lightweight fine‑tuning recipe that unlocks image‑generation abilities in large Vision‑Language Models (VLMs) originally trained for multimodal understanding. By reusing the rich semantic encoders learned during pre‑training, VGT sidesteps costly pixel‑level autoencoders, converges roughly 20× faster than diffusion‑based transformers that rely on separate VAEs, and delivers image synthesis quality that leads comparable autoregressive models.

Key Contributions

  • Unified generation pipeline: Shows that pretrained VLMs can be repurposed for visual generation without redesigning the whole architecture.
  • VGT‑AE design: Replaces the usual VAE‑style latent space with a semantic‑aligned latent representation obtained by matching VLM encoders to a lightweight pixel decoder.
  • Efficiency boost: Achieves a 20–28× speedup in training convergence compared with diffusion‑based transformers that rely on separate pixel‑level VAEs.
  • Strong empirical results:
    • Image reconstruction: 26.67 PSNR and 0.50 rFID at a 28× compression ratio, beating dedicated VAEs.
    • Autoregressive generation: 0.77 GenEval and 78.73 DPG‑Bench, the best among comparable AR models.
  • Scalability & versatility: Demonstrates that VGT can be applied to a variety of existing VLMs, opening a path toward truly unified multimodal foundation models.

Methodology

  1. Start from a pretrained VLM (e.g., CLIP‑style models) that already maps images and text into a shared semantic space.
  2. Introduce a lightweight pixel decoder (a shallow CNN that maps latent vectors back to RGB images).
  3. Align the VLM’s semantic encoder with the decoder’s latent space via a simple reconstruction loss, turning the frozen encoder plus the new decoder into a semantically aligned autoencoder (the VGT‑AE).
  4. Train an autoregressive transformer on top of these aligned latents to model the distribution of image tokens in continuous space (see the second sketch below).
  5. Fine‑tune only the new components (decoder + transformer) while freezing most of the original VLM, drastically reducing compute and data requirements (a sketch of this setup follows the list).
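Read together, steps 2, 3, and 5 amount to training a small decoder against a frozen encoder with a plain reconstruction objective. The snippet below is a minimal PyTorch‑style sketch under that reading; `PixelDecoder`, `vgt_ae_alignment_step`, the L1 loss, and every dimension are hypothetical stand‑ins rather than the paper’s actual architecture or loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelDecoder(nn.Module):
    """Hypothetical shallow CNN decoder that maps the VLM's semantic latents
    back to RGB pixels (a stand-in for the paper's lightweight decoder)."""
    def __init__(self, latent_dim=1024, width=256):
        super().__init__()
        self.stem = nn.Conv2d(latent_dim, width, kernel_size=1)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(width, width // 2, 4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose2d(width // 2, width // 4, 4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose2d(width // 4, 3, 4, stride=2, padding=1),
        )

    def forward(self, latents):                      # latents: (B, N, D), N = h * w tokens
        B, N, D = latents.shape
        h = w = int(N ** 0.5)                        # assume a square token grid
        x = latents.transpose(1, 2).reshape(B, D, h, w)
        return self.up(self.stem(x))                 # (B, 3, 8h, 8w)

def vgt_ae_alignment_step(vision_encoder, decoder, images, optimizer):
    """One VGT-AE alignment step: the VLM encoder stays frozen (step 5) and only
    the decoder learns to reconstruct pixels from the semantic latents (step 3)."""
    with torch.no_grad():
        latents = vision_encoder(images)             # (B, N, D) semantic tokens
    recon = decoder(latents)
    recon = F.interpolate(recon, size=images.shape[-2:], mode="bilinear",
                          align_corners=False)       # match the target resolution
    loss = F.l1_loss(recon, images)                  # simple reconstruction loss; the paper
                                                     # may add perceptual or adversarial terms
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this reading, the optimizer holds only `decoder.parameters()`, which is what keeps the recipe lightweight.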

The key insight is that the semantic knowledge embedded in the VLM’s encoder already captures high‑level visual structure; aligning it with a modest decoder is enough to recover pixel‑level detail for generation tasks.
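Once the latents are aligned, step 4 reduces to sequence modeling in continuous space. The sketch below shows one plausible shape of that stage: a causal transformer regressing the next latent token under teacher forcing. This is an illustrative assumption only; the paper’s text conditioning and its exact continuous‑space objective may differ, and `LatentARHead` / `ar_training_step` are hypothetical names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentARHead(nn.Module):
    """Hypothetical causal transformer over the aligned continuous latents (step 4)."""
    def __init__(self, latent_dim=1024, n_layers=12, n_heads=16, max_tokens=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(latent_dim, n_heads, 4 * latent_dim,
                                           batch_first=True, norm_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.pos = nn.Parameter(torch.zeros(1, max_tokens, latent_dim))
        self.start = nn.Parameter(torch.zeros(1, 1, latent_dim))   # learned <BOS> latent
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, latents):                       # latents: (B, N, D)
        B, N, _ = latents.shape
        # shift right so token i is predicted only from tokens < i
        x = torch.cat([self.start.expand(B, -1, -1), latents[:, :-1]], dim=1)
        x = x + self.pos[:, :N]
        causal = torch.triu(torch.full((N, N), float("-inf"), device=x.device), diagonal=1)
        return self.out(self.backbone(x, mask=causal))

def ar_training_step(ar_head, latents, optimizer):
    """Teacher-forced step: regress each predicted latent toward the ground truth."""
    loss = F.mse_loss(ar_head(latents), latents)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At sampling time, such a head would be rolled out token by token and the resulting latent grid handed to the aligned decoder from the previous sketch.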

Results & Findings

  • Image reconstruction at a 28× compression ratio: PSNR 26.67 (vs. ~24–25 for prior art) and rFID 0.50 (vs. >0.7); see the PSNR sketch after this list.
  • Autoregressive image synthesis: GenEval 0.77 (vs. 0.68–0.73 for prior AR models) and DPG‑Bench 78.73 (vs. 70–75).
  • Training speed: Convergence reached in ~1/20 of the steps required by diffusion‑based transformers that rely on separate VAEs.
  • Quality vs. compression: Even at high compression ratios, VGT retains fine details, indicating that the semantic encoder preserves more information than conventional VAEs.
  • Scalability: Experiments scaling the VLM size (from 300 M to 1 B parameters) show consistent improvements, suggesting the approach benefits from larger foundation models.
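For reference, the reconstruction figures above use PSNR (per‑pixel fidelity on a log scale, higher is better) and rFID (Fréchet Inception Distance between reconstructed and original images, lower is better). A minimal sketch of the PSNR computation, assuming images scaled to [0, 1]:

```python
import torch

def psnr(reconstruction: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means a closer reconstruction."""
    mse = torch.mean((reconstruction - target) ** 2)
    return (10.0 * torch.log10(max_val ** 2 / mse)).item()
```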

Practical Implications

  • Rapid prototyping of generative features: Companies can add image‑generation capabilities to existing multimodal services (e.g., captioning, visual search) without retraining massive diffusion models from scratch.
  • Reduced infrastructure costs: The 20× faster convergence translates to lower GPU hours and energy consumption, making generative AI more accessible for startups and edge deployments.
  • Unified APIs: A single VLM can now serve both understanding (classification, retrieval) and creation (synthesis, editing) tasks, simplifying product pipelines and reducing model‑management overhead.
  • Potential for downstream tools: Text‑to‑image assistants, design mock‑up generators, and data‑augmentation pipelines can leverage VGT‑enhanced VLMs for higher fidelity outputs with fewer resources.

Limitations & Future Work

  • Pixel decoder simplicity: The current decoder is intentionally lightweight; more sophisticated decoders could further improve fidelity but may erode the efficiency gains.
  • Dependence on pretrained VLM quality: If the base VLM has weak visual semantics, VGT’s generation quality suffers, highlighting the need for strong foundation models.
  • Evaluation scope: Benchmarks focus on reconstruction and generic image synthesis; applying VGT to domain‑specific generation (e.g., medical imaging, 3‑D assets) remains an open question.
  • Future directions: The authors suggest exploring tighter integration of VGT with diffusion processes, extending the paradigm to video generation, and investigating multimodal prompting (text + sketch) to further enrich the unified model’s capabilities.

Authors

  • Jiahao Guo
  • Sinan Du
  • Jingfeng Yao
  • Wenyu Liu
  • Bo Li
  • Haoxiang Cao
  • Kun Gai
  • Chun Yuan
  • Kai Wu
  • Xinggang Wang

Paper Information

  • arXiv ID: 2511.23469v1
  • Categories: cs.CV
  • Published: November 28, 2025
  • PDF: https://arxiv.org/pdf/2511.23469v1