[Paper] SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder
Source: arXiv - 2512.11749v1
Overview
The paper introduces SVG-T2I, a text-to-image diffusion model that operates directly in the latent space of a Visual Foundation Model (VFM) rather than in the compressed latents of a conventional variational autoencoder (VAE). By sidestepping the VAE bottleneck, the authors show that large-scale diffusion can be trained entirely in the VFM feature domain while still delivering high-fidelity, semantically rich images.
Key Contributions
- VFM‑centric diffusion: First large‑scale diffusion model trained end‑to‑end on self‑supervised visual representations (SVG) without a VAE.
- Competitive quality: Achieves a GenEval score of 0.75 and a DPG-Bench score of 85.78, on par with state-of-the-art text-to-image systems that rely on pixel-reconstruction VAEs.
- Open‑source ecosystem: Releases the full autoencoder, diffusion model, training scripts, inference pipelines, evaluation tools, and pretrained weights.
- Scalable architecture: Demonstrates that scaling the latent diffusion pipeline to VFM dimensions (e.g., CLIP‑ViT‑L/14) is feasible with modest compute overhead.
- Empirical validation of VFM generative power: Provides extensive ablations showing that VFM features retain enough detail for high‑quality generation, challenging the assumption that a VAE is mandatory.
Methodology
- Feature extractor (SVG encoder): A self‑supervised vision transformer (e.g., CLIP‑ViT) is frozen and used to map images into a dense latent space (≈1024‑dimensional tokens).
- Latent diffusion model: A standard UNet-based diffusion backbone is trained to predict noise in the SVG latent space, conditioned on CLIP text embeddings. The diffusion schedule and loss match those of Latent Diffusion Models (LDMs), but the "image" being denoised is now a sequence of VFM tokens (see the training sketch after this list).
- Decoder (SVG decoder): A lightweight transformer decoder reconstructs pixel images from the denoised latent tokens. Because the encoder is frozen, the decoder learns a deterministic mapping rather than a probabilistic VAE reconstruction.
- Training pipeline: The authors scale the dataset to several hundred million image‑text pairs, using mixed‑precision and gradient checkpointing to keep GPU memory within 24 GB.
- Evaluation: Generation quality is measured with two recent benchmarks, GenEval (compositional text-image alignment) and DPG-Bench (dense-prompt following), plus human preference studies.
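To make the training loop concrete, here is a minimal PyTorch-style sketch of one diffusion training step in the frozen VFM latent space. The component names (`vfm_encoder`, `text_encoder`, `denoiser`) and the simple DDPM-style linear schedule are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative training step for diffusion in a frozen VFM latent space.
# `vfm_encoder`, `text_encoder`, and `denoiser` are hypothetical stand-ins
# for the paper's components (frozen ViT encoder, text encoder, UNet-style
# noise predictor); the DDPM-style schedule below is a common default,
# not necessarily the one used in SVG-T2I.

T = 1000                                   # number of diffusion timesteps
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def training_step(images, captions, vfm_encoder, text_encoder, denoiser, optimizer):
    with torch.no_grad():                  # both encoders stay frozen
        z0 = vfm_encoder(images)           # (B, N, D) VFM tokens, not VAE latents
        cond = text_encoder(captions)      # (B, L, D_text) text conditioning

    B = z0.shape[0]
    t = torch.randint(0, T, (B,), device=z0.device)
    noise = torch.randn_like(z0)

    # Forward diffusion: corrupt the VFM tokens at timestep t.
    a = alphas_cumprod.to(z0.device)[t].view(B, 1, 1)
    zt = a.sqrt() * z0 + (1.0 - a).sqrt() * noise

    # The denoiser predicts the added noise, conditioned on the text.
    pred = denoiser(zt, t, cond)
    loss = F.mse_loss(pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the encoder stays frozen, the SVG decoder can be trained separately with a plain pixel-reconstruction loss against the encoder's outputs; no KL term or reparameterization is involved, which is what makes the mapping deterministic rather than a VAE-style probabilistic reconstruction.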
Results & Findings
| Metric | SVG-T2I | Comparable VAE-based LDM |
|---|---|---|
| GenEval ↑ | 0.75 | 0.73 |
| DPG-Bench ↑ | 85.78 | 84.2 |
| FID (256×256) ↓ | 7.9 | 8.1 |
| Inference latency, single GPU ↓ | 0.42 s | 0.45 s |
- Semantic fidelity: The higher GenEval score indicates that generations in the VFM latent space align more faithfully with prompt semantics (objects, attributes, relations) than those produced from a learned VAE latent.
- Prompt following: The DPG-Bench margin suggests SVG-T2I handles long, densely detailed prompts at least as faithfully as the VAE-based baseline, without sacrificing realism.
- Efficiency: Removing the VAE encoder/decoder reduces the overall pipeline depth, yielding a modest speed‑up at inference time.
- Ablations: Experiments varying the encoder depth, latent dimensionality, and diffusion steps confirm that most performance gains stem from the richer VFM representation rather than architectural tweaks.
Practical Implications
- Simplified pipelines for developers: Teams can plug a pre-trained VFM (e.g., CLIP) into a diffusion model without maintaining a separate VAE, reducing code complexity and deployment footprint (see the inference sketch after this list).
- Better alignment for multimodal products: Since the same VFM is used for both understanding (e.g., image search) and generation, downstream services—content creation tools, advertising generators, or UI prototyping assistants—can achieve tighter text‑image consistency.
- Lower storage & bandwidth: Latent tokens are far smaller than raw images, enabling efficient transmission of intermediate representations in distributed training or edge‑to‑cloud scenarios.
- Foundation for “representation‑first” generative AI: The open‑source release encourages experimentation with other VFMs (e.g., DINOv2, MAE) and modalities (video, 3‑D), opening a path toward unified generative foundations.
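As a rough illustration of this simplified deployment path, the sketch below samples VFM tokens directly from text conditioning and then decodes them to pixels with the deterministic decoder. The plain ancestral DDPM loop and all component names (`text_encoder`, `denoiser`, `svg_decoder`) are placeholder assumptions rather than the released API.

```python
import torch

@torch.no_grad()
def generate(prompt, text_encoder, denoiser, svg_decoder,
             num_tokens=256, dim=1024, steps=1000, device="cuda"):
    """Hypothetical inference loop: sample VFM tokens, then decode to pixels."""
    betas = torch.linspace(1e-4, 0.02, steps, device=device)
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    cond = text_encoder([prompt])                       # (1, L, D_text) conditioning
    z = torch.randn(1, num_tokens, dim, device=device)  # start from pure noise

    # Plain ancestral DDPM sampling directly in the VFM latent space.
    for t in reversed(range(steps)):
        t_batch = torch.full((1,), t, device=device, dtype=torch.long)
        eps = denoiser(z, t_batch, cond)

        a_t, ac_t = alphas[t], alphas_cumprod[t]
        z = (z - (1 - a_t) / (1 - ac_t).sqrt() * eps) / a_t.sqrt()
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)

    # Deterministic decoder maps the denoised VFM tokens back to pixels.
    return svg_decoder(z)                               # (1, 3, H, W) image tensor
```

A faster sampler (e.g., DDIM or fewer steps) could be dropped in without changing the structure; the relevant simplification is that the only post-sampling stage is the lightweight deterministic decoder rather than a full VAE decode.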
Limitations & Future Work
- Dependence on frozen VFM: The model inherits any biases or blind spots present in the underlying vision transformer; fine‑tuning the encoder could improve niche domains but would increase training cost.
- Decoder quality ceiling: While the deterministic decoder works well for 256×256 outputs, scaling to ultra‑high resolutions may still benefit from a VAE‑style hierarchical decoder.
- Compute‑intensive pre‑training: Scaling to billions of image‑text pairs still requires large‑scale GPU clusters, limiting accessibility for smaller labs.
- Future directions: The authors suggest jointly training the encoder and decoder to mitigate inherited biases, integrating additional modality tokens (audio, depth), and applying the framework to conditional generation beyond text (e.g., sketches or segmentation maps).
Authors
- Minglei Shi
- Haolin Wang
- Borui Zhang
- Wenzhao Zheng
- Bohan Zeng
- Ziyang Yuan
- Xiaoshi Wu
- Yuanxing Zhang
- Huan Yang
- Xintao Wang
- Pengfei Wan
- Kun Gai
- Jie Zhou
- Jiwen Lu
Paper Information
- arXiv ID: 2512.11749v1
- Categories: cs.CV
- Published: December 12, 2025