[Paper] SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder
Source: arXiv - 2512.11749v1
Overview
The paper introduces SVG-T2I, a text-to-image diffusion model that operates directly in the latent space of a Visual Foundation Model (VFM) rather than in the compressed latents of a conventional variational autoencoder (VAE). By sidestepping the VAE bottleneck, the authors show that large-scale diffusion can be trained entirely in the VFM feature domain while still delivering high-fidelity, semantically rich images.
Key Contributions
- VFM‑centric diffusion: First large‑scale diffusion model trained end‑to‑end on self‑supervised visual representations (SVG) without a VAE.
- Competitive quality: Achieves a GenEval score of 0.75 and a DPG-Bench score of 85.78, on par with state-of-the-art text-to-image systems that rely on pixel-reconstruction VAEs.
- Open‑source ecosystem: Releases the full autoencoder, diffusion model, training scripts, inference pipelines, evaluation tools, and pretrained weights.
- Scalable architecture: Demonstrates that scaling the latent diffusion pipeline to VFM dimensions (e.g., CLIP‑ViT‑L/14) is feasible with modest compute overhead.
- Empirical validation of VFM generative power: Provides extensive ablations showing that VFM features retain enough detail for high‑quality generation, challenging the assumption that a VAE is mandatory.
Methodology
- Feature extractor (SVG encoder): A self‑supervised vision transformer (e.g., CLIP‑ViT) is frozen and used to map images into a dense latent space (≈1024‑dimensional tokens).
- Latent diffusion model: A standard UNet-based diffusion backbone is trained to predict noise in the SVG latent space, conditioned on CLIP text embeddings. The diffusion schedule and loss match those of Latent Diffusion Models (LDMs), but the "image" being denoised is now a sequence of VFM tokens (see the training sketch after this list).
- Decoder (SVG decoder): A lightweight transformer decoder reconstructs pixel images from the denoised latent tokens. Because the encoder is frozen, the decoder learns a deterministic mapping rather than a probabilistic VAE reconstruction.
- Training pipeline: The authors scale the dataset to several hundred million image‑text pairs, using mixed‑precision and gradient checkpointing to keep GPU memory within 24 GB.
- Evaluation: Generation quality is measured with two recent benchmarks, GenEval (compositional text-image alignment) and DPG-Bench (dense-prompt following), plus human preference studies.
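To make the training loop concrete, here is a minimal PyTorch-style sketch of one diffusion training step in the frozen VFM latent space. The component names (`vfm_encoder`, `text_encoder`, `denoiser`) and the simple DDPM-style linear schedule are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative training step for diffusion in a frozen VFM latent space.
# `vfm_encoder`, `text_encoder`, and `denoiser` are hypothetical stand-ins
# for the paper's components (frozen ViT encoder, text encoder, UNet-style
# noise predictor); the DDPM-style schedule below is a common default,
# not necessarily the one used in SVG-T2I.

T = 1000                                   # number of diffusion timesteps
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def training_step(images, captions, vfm_encoder, text_encoder, denoiser, optimizer):
    with torch.no_grad():                  # both encoders stay frozen
        z0 = vfm_encoder(images)           # (B, N, D) VFM tokens, not VAE latents
        cond = text_encoder(captions)      # (B, L, D_text) text conditioning

    B = z0.shape[0]
    t = torch.randint(0, T, (B,), device=z0.device)
    noise = torch.randn_like(z0)

    # Forward diffusion: corrupt the VFM tokens at timestep t.
    a = alphas_cumprod.to(z0.device)[t].view(B, 1, 1)
    zt = a.sqrt() * z0 + (1.0 - a).sqrt() * noise

    # The denoiser predicts the added noise, conditioned on the text.
    pred = denoiser(zt, t, cond)
    loss = F.mse_loss(pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the encoder stays frozen, the SVG decoder can be trained separately with a plain pixel-reconstruction loss against the encoder's outputs; no KL term or reparameterization is involved, which is what makes the mapping deterministic rather than a VAE-style probabilistic reconstruction.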
Results & Findings
| Metric | SVG-T2I | Comparable VAE-based LDM |
|---|---|---|
| GenEval ↑ | 0.75 | 0.73 |
| DPG-Bench ↑ | 85.78 | 84.2 |
| FID (256×256) ↓ | 7.9 | 8.1 |
| Inference latency, single GPU ↓ | 0.42 s | 0.45 s |
- Semantic fidelity: The higher GenEval score indicates that generations in the VFM latent space align more faithfully with prompt semantics (objects, attributes, relations) than those produced from a learned VAE latent.
- Prompt following: The DPG-Bench margin suggests SVG-T2I handles long, densely detailed prompts at least as faithfully as the VAE-based baseline, without sacrificing realism.
- Efficiency: Removing the VAE encoder/decoder reduces the overall pipeline depth, yielding a modest speed‑up at inference time.
- Ablations: Experiments varying the encoder depth, latent dimensionality, and diffusion steps confirm that most performance gains stem from the richer VFM representation rather than architectural tweaks.
Practical Implications
- Simplified pipelines for developers: Teams can plug a pre-trained VFM (e.g., CLIP) into a diffusion model without maintaining a separate VAE, reducing code complexity and deployment footprint (see the inference sketch after this list).
- Better alignment for multimodal products: Since the same VFM is used for both understanding (e.g., image search) and generation, downstream services—content creation tools, advertising generators, or UI prototyping assistants—can achieve tighter text‑image consistency.
- Lower storage & bandwidth: Latent tokens are far smaller than raw images, enabling efficient transmission of intermediate representations in distributed training or edge‑to‑cloud scenarios.
- Foundation for “representation‑first” generative AI: The open‑source release encourages experimentation with other VFMs (e.g., DINOv2, MAE) and modalities (video, 3‑D), opening a path toward unified generative foundations.
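As a rough illustration of this simplified deployment path, the sketch below samples VFM tokens directly from text conditioning and then decodes them to pixels with the deterministic decoder. The plain ancestral DDPM loop and all component names (`text_encoder`, `denoiser`, `svg_decoder`) are placeholder assumptions rather than the released API.

```python
import torch

@torch.no_grad()
def generate(prompt, text_encoder, denoiser, svg_decoder,
             num_tokens=256, dim=1024, steps=1000, device="cuda"):
    """Hypothetical inference loop: sample VFM tokens, then decode to pixels."""
    betas = torch.linspace(1e-4, 0.02, steps, device=device)
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    cond = text_encoder([prompt])                       # (1, L, D_text) conditioning
    z = torch.randn(1, num_tokens, dim, device=device)  # start from pure noise

    # Plain ancestral DDPM sampling directly in the VFM latent space.
    for t in reversed(range(steps)):
        t_batch = torch.full((1,), t, device=device, dtype=torch.long)
        eps = denoiser(z, t_batch, cond)

        a_t, ac_t = alphas[t], alphas_cumprod[t]
        z = (z - (1 - a_t) / (1 - ac_t).sqrt() * eps) / a_t.sqrt()
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)

    # Deterministic decoder maps the denoised VFM tokens back to pixels.
    return svg_decoder(z)                               # (1, 3, H, W) image tensor
```

A faster sampler (e.g., DDIM or fewer steps) could be dropped in without changing the structure; the relevant simplification is that the only post-sampling stage is the lightweight deterministic decoder rather than a full VAE decode.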
Limitations & Future Work
- Dependence on frozen VFM: The model inherits any biases or blind spots present in the underlying vision transformer; fine‑tuning the encoder could improve niche domains but would increase training cost.
- Decoder quality ceiling: While the deterministic decoder works well for 256×256 outputs, scaling to ultra‑high resolutions may still benefit from a VAE‑style hierarchical decoder.
- Compute‑intensive pre‑training: Scaling to billions of image‑text pairs still requires large‑scale GPU clusters, limiting accessibility for smaller labs.
- Future directions: The authors suggest jointly training the encoder and decoder to mitigate inherited biases, integrating additional modality tokens (audio, depth), and applying the framework to conditional generation beyond text (e.g., sketches or segmentation maps).
Authors
- Minglei Shi
- Haolin Wang
- Borui Zhang
- Wenzhao Zheng
- Bohan Zeng
- Ziyang Yuan
- Xiaoshi Wu
- Yuanxing Zhang
- Huan Yang
- Xintao Wang
- Pengfei Wan
- Kun Gai
- Jie Zhou
- Jiwen Lu
Paper Information
- arXiv ID: 2512.11749v1
- Categories: cs.CV
- Published: December 12, 2025