[Paper] Towards Scalable Pre-training of Visual Tokenizers for Generation
Source: arXiv - 2512.13687v1
Overview
This paper tackles a hidden bottleneck in modern image generation pipelines: the visual tokenizer (often a VAE‑style encoder) that converts raw pixels into a compact latent representation. The authors show that the usual reconstruction‑only pre‑training produces latents that are good at reproducing low‑level details but poor at capturing the high‑level semantics that downstream generators actually need. By redesigning the pre‑training objective to include contrastive image‑text alignment and self‑supervised learning, they build a tokenizer that scales gracefully with compute and dramatically accelerates the training of downstream generators.
Key Contributions
- Identify the “pre‑training scaling problem” – standard reconstruction‑only training fails to improve generative quality even when massive compute is spent.
- Introduce VTP (Visual Tokenizer Pre‑training) – a unified framework that jointly optimizes three objectives (a plausible form of the combined objective is sketched after this list):
  - Image‑text contrastive loss (semantic alignment).
  - Self‑supervised loss (e.g., masked image modeling).
  - Reconstruction loss (pixel fidelity).
- Large‑scale empirical study demonstrating that semantic understanding is the primary driver of generation quality.
- Show strong scaling behavior: increasing FLOPs, model size, or data for VTP yields consistent FID improvements, unlike conventional autoencoders that plateau early.
- Release pre‑trained tokenizers that achieve 78.2 % zero‑shot ImageNet accuracy, 0.36 rFID, and 4.1× faster convergence for downstream diffusion models (DiT), with no architectural changes required.
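For concreteness, the joint objective described above can be written as a weighted combination of the three terms. The symbols and weights λ below are illustrative notation rather than the paper's exact formulation:

$$
\mathcal{L}_{\text{VTP}} = \lambda_{\text{con}}\,\mathcal{L}_{\text{contrastive}} + \lambda_{\text{ssl}}\,\mathcal{L}_{\text{self-sup}} + \lambda_{\text{rec}}\,\mathcal{L}_{\text{recon}}
$$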
Methodology
- Unified Loss Design – The tokenizer’s encoder is trained with a weighted sum of three objectives (minimal code sketches follow this list):
  - Contrastive image‑text loss (similar to CLIP) forces the latent to encode semantics that align with natural language captions.
  - Self‑supervised loss (e.g., masked patch prediction) encourages the model to infer missing visual information, promoting richer feature learning.
  - Reconstruction loss (pixel‑wise L2 or perceptual loss) still ensures the latent can be decoded back to a faithful image.
- Architecture – A standard Vision Transformer (ViT) backbone serves as the encoder; a lightweight decoder reconstructs images. The same encoder is later reused as the latent provider for diffusion‑based generators (DiT).
- Training Regime – Models are pre‑trained on massive image‑text datasets (e.g., LAION‑400M) using distributed training. Hyper‑parameters are tuned to balance the three losses, with a schedule that gradually shifts emphasis from reconstruction to semantic alignment as training progresses.
- Evaluation Pipeline – After pre‑training, the tokenizer is frozen and plugged into a DiT diffusion model trained on ImageNet. Generation quality is measured by FID and convergence speed, reconstruction fidelity by rFID, and the tokenizer’s own representation quality via zero‑shot classification accuracy.
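Below is a minimal PyTorch‑style sketch of how such a weighted objective and the reconstruction‑to‑alignment weight schedule could look. The function names, the linear schedule, and the use of plain MSE for the self‑supervised and reconstruction terms are assumptions for illustration, not the authors’ released code:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between L2-normalized image and text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def vtp_joint_loss(img_emb, txt_emb, masked_pred, masked_target,
                   recon, pixels, step: int, total_steps: int) -> torch.Tensor:
    """Weighted sum of the three objectives; the linear schedule that shifts
    emphasis from reconstruction to semantic alignment is an assumption."""
    progress = step / total_steps
    w_rec = 1.0 - 0.5 * progress   # reconstruction weight decays over training
    w_con = 0.5 + 0.5 * progress   # contrastive weight grows over training
    w_ssl = 1.0                    # self-supervised weight held fixed

    loss_con = clip_contrastive_loss(img_emb, txt_emb)
    loss_ssl = F.mse_loss(masked_pred, masked_target)  # e.g., masked patch prediction
    loss_rec = F.mse_loss(recon, pixels)               # pixel-wise L2 (perceptual term omitted)
    return w_con * loss_con + w_ssl * loss_ssl + w_rec * loss_rec

# Toy call with random tensors; shapes are purely illustrative.
B, D, P = 8, 512, 196
loss = vtp_joint_loss(
    img_emb=torch.randn(B, D), txt_emb=torch.randn(B, D),
    masked_pred=torch.randn(B, P, D), masked_target=torch.randn(B, P, D),
    recon=torch.randn(B, 3, 256, 256), pixels=torch.randn(B, 3, 256, 256),
    step=1_000, total_steps=100_000,
)
```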
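For the evaluation pipeline, the sketch below shows how a frozen tokenizer can serve purely as a latent provider while only the diffusion model is updated. `diffusion_loss_fn` and the model objects are hypothetical placeholders, not the paper’s released API:

```python
import torch

@torch.no_grad()
def encode_to_latents(tokenizer_encoder: torch.nn.Module,
                      images: torch.Tensor) -> torch.Tensor:
    """Frozen tokenizer: map pixels to latents with no gradient flow."""
    tokenizer_encoder.eval()
    return tokenizer_encoder(images)

def dit_training_step(tokenizer_encoder: torch.nn.Module,
                      dit_model: torch.nn.Module,
                      diffusion_loss_fn,          # placeholder: e.g., noise-prediction loss
                      optimizer: torch.optim.Optimizer,
                      images: torch.Tensor) -> float:
    """One downstream training step; only the DiT parameters receive gradients."""
    latents = encode_to_latents(tokenizer_encoder, images)
    loss = diffusion_loss_fn(dit_model, latents)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the tokenizer is frozen, its latents can in practice be pre‑computed and cached, so repeated generator runs pay the tokenization cost only once.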
Results & Findings
| Metric | Conventional VAE (reconstruction only) | VTP (joint loss) |
|---|---|---|
| ImageNet zero‑shot accuracy | ~65 % | 78.2 % |
| rFID (reconstruction fidelity, lower is better) | 0.48 | 0.36 |
| DiT convergence speed | 1× (baseline) | 4.1× faster |
| FID scaling with pre‑training FLOPs | Stagnates after ~10 % of total FLOPs | 65.8 % FID reduction when FLOPs are doubled |
Key takeaways
- Adding semantic contrastive loss yields latents that are far more useful for downstream generators.
- The tokenizer’s performance scales almost linearly with compute, data, and model size—something reconstruction‑only VAEs cannot achieve.
- Downstream diffusion models converge dramatically faster, saving both training time and cloud cost.
Practical Implications
- Faster Model Development – Teams can pre‑train a VTP tokenizer once and reuse it across multiple generative projects (image synthesis, in‑painting, style transfer), cutting down on repeated expensive training cycles.
- Better Zero‑Shot Transfer – The high semantic fidelity enables plug‑and‑play generation for new domains without fine‑tuning the tokenizer, useful for rapid prototyping in e‑commerce, gaming, or AR/VR content creation.
- Cost‑Effective Scaling – Because generative quality improves with additional pre‑training compute, organizations can invest in larger pre‑training runs (e.g., on public cloud) and reap proportional gains in downstream model performance, rather than hitting a hard ceiling.
- Compatibility – VTP works with existing diffusion frameworks (DiT, Stable Diffusion, etc.) without architectural changes, making integration straightforward for engineers already using those stacks.
- Open‑Source Availability – The released models and training scripts lower the barrier for startups and research labs to experiment with high‑quality visual tokenizers without building everything from scratch.
Limitations & Future Work
- Training Cost – While VTP scales well, the initial joint pre‑training still requires substantial GPU hours and large image‑text corpora, which may be prohibitive for small teams.
- Domain Specificity – The tokenizer is trained on broad internet data; performance on highly specialized domains (medical imaging, satellite imagery) may degrade without domain‑specific fine‑tuning.
- Loss Balancing – The optimal weighting between contrastive, self‑supervised, and reconstruction losses is empirically determined; a more principled or adaptive scheme could further improve robustness.
- Extension to Video – The paper focuses on still images; extending the unified loss framework to spatio‑temporal tokenizers for video generation is an open direction.
Overall, VTP demonstrates that “understanding” the visual world—via semantic alignment—is the key to unlocking scalable, high‑quality image generation.
Authors
- Jingfeng Yao
- Yuda Song
- Yucong Zhou
- Xinggang Wang
Paper Information
- arXiv ID: 2512.13687v1
- Categories: cs.CV
- Published: December 15, 2025