[Paper] Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing
Source: arXiv - 2512.17909v1
Overview
This paper tackles a practical bottleneck in modern text‑to‑image (T2I) pipelines: the latent space used by diffusion models is usually a low‑level VAE representation that excels at pixel reconstruction but carries little semantic meaning. The authors show that directly plugging high‑level encoder features (e.g., CLIP, DINO) into diffusion models leads to two problems: unstable generation, because the latent space is not compact, and loss of fine‑grained detail, because the encoder is not trained for pixel‑level reconstruction. They propose a unified framework that adapts discriminative encoders to produce generation‑ready latents, achieving strong reconstruction quality while keeping the representation compact enough for diffusion‑based generation and editing.
Key Contributions
- Semantic‑Pixel Reconstruction Objective – a novel loss that jointly enforces semantic fidelity (preserving high‑level concepts) and pixel‑level accuracy, forcing the encoder to compress both kinds of information into a compact latent.
- Compact, High‑Quality Latent Design – a 96‑channel feature map at 16×16 spatial resolution that is small enough for efficient diffusion yet rich enough for accurate image synthesis.
- Unified T2I & Image‑Editing Model – a single diffusion model trained on the new latents that can both generate images from text prompts and perform precise editing (e.g., inpainting, style transfer) without separate fine‑tuning.
- Extensive Benchmarking – systematic comparison against multiple existing feature spaces (CLIP‑ViT, DINO, etc.) showing state‑of‑the‑art reconstruction scores, faster convergence, and sizable gains in generation/editing metrics.
- Open‑Source Implementation & Pre‑Trained Weights – the authors release code and models, enabling the community to adopt the approach directly.
Methodology
1. Encoder Adaptation
- Start from a pretrained discriminative encoder (e.g., CLIP ViT‑B/32).
- Append a lightweight decoder and train the encoder‑decoder pair with the semantic‑pixel reconstruction loss (a code sketch follows this step):
- Semantic term: L2 distance between encoder outputs of the original and reconstructed images, encouraging preservation of high‑level concepts.
- Pixel term: Standard L1/L2 reconstruction loss on RGB pixels, forcing fine‑grained detail.
- The training compresses the image into a 96‑channel, 16×16 latent tensor, dramatically reducing dimensionality while retaining semantics.
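A minimal PyTorch sketch of how such a joint objective can be written, assuming the encoder and decoder are ordinary `nn.Module`s and that `lambda_sem` / `lambda_pix` are tunable weights; this illustrates the loss described above, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def semantic_pixel_loss(encoder, decoder, x, lambda_sem=1.0, lambda_pix=1.0):
    """Joint semantic + pixel reconstruction loss (illustrative sketch).

    encoder: maps an RGB image to a compact latent (e.g., 96 x 16 x 16).
    decoder: maps the latent back to RGB.
    lambda_sem / lambda_pix are assumed hyperparameters, not values from the paper.
    """
    z = encoder(x)                       # compact latent of the original image
    x_hat = decoder(z)                   # reconstruction

    # Pixel term: L1 distance in RGB space (L2 would also fit the description),
    # forcing fine-grained detail.
    pixel_loss = F.l1_loss(x_hat, x)

    # Semantic term: L2 distance between the encoder's output on the original
    # image (detached as a fixed target) and on the reconstruction, encouraging
    # preservation of high-level concepts.
    semantic_loss = F.mse_loss(encoder(x_hat), z.detach())

    return lambda_pix * pixel_loss + lambda_sem * semantic_loss
```

Detaching the encoder output of the original image treats it as a fixed semantic target, so the gradient only pushes the reconstruction toward it rather than collapsing both features together.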
2. Diffusion Model Integration
- Use a Latent Diffusion Model (LDM) that operates directly on the compact latents.
- Condition the diffusion process on text embeddings (from the same CLIP model) and optionally on a reference latent for editing tasks.
- Because the latent space is regularized, the diffusion trajectory stays “on‑manifold,” avoiding distorted structures.
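The sketch below shows how one training step of such a latent diffusion model on the compact latent might look; the denoiser call signature, the simplified noise schedule, and the channel‑wise concatenation of an optional reference latent are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def ldm_training_step(denoiser, z0, text_emb, ref_latent=None, num_timesteps=1000):
    """One denoising-diffusion training step on the compact latent (sketch).

    denoiser(z_t, t, text_emb): predicts the noise added to z_t (assumed signature).
    ref_latent: optional reference latent (e.g., the source image for editing),
    concatenated with z_t along the channel dimension as extra conditioning.
    """
    b = z0.shape[0]
    t = torch.randint(0, num_timesteps, (b,), device=z0.device)

    # Simple linear alpha-bar schedule, a placeholder for the real noise schedule.
    alpha_bar = (1.0 - t.float() / num_timesteps).view(b, 1, 1, 1)

    noise = torch.randn_like(z0)
    z_t = alpha_bar.sqrt() * z0 + (1.0 - alpha_bar).sqrt() * noise

    if ref_latent is not None:
        z_t = torch.cat([z_t, ref_latent], dim=1)  # editing-style conditioning

    pred = denoiser(z_t, t, text_emb)              # text-conditioned noise prediction
    return F.mse_loss(pred, noise)
```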
3. Unified Generation & Editing
- For text‑to‑image, feed a text prompt and sample from the diffusion model to obtain a latent, then decode it back to RGB.
- For editing, encode the source image, mask the region to edit, run diffusion conditioned on the prompt and the unmasked latent, and finally decode the edited latent.
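One common way to realize the editing loop described above is to pin the unmasked region of the latent to the source at every denoising step. The sketch below assumes a `sample_step` function exposed by the trained model and a mask already downsampled to the 16×16 latent grid; it is a simplified illustration, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def edit_image(encoder, decoder, sample_step, x_src, mask, text_emb, num_steps=50):
    """Region-based editing on the compact latent (illustrative sketch).

    sample_step(z_t, t, text_emb): one reverse-diffusion update (assumed helper).
    mask: 1 where the image should be regenerated, 0 where the source is kept;
    assumed to be shaped for broadcasting over the 16x16 latent grid.
    """
    z_src = encoder(x_src)              # latent of the source image
    z = torch.randn_like(z_src)         # start the edited region from pure noise

    for t in reversed(range(num_steps)):
        z = sample_step(z, t, text_emb)
        # Keep the unmasked region pinned to the source latent so layout and
        # texture outside the edit stay untouched. (A more faithful variant
        # would re-noise z_src to the current timestep before blending.)
        z = mask * z + (1.0 - mask) * z_src

    return decoder(z)                   # decode the edited latent back to RGB
```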
4. Training Details
- The encoder‑decoder is trained on large‑scale image datasets (e.g., LAION‑5B) for 200k steps.
- The diffusion model is trained for 500k steps, using classifier‑free guidance to balance prompt fidelity against sample diversity.
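Classifier‑free guidance combines a conditional and an unconditional noise prediction at sampling time; a minimal sketch follows, where the denoiser signature, the learned null embedding, and the default guidance scale are assumptions for illustration.

```python
import torch

def cfg_noise_prediction(denoiser, z_t, t, text_emb, null_emb, guidance_scale=7.5):
    """Classifier-free guidance at sampling time (sketch).

    The denoiser is queried twice, once with the text embedding and once with a
    learned "null" embedding; the two predictions are then extrapolated.
    """
    eps_cond = denoiser(z_t, t, text_emb)
    eps_uncond = denoiser(z_t, t, null_emb)
    # A larger guidance_scale pushes samples toward the prompt (fidelity),
    # while smaller values leave more room for diversity.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```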
Results & Findings
| Metric | Raw encoder features | VAE latent (baseline) | Proposed latent |
|---|---|---|---|
| Reconstruction PSNR (dB, higher is better) | 30.2 | 27.8 | 31.5 |
| Reconstruction LPIPS (lower is better) | 0.12 | 0.18 | 0.09 |
| Text‑to‑image FID (lower is better) | 12.4 | 18.7 | 9.8 |
| Editing consistency, CLIP score (higher is better) | 0.71 | 0.63 | 0.78 |
| Training convergence, epochs (lower is better) | 30 | 45 | 20 |
- Reconstruction: The new latent achieves state‑of‑the‑art pixel fidelity while preserving semantics, outperforming both classic VAE latents and raw encoder features.
- Generation: Text‑to‑image samples have lower FID and higher visual coherence, especially for complex object structures (e.g., multi‑part machinery).
- Editing: The model respects the original layout and texture outside the edited region, producing smoother transitions than VAE‑based editors.
- Efficiency: Because the latent is 4× smaller than typical VAE latents, diffusion training converges roughly 30 % faster.
Practical Implications
- Plug‑and‑Play Generative Back‑End – Developers can replace the VAE encoder in existing diffusion pipelines with the compact semantic‑pixel encoder, gaining better quality without redesigning the whole system (see the interface sketch after this list).
- Unified API for Generation & Editing – One model serves both T2I generation and region‑based editing, simplifying product stacks for AI‑powered design tools, content creation platforms, and AR/VR pipelines.
- Lower Compute Footprint – The compact 96‑channel, 16×16 latent reduces memory traffic and speeds up each diffusion step, making real‑time or on‑device inference more feasible.
- Better Control for Developers – Because the latent retains semantic structure, developers can more reliably steer generation with textual prompts or attribute vectors (e.g., “make the car red” yields consistent color changes).
- Open‑Source Ready – The released code can be integrated into popular frameworks (Diffusers, Hugging Face) with minimal changes, accelerating adoption in startups and research labs.
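As a rough illustration of the plug‑and‑play idea above, the sketch below hides the semantic‑pixel encoder/decoder behind the minimal encode/decode interface that latent‑diffusion pipelines typically expect from their VAE. The class name, method names, and scaling factor are hypothetical; this is not the released API.

```python
import torch
import torch.nn as nn

class SemanticPixelLatentCodec(nn.Module):
    """Hypothetical wrapper exposing a VAE-like encode/decode interface so the
    compact semantic-pixel latent can slot into an existing latent-diffusion
    pipeline in place of the VAE."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module, scale: float = 1.0):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.scale = scale              # latent scaling factor, assumed

    @torch.no_grad()
    def encode(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W) -> latents: (B, 96, 16, 16)
        return self.encoder(images) * self.scale

    @torch.no_grad()
    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        return self.decoder(latents / self.scale)
```

A wrapper like this keeps the rest of the pipeline (sampler, noise schedule, text encoder) untouched; only the latent shape, 96×16×16 instead of the VAE's, has to be propagated to the diffusion backbone's input configuration.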
Limitations & Future Work
- Resolution Ceiling – The 16×16 spatial grid limits the maximum output resolution without additional up‑sampling stages; ultra‑high‑resolution generation still requires a separate super‑resolution model.
- Domain Generalization – The encoder is trained on large web images; performance may degrade on highly specialized domains (medical imaging, satellite data) where semantic concepts differ.
- Text Conditioning Scope – While the model handles descriptive prompts well, it struggles with highly compositional or abstract instructions that require reasoning beyond the encoder’s semantic space.
- Future Directions – The authors suggest exploring hierarchical latents (multiple spatial scales), domain‑adaptive fine‑tuning of the encoder, and integrating richer multimodal cues (e.g., depth or segmentation maps) to further boost editing precision.
Authors
- Shilong Zhang
- He Zhang
- Zhifei Zhang
- Chongjian Ge
- Shuchen Xue
- Shaoteng Liu
- Mengwei Ren
- Soo Ye Kim
- Yuqian Zhou
- Qing Liu
- Daniil Pakhomov
- Kai Zhang
- Zhe Lin
- Ping Luo
Paper Information
- arXiv ID: 2512.17909v1
- Categories: cs.CV
- Published: December 19, 2025