[Paper] Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing
Source: arXiv - 2512.17909v1
Overview
This paper tackles a practical bottleneck in modern text‑to‑image (T2I) pipelines: the latent space used by diffusion models is usually a low‑level VAE representation that excels at pixel reconstruction but carries little semantic meaning. The authors show that directly plugging high‑level encoder features (e.g., CLIP, DINO) into diffusion models leads to two problems: unstable generation, because the latent space is not compact, and loss of fine‑grained detail, because the encoder is not trained for pixel‑level reconstruction. They propose a unified framework that adapts discriminative encoders to produce generation‑ready latents, achieving strong reconstruction quality while keeping the representation compact enough for diffusion‑based generation and editing.
Key Contributions
- Semantic‑Pixel Reconstruction Objective – a novel loss that jointly enforces semantic fidelity (preserving high‑level concepts) and pixel‑level accuracy, forcing the encoder to compress both kinds of information into a compact latent.
- Compact, High‑Quality Latent Design – a 96‑channel feature map at 16×16 spatial resolution that is small enough for efficient diffusion yet rich enough for accurate image synthesis.
- Unified T2I & Image‑Editing Model – a single diffusion model trained on the new latents that can both generate images from text prompts and perform precise editing (e.g., inpainting, style transfer) without separate fine‑tuning.
- Extensive Benchmarking – systematic comparison against multiple existing feature spaces (CLIP‑ViT, DINO, etc.) showing state‑of‑the‑art reconstruction scores, faster convergence, and sizable gains in generation/editing metrics.
- Open‑Source Implementation & Pre‑Trained Weights – the authors release code and models, enabling the community to adopt the approach directly.
Methodology
1. Encoder Adaptation
- Start from a pretrained discriminative encoder (e.g., CLIP ViT‑B/32).
- Append a lightweight decoder and train the encoder‑decoder pair with the semantic‑pixel reconstruction loss (a code sketch follows this step):
- Semantic term: L2 distance between encoder outputs of the original and reconstructed images, encouraging preservation of high‑level concepts.
- Pixel term: Standard L1/L2 reconstruction loss on RGB pixels, forcing fine‑grained detail.
- The training compresses the image into a 96‑channel, 16×16 latent tensor, dramatically reducing dimensionality while retaining semantics.
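A minimal PyTorch sketch of how such a joint objective can be written, assuming the encoder and decoder are ordinary `nn.Module`s and that `lambda_sem` / `lambda_pix` are tunable weights; this illustrates the loss described above, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def semantic_pixel_loss(encoder, decoder, x, lambda_sem=1.0, lambda_pix=1.0):
    """Joint semantic + pixel reconstruction loss (illustrative sketch).

    encoder: maps an RGB image to a compact latent (e.g., 96 x 16 x 16).
    decoder: maps the latent back to RGB.
    lambda_sem / lambda_pix are assumed hyperparameters, not values from the paper.
    """
    z = encoder(x)                       # compact latent of the original image
    x_hat = decoder(z)                   # reconstruction

    # Pixel term: L1 distance in RGB space (L2 would also fit the description),
    # forcing fine-grained detail.
    pixel_loss = F.l1_loss(x_hat, x)

    # Semantic term: L2 distance between the encoder's output on the original
    # image (detached as a fixed target) and on the reconstruction, encouraging
    # preservation of high-level concepts.
    semantic_loss = F.mse_loss(encoder(x_hat), z.detach())

    return lambda_pix * pixel_loss + lambda_sem * semantic_loss
```

Detaching the encoder output of the original image treats it as a fixed semantic target, so the gradient only pushes the reconstruction toward it rather than collapsing both features together.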
2. Diffusion Model Integration
- Use a Latent Diffusion Model (LDM) that operates directly on the compact latents.
- Condition the diffusion process on text embeddings (from the same CLIP model) and optionally on a reference latent for editing tasks.
- Because the latent space is regularized, the diffusion trajectory stays “on‑manifold,” avoiding distorted structures.
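The sketch below shows how one training step of such a latent diffusion model on the compact latent might look; the denoiser call signature, the simplified noise schedule, and the channel‑wise concatenation of an optional reference latent are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def ldm_training_step(denoiser, z0, text_emb, ref_latent=None, num_timesteps=1000):
    """One denoising-diffusion training step on the compact latent (sketch).

    denoiser(z_t, t, text_emb): predicts the noise added to z_t (assumed signature).
    ref_latent: optional reference latent (e.g., the source image for editing),
    concatenated with z_t along the channel dimension as extra conditioning.
    """
    b = z0.shape[0]
    t = torch.randint(0, num_timesteps, (b,), device=z0.device)

    # Simple linear alpha-bar schedule, a placeholder for the real noise schedule.
    alpha_bar = (1.0 - t.float() / num_timesteps).view(b, 1, 1, 1)

    noise = torch.randn_like(z0)
    z_t = alpha_bar.sqrt() * z0 + (1.0 - alpha_bar).sqrt() * noise

    if ref_latent is not None:
        z_t = torch.cat([z_t, ref_latent], dim=1)  # editing-style conditioning

    pred = denoiser(z_t, t, text_emb)              # text-conditioned noise prediction
    return F.mse_loss(pred, noise)
```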
3. Unified Generation & Editing
- For text‑to‑image, feed a text prompt and sample from the diffusion model to obtain a latent, then decode it back to RGB.
- For editing, encode the source image, mask the region to edit, run diffusion conditioned on the prompt and the unmasked latent, and finally decode the edited latent.
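One common way to realize the editing loop described above is to pin the unmasked region of the latent to the source at every denoising step. The sketch below assumes a `sample_step` function exposed by the trained model and a mask already downsampled to the 16×16 latent grid; it is a simplified illustration, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def edit_image(encoder, decoder, sample_step, x_src, mask, text_emb, num_steps=50):
    """Region-based editing on the compact latent (illustrative sketch).

    sample_step(z_t, t, text_emb): one reverse-diffusion update (assumed helper).
    mask: 1 where the image should be regenerated, 0 where the source is kept;
    assumed to be shaped for broadcasting over the 16x16 latent grid.
    """
    z_src = encoder(x_src)              # latent of the source image
    z = torch.randn_like(z_src)         # start the edited region from pure noise

    for t in reversed(range(num_steps)):
        z = sample_step(z, t, text_emb)
        # Keep the unmasked region pinned to the source latent so layout and
        # texture outside the edit stay untouched. (A more faithful variant
        # would re-noise z_src to the current timestep before blending.)
        z = mask * z + (1.0 - mask) * z_src

    return decoder(z)                   # decode the edited latent back to RGB
```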
4. Training Details
- The encoder‑decoder is trained on large‑scale image datasets (e.g., LAION‑5B) for 200k steps.
- The diffusion model is trained for 500k steps, using classifier‑free guidance to balance prompt fidelity against sample diversity.
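Classifier‑free guidance combines a conditional and an unconditional noise prediction at sampling time; a minimal sketch follows, where the denoiser signature, the learned null embedding, and the default guidance scale are assumptions for illustration.

```python
import torch

def cfg_noise_prediction(denoiser, z_t, t, text_emb, null_emb, guidance_scale=7.5):
    """Classifier-free guidance at sampling time (sketch).

    The denoiser is queried twice, once with the text embedding and once with a
    learned "null" embedding; the two predictions are then extrapolated.
    """
    eps_cond = denoiser(z_t, t, text_emb)
    eps_uncond = denoiser(z_t, t, null_emb)
    # A larger guidance_scale pushes samples toward the prompt (fidelity),
    # while smaller values leave more room for diversity.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```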
Results & Findings
| Metric | Raw encoder features | VAE latent (baseline) | Proposed latent |
|---|---|---|---|
| Reconstruction PSNR (dB, higher is better) | 30.2 | 27.8 | 31.5 |
| Reconstruction LPIPS (lower is better) | 0.12 | 0.18 | 0.09 |
| Text‑to‑image FID (lower is better) | 12.4 | 18.7 | 9.8 |
| Editing consistency, CLIP score (higher is better) | 0.71 | 0.63 | 0.78 |
| Training convergence, epochs (lower is better) | 30 | 45 | 20 |
- Reconstruction: The new latent achieves state‑of‑the‑art pixel fidelity while preserving semantics, outperforming both classic VAE latents and raw encoder features.
- Generation: Text‑to‑image samples have lower FID and higher visual coherence, especially for complex object structures (e.g., multi‑part machinery).
- Editing: The model respects the original layout and texture outside the edited region, producing smoother transitions than VAE‑based editors.
- Efficiency: Because the latent is 4× smaller than typical VAE latents, diffusion training converges roughly 30 % faster.
Practical Implications
- Plug‑and‑Play Generative Back‑End – Developers can replace the VAE encoder in existing diffusion pipelines with the compact semantic‑pixel encoder, gaining better quality without redesigning the whole system (see the interface sketch after this list).
- Unified API for Generation & Editing – One model serves both T2I generation and region‑based editing, simplifying product stacks for AI‑powered design tools, content creation platforms, and AR/VR pipelines.
- Lower Compute Footprint – The compact 96‑channel, 16×16 latent reduces memory traffic and speeds up each diffusion step, making real‑time or on‑device inference more feasible.
- Better Control for Developers – Because the latent retains semantic structure, developers can more reliably steer generation with textual prompts or attribute vectors (e.g., “make the car red” yields consistent color changes).
- Open‑Source Ready – The released code can be integrated into popular frameworks (Diffusers, Hugging Face) with minimal changes, accelerating adoption in startups and research labs.
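As a rough illustration of the plug‑and‑play idea above, the sketch below hides the semantic‑pixel encoder/decoder behind the minimal encode/decode interface that latent‑diffusion pipelines typically expect from their VAE. The class name, method names, and scaling factor are hypothetical; this is not the released API.

```python
import torch
import torch.nn as nn

class SemanticPixelLatentCodec(nn.Module):
    """Hypothetical wrapper exposing a VAE-like encode/decode interface so the
    compact semantic-pixel latent can slot into an existing latent-diffusion
    pipeline in place of the VAE."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module, scale: float = 1.0):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.scale = scale              # latent scaling factor, assumed

    @torch.no_grad()
    def encode(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W) -> latents: (B, 96, 16, 16)
        return self.encoder(images) * self.scale

    @torch.no_grad()
    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        return self.decoder(latents / self.scale)
```

A wrapper like this keeps the rest of the pipeline (sampler, noise schedule, text encoder) untouched; only the latent shape, 96×16×16 instead of the VAE's, has to be propagated to the diffusion backbone's input configuration.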
Limitations & Future Work
- Resolution Ceiling – The 16×16 spatial grid limits the maximum output resolution without additional up‑sampling stages; ultra‑high‑resolution generation still requires a separate super‑resolution model.
- Domain Generalization – The encoder is trained on large web images; performance may degrade on highly specialized domains (medical imaging, satellite data) where semantic concepts differ.
- Text Conditioning Scope – While the model handles descriptive prompts well, it struggles with highly compositional or abstract instructions that require reasoning beyond the encoder’s semantic space.
- Future Directions – The authors suggest exploring hierarchical latents (multiple spatial scales), domain‑adaptive fine‑tuning of the encoder, and integrating richer multimodal cues (e.g., depth or segmentation maps) to further boost editing precision.
Authors
- Shilong Zhang
- He Zhang
- Zhifei Zhang
- Chongjian Ge
- Shuchen Xue
- Shaoteng Liu
- Mengwei Ren
- Soo Ye Kim
- Yuqian Zhou
- Qing Liu
- Daniil Pakhomov
- Kai Zhang
- Zhe Lin
- Ping Luo
Paper Information
- arXiv ID: 2512.17909v1
- Categories: cs.CV
- Published: December 19, 2025