[Paper] One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation
Source: arXiv - 2512.07829v1
Overview
This paper introduces FAE (Feature Auto‑Encoder), a lightweight framework that lets you plug any high‑quality pretrained visual encoder (e.g., DINO, SigLIP) into modern image‑generation models such as diffusion models or normalizing flows. By using just a single attention layer to bridge the gap between high‑dimensional “understanding‑friendly” features and low‑dimensional “generation‑friendly” latents, FAE achieves state‑of‑the‑art image quality while dramatically simplifying the adaptation pipeline.
Key Contributions
- One‑layer adaptation: Shows that a single attention layer is sufficient to compress pretrained features into a generative‑ready latent space.
- Dual‑decoder architecture: Couples a reconstruction decoder (to preserve the original feature semantics) with a generation decoder (to synthesize images), enabling joint training without complex losses.
- Encoder‑agnostic design: Works with a variety of self‑supervised encoders (DINO, SigLIP, etc.), making the approach reusable across projects.
- Model‑agnostic integration: Demonstrated on both diffusion models and normalizing‑flow generators, proving the method’s flexibility.
- Strong empirical results: Near‑state‑of‑the‑art FID scores on ImageNet‑256 (1.29 with classifier‑free guidance, 1.48 without) using far fewer training epochs than typical baselines.
Methodology
- Pretrained Feature Extraction – A frozen visual encoder processes an input image and outputs a high‑dimensional feature map (e.g., 768‑dim DINO tokens).
- Feature Auto‑Encoder (FAE) – three components built on top of the frozen features (a minimal sketch follows this list):
  - Compression Layer: a single multi‑head attention module reduces the feature map to a low‑dimensional latent (e.g., 64‑dim).
  - Reconstruction Decoder: expands the compressed latent back and reconstructs the original feature map under an L2 loss.
  - Generation Decoder: receives the reconstructed features and feeds them into the downstream generative model (diffusion or flow) that produces the final image.
- Joint Training – Both decoders are trained simultaneously. The reconstruction loss keeps the latent faithful to the pretrained semantics, while the generative loss (e.g., diffusion denoising objective) ensures the latent is suitable for high‑quality synthesis.
- Plug‑and‑Play – Because the encoder stays frozen and the compression layer is tiny, swapping in a different pretrained encoder or a different generator requires only minor re‑initialisation.
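Below is a minimal PyTorch sketch of this pipeline as summarized above, not the authors' released code: it assumes 768‑dim frozen encoder tokens, a 64‑dim latent, and hypothetical names (`FAECompressor`, `ReconstructionDecoder`, `generator.denoising_loss`), and it abstracts the pixel‑space generation decoder behind the generator's training objective.

```python
import torch
import torch.nn as nn


class FAECompressor(nn.Module):
    """One multi-head self-attention layer plus a projection that maps frozen
    encoder tokens (e.g., 768-dim) to a low-dimensional latent (e.g., 64-dim).
    Hypothetical sketch; the paper's exact layer layout may differ."""
    def __init__(self, feat_dim: int = 768, latent_dim: int = 64, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(feat_dim, latent_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, feat_dim) tokens from the frozen visual encoder
        attended, _ = self.attn(feats, feats, feats)
        return self.proj(attended)          # (B, N, latent_dim)


class ReconstructionDecoder(nn.Module):
    """Expands the latent back to feature space so an L2 loss can keep the
    latent faithful to the pretrained semantics."""
    def __init__(self, latent_dim: int = 64, feat_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)                  # (B, N, feat_dim)


def joint_training_step(frozen_encoder, compressor, recon_decoder, generator,
                        images, recon_weight: float = 1.0):
    """One joint step combining the reconstruction (L2) and generative losses.
    `generator.denoising_loss` is a placeholder for whichever diffusion/flow
    objective is used; the pixel-space generation decoder is abstracted away."""
    with torch.no_grad():                   # the pretrained encoder stays frozen
        feats = frozen_encoder(images)      # (B, N, 768)
    z = compressor(feats)                   # (B, N, 64) generation-friendly latent
    recon_loss = torch.mean((recon_decoder(z) - feats) ** 2)
    gen_loss = generator.denoising_loss(z)  # hypothetical API
    return gen_loss + recon_weight * recon_loss
```

In this sketch the L2 reconstruction term is what keeps the 64‑dim latent tied to the encoder's semantics; the equal weighting of the two losses is an assumption, not a value taken from the paper.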
Results & Findings
| Dataset / Setting | Model | CFG? | FID (800 epochs) | FID (80 epochs) |
|---|---|---|---|---|
| ImageNet‑256 | Diffusion + FAE | Yes | 1.29 (near‑SOTA) | 1.70 |
| ImageNet‑256 | Diffusion + FAE | No | 1.48 (SOTA) | 2.08 |
- Fast convergence: Even with only 80 training epochs, FAE reaches competitive FID scores, highlighting the efficiency of reusing pretrained representations.
- Robust across tasks: The same pipeline works for class‑conditional generation and text‑to‑image setups, showing its generality.
- Low overhead: Adding a single attention layer adds negligible parameters and compute compared to the full generator, yet yields a large quality boost (see the rough estimate after this list).
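As a rough sanity check on that overhead claim (a back‑of‑the‑envelope estimate, not a number from the paper), a standard 768‑dim multi‑head attention layer contributes only a few million parameters, versus hundreds of millions for a DiT‑XL‑scale generator:

```python
# Back-of-the-envelope parameter count for one multi-head attention layer:
# Q, K, V and output projections are each (d x d); biases and norms ignored.
d = 768                                  # encoder feature dimension (assumption)
attn_params = 4 * d * d                  # ~2.4M parameters
generator_params = 675_000_000           # DiT-XL/2 scale, used here only for comparison
print(f"attention layer: {attn_params / 1e6:.1f}M params "
      f"({100 * attn_params / generator_params:.2f}% of a 675M generator)")
# -> attention layer: 2.4M params (0.35% of a 675M generator)
```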
Practical Implications
- Rapid prototyping: Teams can leverage existing self‑supervised vision models instead of training a new encoder from scratch, cutting down development time.
- Resource‑efficient training: Because the bulk of the visual knowledge is frozen, most of the training budget goes to the generative part, enabling high‑quality results on modest GPU budgets.
- Modular pipelines: FAE’s plug‑and‑play nature fits well with existing ML infrastructure—swap in a newer encoder (e.g., CLIP‑based) or a different diffusion backbone without redesigning the whole system (a usage sketch follows this list).
- Better downstream control: Preserving the original feature semantics through the reconstruction decoder can be leveraged for conditional generation, style transfer, or editing tasks that rely on semantic consistency.
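A hypothetical usage sketch of that plug‑and‑play workflow, assuming the publicly released DINOv2 ViT‑B/14 torch.hub entry point and reusing the `FAECompressor` class from the earlier sketch; swapping in SigLIP or another backbone changes only the feature‑extraction call and `feat_dim`.

```python
import torch

# Load a frozen pretrained encoder (DINOv2 ViT-B/14 via torch.hub; downloading the
# weights the first time is an assumption about the deployment environment).
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
encoder.eval().requires_grad_(False)

images = torch.randn(2, 3, 224, 224)                 # dummy batch for shape checking
with torch.no_grad():
    out = encoder.forward_features(images)
    feats = out["x_norm_patchtokens"]                # (2, 256, 768) patch tokens

# Only the tiny compressor (and the downstream generator) is trained; swapping the
# encoder just changes `feat_dim` and the two extraction lines above.
compressor = FAECompressor(feat_dim=feats.shape[-1], latent_dim=64)  # from the earlier sketch
latents = compressor(feats)                          # (2, 256, 64)
print(latents.shape)
```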
Limitations & Future Work
- Frozen encoder assumption: The current design keeps the pretrained encoder fixed; fine‑tuning it jointly with the generator could further improve performance but was not explored.
- Latent dimensionality trade‑off: While a single attention layer works well, the optimal latent size may vary across datasets and generators; automatic tuning is left for future research.
- Scope of benchmarks: Experiments focus on ImageNet‑256 and a few text‑to‑image setups; broader evaluation on higher‑resolution datasets (e.g., LSUN, COCO) and other generative families (GANs, VQ‑VAEs) would solidify the claims.
- Interpretability of compressed latents: Understanding how much semantic information survives the one‑layer compression remains an open question, which could guide more explainable generation pipelines.
FAE shows that you don’t need a heavyweight adapter to bridge the gap between powerful visual encoders and generative models—sometimes, one well‑placed attention layer is all it takes.
Authors
- Yuan Gao
- Chen Chen
- Tianrong Chen
- Jiatao Gu
Paper Information
- arXiv ID: 2512.07829v1
- Categories: cs.CV, cs.AI
- Published: December 8, 2025