[Paper] Laminating Representation Autoencoders for Efficient Diffusion
Source: arXiv - 2602.04873v1
Overview
A new paper from Ramón Calvo‑González and François Fleuret shows how to make diffusion‑based image generation far more efficient by compressing the rich patch‑level features produced by state‑of‑the‑art self‑supervised encoders (e.g., DINOv2). Their FlatDINO variational autoencoder squeezes a dense 2‑D grid of visual tokens into a short 1‑D sequence of just 32 continuous embeddings, cutting the diffusion model’s compute budget dramatically while preserving generation quality.
Key Contributions
- FlatDINO VAE: A lightweight variational autoencoder that compresses DINOv2 patch embeddings (a 16 × 16 grid of 1024‑dim vectors, i.e., 256 tokens) into a 32‑token latent sequence, achieving an 8× reduction in sequence length and a ~48× reduction in total dimensionality.
- Efficient Diffusion Training: Demonstrates that a DiT‑XL diffusion model trained on FlatDINO latents reaches a gFID of 1.80 on ImageNet‑256, matching the quality of diffusion on raw DINOv2 features.
- Compute Savings: Shows up to 8× fewer FLOPs per forward pass and 4.5× fewer FLOPs per training step compared with using uncompressed DINOv2 features.
- Proof‑of‑Concept Pipeline: Integrates a self‑supervised encoder → FlatDINO → diffusion model, providing a practical recipe for developers who already rely on DINO‑style representations.
Methodology
Feature Extraction
Images are first passed through a pretrained DINOv2 encoder, yielding a dense grid of patch embeddings (e.g., 16 × 16 patches, each a 1024‑dim vector).
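As a concrete illustration of this step, here is a minimal PyTorch sketch using the public DINOv2 weights from torch.hub; the specific backbone variant and input resolution are assumptions chosen to match the dimensions above, not details confirmed by the paper:

```python
import torch

# Load a pretrained DINOv2 backbone; ViT-L/14 yields 1024-dim patch embeddings.
# The specific variant is an assumption for illustration.
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
encoder.eval()

# A batch of 224x224 images: with 14-pixel patches this gives a 16x16 grid.
images = torch.randn(4, 3, 224, 224)

with torch.no_grad():
    out = encoder.forward_features(images)
    patch_tokens = out["x_norm_patchtokens"]  # shape: (4, 256, 1024)

print(patch_tokens.shape)  # 256 patch embeddings per image, 1024-dim each
```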
Variational Compression (FlatDINO)
- A convolutional encoder aggregates the 2‑D grid into a compact latent distribution (mean + log‑var).
- Sampling from this distribution produces a fixed‑length 1‑D sequence of 32 tokens (each token ≈ 256‑dim).
- A symmetric decoder reconstructs the original patch grid, and the VAE is trained with a standard reconstruction loss plus a KL‑divergence regularizer.
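A minimal sketch of this compression scheme follows; the layer shapes and KL weight are illustrative guesses, and the real FlatDINO uses a convolutional encoder over the 2‑D grid rather than the flat linear projections used here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyFlatVAE(nn.Module):
    """Illustrative stand-in for FlatDINO: compresses a 256 x 1024 patch grid
    into 32 latent tokens of 256 dims each (architecture is hypothetical)."""

    def __init__(self, n_patches=256, patch_dim=1024, n_tokens=32, token_dim=256):
        super().__init__()
        self.n_tokens, self.token_dim = n_tokens, token_dim
        # Encoder outputs mean and log-variance of the latent distribution.
        self.enc = nn.Linear(n_patches * patch_dim, 2 * n_tokens * token_dim)
        # Symmetric decoder reconstructs the original patch grid.
        self.dec = nn.Linear(n_tokens * token_dim, n_patches * patch_dim)

    def forward(self, x):                                  # x: (B, 256, 1024)
        mu, logvar = self.enc(x.flatten(1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        recon = self.dec(z).view_as(x)
        tokens = z.view(-1, self.n_tokens, self.token_dim)    # (B, 32, 256)
        return recon, tokens, mu, logvar

def vae_loss(x, recon, mu, logvar, beta=1e-4):
    """Standard reconstruction loss plus KL regularizer (beta is a guess)."""
    rec = F.mse_loss(recon, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kl
```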
Diffusion on Compressed Latents
The 32‑token sequence is fed to a DiT‑XL (a transformer‑based diffusion model). Because the sequence is short, attention and feed‑forward layers cost far less.
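The source of the saving is visible from how transformer cost scales with sequence length. The back‑of‑the‑envelope sketch below compares per‑layer FLOPs at 256 vs. 32 tokens using the common 2·(params)·(tokens) approximation; DiT‑XL's width of 1152 is public, but the accounting itself is a rough illustration, not the paper's measurement:

```python
def layer_flops(seq_len: int, d_model: int) -> int:
    """Rough per-layer transformer FLOPs (projections + attention + MLP)."""
    d_ff = 4 * d_model
    proj = 2 * seq_len * (4 * d_model * d_model)  # Q, K, V, output projections
    attn = 2 * 2 * seq_len * seq_len * d_model    # QK^T and attention-weighted V
    mlp = 2 * seq_len * (2 * d_model * d_ff)      # two feed-forward matmuls
    return proj + attn + mlp

d = 1152  # DiT-XL hidden width
full, compressed = layer_flops(256, d), layer_flops(32, d)
print(f"{full / compressed:.1f}x")  # ~8.3x, in line with the reported ~8x saving
```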
Guidance & Sampling
Standard classifier‑free guidance is applied during sampling to trade off fidelity vs. diversity, exactly as in conventional diffusion pipelines.
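For reference, classifier‑free guidance blends the conditional and unconditional model outputs at every sampling step. The sketch below is the generic formulation; the guidance weight w and the dummy model stand‑in are illustrative, not values from the paper:

```python
import torch

def cfg_predict(model, z_t, t, cond, null_cond, w=4.0):
    """One guidance step: extrapolate from the unconditional prediction
    toward the conditional one. w > 1 trades diversity for fidelity."""
    eps_cond = model(z_t, t, cond)          # prediction given the condition
    eps_uncond = model(z_t, t, null_cond)   # prediction with a null condition
    return eps_uncond + w * (eps_cond - eps_uncond)

# Usage with a dummy stand-in for the trained DiT:
model = lambda z, t, c: 0.1 * z + 0.01 * c
z_t = torch.randn(4, 32, 256)               # noisy 32-token FlatDINO latents
cond, null = torch.ones_like(z_t), torch.zeros_like(z_t)
eps = cfg_predict(model, z_t, torch.tensor([10]), cond, null)
```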
Results & Findings
| Metric | Diffusion on raw DINOv2 | Diffusion on FlatDINO (this work) |
|---|---|---|
| gFID (ImageNet‑256) | ~1.7‑1.9 (baseline) | 1.80 |
| Sequence length | 256 (16 × 16) | 32 |
| FLOPs per forward pass | 1× (baseline) | ≈ 1/8 |
| FLOPs per training step | 1× (baseline) | ≈ 1/4.5 |
The compressed representation retains enough semantic detail for the diffusion model to synthesize high‑quality images, while the reduced token count slashes both memory usage and compute. Qualitative samples (as shown in the paper) are visually indistinguishable from those generated from the full DINOv2 grid.
Practical Implications
- Cost‑Effective Scaling: Companies can train larger diffusion models (or run more training epochs) on the same hardware budget, thanks to the ~4.5× reduction in per‑step training FLOPs.
- Edge & Mobile Deployment: The 32‑token latent is tiny enough to be stored or transmitted efficiently, opening doors for on‑device generation where bandwidth or storage is limited.
- Hybrid Pipelines: Existing DINOv2‑based vision systems (e.g., retrieval, segmentation) can reuse the same encoder, then switch to FlatDINO for generative tasks without retraining the encoder.
- Reduced Memory Footprint: Shorter sequences mean lower GPU memory consumption, enabling higher batch sizes or the use of consumer‑grade GPUs for research and prototyping.
- Plug‑and‑Play: The VAE is trained separately, so developers can swap in alternative self‑supervised encoders (e.g., MAE, CLIP) and still reap similar compression benefits.
Limitations & Future Work
- Preliminary Results: The authors note that experiments are still early; broader benchmarks (e.g., higher resolutions, other datasets) are needed to confirm generality.
- Reconstruction Trade‑off: Compressing to 32 tokens inevitably discards some fine‑grained detail; edge cases with intricate textures may suffer.
- Encoder Dependency: FlatDINO is tuned for DINOv2 features; adapting it to other encoders may require architectural tweaks.
- Guidance Sensitivity: The optimal classifier‑free guidance weight may differ from that used with raw features, requiring extra hyper‑parameter tuning.
- Future Directions: The authors plan to explore adaptive token counts, hierarchical VAEs, and joint training of encoder‑decoder‑diffusion for end‑to‑end optimization.
Bottom line: By “laminating” a VAE on top of self‑supervised patch embeddings, FlatDINO delivers a compact, diffusion‑ready representation that slashes compute without sacrificing image quality—an exciting step toward making high‑fidelity generative models more accessible for everyday developers.
Authors
- Ramón Calvo‑González
- François Fleuret
Paper Information
- arXiv ID: 2602.04873v1
- Categories: cs.CV
- Published: February 4, 2026