[Paper] LouvreSAE: Sparse Autoencoders for Interpretable and Controllable Style Transfer

Published: December 21, 2025 at 07:36 PM EST
4 min read

Source: arXiv - 2512.18930v1

Overview

LouvreSAE proposes a lightweight, interpretable way to capture and transfer artistic style using a Sparse Autoencoder (SAE) built on top of the latent space of existing generative image models. By learning a compact set of “style steering vectors” from just a handful of reference artworks, the method enables fast, fine‑tuning‑free style transfer that remains disentangled from the image’s content.

Key Contributions

  • Sparse Autoencoder for art – Trains an SAE on latent embeddings of a pre‑trained generator, yielding a sparse, interpretable basis of stylistic and compositional concepts.
  • Style profiles as steering vectors – Constructs low‑dimensional, decomposable vectors that can be added to any latent code to impose a desired style without updating the generator.
  • Zero‑fine‑tuning transfer – No LoRA adapters, prompt engineering, or extra optimization steps are required at inference time.
  • Speed without sacrificing quality – Achieves comparable or better VGG Style Loss and CLIP‑based style scores on the ArtBench10 benchmark while being 1.7–20× faster than prior concept‑based approaches.
  • Interpretability – Each sparse dimension corresponds to an intuitive visual factor (e.g., brushstroke thickness, palette hue, texture granularity), allowing developers to manually tweak or combine styles.

Methodology

  1. Latent extraction – Images (both photographs and artworks) are passed through a pre‑trained diffusion or GAN generator; the intermediate latent vectors are collected.
  2. Sparse Autoencoding – An autoencoder with an ℓ₁‑regularized bottleneck is trained on these latents. The sparsity forces the model to represent each image using only a few active dimensions, naturally separating style from content.
  3. Concept discovery – After training, each active dimension is inspected (via visualizing decoded outputs) and labeled as a stylistic or semantic factor (e.g., “impasto brushwork”, “cool‑blue palette”).
  4. Style profile creation – For a target style, the mean activation across a small set of reference artworks is computed, yielding a style steering vector.
  5. Style transfer – To stylize a new image, the steering vector is simply added to (or linearly blended with) the image’s latent code, and the result is decoded by the generator’s existing decoder. No weight updates, LoRA modules, or extra diffusion steps are needed (see the sketch below).
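The paper is summarized here without code, but the pipeline can be sketched in a few dozen lines. This is a minimal illustration only: it assumes flat latent vectors, a single-layer SAE, and that the steering vector is formed in the SAE’s sparse code space and mapped back to latent space through the SAE decoder before blending. All names (`SparseAutoencoder`, `train_sae`, `style_steering_vector`, `stylize`, `l1_weight`, the code size) are illustrative, not taken from the paper.

```python
# Minimal sketch in PyTorch; hyperparameters and shapes are illustrative.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Autoencoder with an L1-penalized (sparse) bottleneck over generator latents."""

    def __init__(self, latent_dim: int, code_dim: int):
        super().__init__()
        self.encoder = nn.Linear(latent_dim, code_dim)
        self.decoder = nn.Linear(code_dim, latent_dim)

    def encode(self, z: torch.Tensor) -> torch.Tensor:
        # Non-negative codes; the L1 penalty drives most entries to zero.
        return torch.relu(self.encoder(z))

    def forward(self, z: torch.Tensor):
        code = self.encode(z)
        return self.decoder(code), code


def train_sae(latents: torch.Tensor, code_dim: int = 4096,
              l1_weight: float = 1e-3, steps: int = 1000, lr: float = 1e-3) -> SparseAutoencoder:
    """Step 2: fit the SAE on latents collected from a pre-trained generator."""
    sae = SparseAutoencoder(latents.shape[1], code_dim)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        recon, code = sae(latents)
        loss = nn.functional.mse_loss(recon, latents) + l1_weight * code.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae


def style_steering_vector(sae: SparseAutoencoder, reference_latents: torch.Tensor) -> torch.Tensor:
    """Step 4: mean sparse activation over a handful of reference artworks."""
    with torch.no_grad():
        return sae.encode(reference_latents).mean(dim=0)


def stylize(sae: SparseAutoencoder, content_latent: torch.Tensor,
            style_code: torch.Tensor, strength: float = 1.0) -> torch.Tensor:
    """Step 5: map the style code back to latent space and blend it into the
    content latent; the result goes to the generator's existing decoder."""
    with torch.no_grad():
        style_direction = sae.decoder(style_code)
        return content_latent + strength * style_direction
```

In practice the latents would come from the chosen generator’s encoder or intermediate activations, and the blended latent would be passed back through that generator’s own decoder; both pieces depend on the specific model and are omitted here.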

Results & Findings

Metric (ArtBench10)                     LouvreSAE    Prior Concept‑Based Methods
VGG Style Loss (lower = better)         0.42         0.55 – 0.68
CLIP Score – Style (higher = better)    0.71         0.63 – 0.68
Inference time per image                ≈ 0.12 s     0.2 s – 2.4 s
  • Quality: LouvreSAE matches or exceeds style fidelity while preserving content structure.
  • Speed: Because the method only adds a vector and runs a single forward pass, it is up to 20× faster than approaches that require iterative optimization or adapter fine‑tuning.
  • Interpretability: Visual inspection shows that toggling individual sparse dimensions yields predictable changes (e.g., increasing “brushstroke width” thickens strokes without altering the scene layout).
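As an illustration of that kind of per-dimension control, the snippet below rescales a single sparse dimension while leaving the rest of the code untouched. The dimension index, the “brushstroke width” label, and the 4096-dimensional code size are hypothetical.

```python
import torch


def toggle_dimension(style_code: torch.Tensor, dim_index: int, scale: float) -> torch.Tensor:
    """Return a copy of a sparse style code with one labeled dimension rescaled."""
    edited = style_code.clone()
    edited[dim_index] = edited[dim_index] * scale
    return edited


# Hypothetical example: dimension 117 was manually labeled "brushstroke width".
code = torch.zeros(4096)
code[117] = 0.8
thicker = toggle_dimension(code, 117, 2.0)  # thickens strokes; other concepts untouched
```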

Practical Implications

  • Rapid prototyping for creative tools – UI/UX designers can embed a “style picker” that instantly re‑styles user‑generated images with a single click and no GPU‑heavy fine‑tuning.
  • Batch processing pipelines – Studios can apply a consistent artistic signature to thousands of frames (e.g., for stylized video or game assets) with minimal compute overhead (see the sketch after this list).
  • Fine‑grained control for developers – Because each dimension is semantically labeled, developers can expose sliders for “palette temperature” or “texture granularity,” enabling deterministic, reproducible style adjustments.
  • Low‑resource deployment – Since the method works on top of any off‑the‑shelf generator, it can be shipped to edge devices (mobile, WebGL) where model updates are impractical.
  • Cross‑domain style transfer – The same steering vectors can be applied to non‑artistic domains (e.g., medical imaging visualizations) to impose a desired visual language without contaminating diagnostic content.
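For the batch-processing case, the core operation is a single broadcasted addition of one precomputed steering direction to every frame’s latent, so the per-frame cost reduces to one forward pass through the generator’s decoder. The sketch below assumes the steering direction has already been mapped into latent space (e.g., via the SAE decoder shown earlier); names and shapes are illustrative.

```python
import torch


def stylize_batch(content_latents: torch.Tensor, style_direction: torch.Tensor,
                  strength: float = 1.0) -> torch.Tensor:
    """Apply one precomputed steering direction (already in latent space) to a
    whole batch of latents with a single broadcasted addition."""
    return content_latents + strength * style_direction


# Hypothetical usage: a consistent signature across 10,000 frames.
frames = torch.randn(10_000, 512)     # stand-in for per-frame latents
direction = torch.randn(512)          # stand-in for a precomputed style direction
stylized = stylize_batch(frames, direction, strength=0.8)
```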

Limitations & Future Work

  • Domain dependence – The SAE is trained on an art‑centric dataset; transferring to highly divergent domains (e.g., satellite imagery) may require re‑training or domain adaptation.
  • Granularity of concepts – While many dimensions map cleanly to visual factors, some remain entangled, limiting precise control for very subtle style nuances.
  • Scalability of concept labeling – Manual inspection was used to name dimensions; automating this step could accelerate adoption.
  • Future directions suggested by the authors include: extending the sparse basis to multimodal inputs (e.g., text‑guided style cues), integrating with diffusion‑based generators for higher‑resolution outputs, and exploring hierarchical sparsity to capture style at multiple spatial scales.

Authors

  • Raina Panda
  • Daniel Fein
  • Arpita Singhal
  • Mark Fiore
  • Maneesh Agrawala
  • Matyas Bohacek

Paper Information

  • arXiv ID: 2512.18930v1
  • Categories: cs.CV, cs.AI, cs.GR
  • Published: December 22, 2025