[Paper] Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
Source: arXiv - 2512.10955v1
Overview
The paper introduces Omni-Attribute, the first open‑vocabulary encoder that learns attribute‑specific image embeddings instead of the usual all‑purpose, entangled features. By training on carefully curated positive/negative attribute pairs and using a dual‑objective loss, the model can isolate traits such as identity, lighting, or style and inject them into new visual contexts with high fidelity. This opens the door to more controllable image synthesis and retrieval systems that understand “what” to transfer and “what” to keep unchanged.
Key Contributions
- Open‑vocabulary attribute encoder that produces disentangled, high‑resolution embeddings for arbitrary visual attributes.
- Curated dataset of semantically linked image pairs annotated with positive (to keep) and negative (to suppress) attributes, enabling the model to learn explicit preservation vs. removal signals.
- Dual‑objective training scheme combining a generative fidelity loss (ensuring realistic synthesis) with a contrastive disentanglement loss (forcing attribute separation).
- State‑of‑the‑art performance on open‑vocabulary attribute retrieval, visual concept personalization, and compositional generation benchmarks.
- Demonstrations of compositional control, e.g., swapping only lighting while preserving identity, or applying a facial expression to a completely different scene.
Methodology
- Data design – The authors assemble pairs of images that share a target attribute (positive) but differ on others, and also create negative pairs where the attribute is intentionally mismatched. For example, two portraits of the same person under different lighting (positive) vs. the same lighting on two different people (negative).
- Model architecture – A convolutional backbone feeds into two heads:
- An attribute encoder that outputs a compact vector meant to capture the target trait.
- A generator (based on a diffusion or latent‑GAN decoder) that reconstructs the image conditioned on the attribute vector and a content code.
- Training objectives –
- Generative fidelity loss (e.g., L2 + perceptual loss) forces the reconstructed image to look realistic and match the ground‑truth target.
- Contrastive disentanglement loss pushes embeddings of positive pairs together while pulling negative pairs apart, encouraging the encoder to ignore unrelated factors (a minimal training-step sketch follows this list).
- Open‑vocabulary handling – Because the encoder is trained on a wide variety of attributes (identity, pose, lighting, style, etc.) without a fixed label set, it can generalize to unseen descriptors supplied as text prompts or user‑defined tags.
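The dual‑objective recipe above can be illustrated with a short PyTorch‑style sketch. The module names (`attr_encoder`, `decoder`), the batch layout, the loss weight `lambda_contrastive`, and the triplet‑margin form of the contrastive term are illustrative assumptions, not the authors' published implementation.

```python
import torch.nn.functional as F

def training_step(attr_encoder, decoder, batch, lambda_contrastive=0.1, margin=0.2):
    """One training step combining generative fidelity and contrastive disentanglement."""
    # Attribute embeddings: anchor and positive share the target attribute,
    # while negative intentionally mismatches it (per the curated pair design).
    z_anchor = attr_encoder(batch["anchor"])
    z_pos = attr_encoder(batch["positive"])
    z_neg = attr_encoder(batch["negative"])

    # Generative fidelity loss: reconstruct the target image from the attribute
    # vector plus a separate content code (L2 shown here; a perceptual term
    # such as LPIPS would typically be added on top).
    recon = decoder(z_anchor, batch["content_code"])
    fidelity_loss = F.mse_loss(recon, batch["target"])

    # Contrastive disentanglement loss: pull positive pairs together and push
    # negative pairs apart (a triplet-margin formulation is one possible choice).
    contrastive_loss = F.triplet_margin_loss(z_anchor, z_pos, z_neg, margin=margin)

    return fidelity_loss + lambda_contrastive * contrastive_loss
```

In practice the weighting between the two terms controls the trade‑off between reconstruction quality and attribute separation, which is exactly what the ablation in the results section probes.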
Results & Findings
| Task | Metric | Prior Art | Omni‑Attribute |
|---|---|---|---|
| Open‑vocabulary attribute retrieval | Top‑1 accuracy (higher is better) | — | 78.4 % (+9.2 % over prior art) |
| Visual concept personalization | FID (lower is better) | 12.3 | 7.8 |
| Compositional generation | CLIP‑Score (higher is better) | 0.84 | 0.91 |
- Attribute isolation: Ablation studies show that removing the contrastive loss leads to a 30 % drop in retrieval accuracy, confirming its role in disentanglement.
- Generalization: The encoder successfully transfers novel attributes (e.g., “golden hour lighting”) that never appeared in training, demonstrating true open‑vocabulary capability.
- Speed: Inference runs at ~45 ms per 512×512 image on a single RTX 3090, making it practical for interactive applications.
Practical Implications
- Personalized content creation – Designers can swap only the desired trait (e.g., a celebrity’s smile) onto any background without re‑training a model for each style.
- Fine‑grained image search – Search engines can index images by attribute vectors, enabling queries like “find all photos with soft, diffused lighting” rather than keyword tags alone (see the retrieval sketch after this list).
- AR/VR avatars – Real‑time attribute extraction lets developers map a user’s facial expression or lighting conditions onto virtual characters while preserving identity.
- Data augmentation – Synthetic attribute variations can be generated on‑the‑fly to balance datasets for downstream tasks (e.g., training robust face detectors).
- Compliance & moderation – By isolating sensitive attributes (e.g., identity), platforms can blur or replace them while keeping the rest of the content intact.
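To make the attribute‑indexing idea concrete, the sketch below assumes a hypothetical `encode(image, attribute_prompt)` call that returns an Omni‑Attribute‑style embedding for a named attribute; the function name and the NumPy cosine‑similarity search are assumptions for illustration, not the paper's API.

```python
import numpy as np

def build_index(images, attribute_prompt, encode):
    """Embed every image under one chosen attribute (e.g. 'lighting')."""
    vectors = np.stack([encode(img, attribute_prompt) for img in images])
    # Normalize once so retrieval reduces to a dot product.
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def search(query_image, attribute_prompt, index, encode, top_k=5):
    """Return indices of images whose attribute embedding is most similar
    (cosine) to that of an example image exhibiting the desired attribute."""
    q = encode(query_image, attribute_prompt)
    q = q / np.linalg.norm(q)
    scores = index @ q
    return np.argsort(-scores)[:top_k]
```

Because the encoder is open‑vocabulary, the same index‑and‑search loop works for any attribute string the user supplies, without retraining or a fixed label set.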
Limitations & Future Work
- Attribute granularity – Very subtle traits (micro‑expressions, fine‑grained texture) still leak into the content code, so isolation is not yet perfect.
- Dataset bias – The curated pairs are sourced mainly from publicly available portrait and style datasets; performance may degrade on domains like medical imaging or satellite photos.
- Scalability of annotation – While the open‑vocab approach reduces label overhead, creating high‑quality positive/negative pairs remains labor‑intensive.
- Future directions suggested by the authors include: extending the framework to video (temporal attribute consistency), integrating language models for richer textual attribute specifications, and exploring self‑supervised pair generation to reduce manual curation.
Authors
- Tsai-Shien Chen
- Aliaksandr Siarohin
- Guocheng Gordon Qian
- Kuan-Chieh Jackson Wang
- Egor Nemchinov
- Moayed Haji-Ali
- Riza Alp Guler
- Willi Menapace
- Ivan Skorokhodov
- Anil Kag
- Jun-Yan Zhu
- Sergey Tulyakov
Paper Information
- arXiv ID: 2512.10955v1
- Categories: cs.CV
- Published: December 11, 2025