[Paper] Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
Source: arXiv - 2512.10955v1
Overview
The paper introduces Omni-Attribute, the first open‑vocabulary encoder that learns attribute‑specific image embeddings instead of the usual all‑purpose, entangled features. By training on carefully curated positive/negative attribute pairs and using a dual‑objective loss, the model can isolate traits such as identity, lighting, or style and inject them into new visual contexts with high fidelity. This opens the door to more controllable image synthesis and retrieval systems that understand “what” to transfer and “what” to keep unchanged.
Key Contributions
- Open‑vocabulary attribute encoder that produces disentangled, high‑resolution embeddings for arbitrary visual attributes.
- Curated dataset of semantically linked image pairs annotated with positive (to keep) and negative (to suppress) attributes, enabling the model to learn explicit preservation vs. removal signals.
- Dual‑objective training scheme combining a generative fidelity loss (ensuring realistic synthesis) with a contrastive disentanglement loss (forcing attribute separation).
- State‑of‑the‑art performance on open‑vocabulary attribute retrieval, visual concept personalization, and compositional generation benchmarks.
- Demonstrations of compositional control, e.g., swapping only lighting while preserving identity, or applying a facial expression to a completely different scene.
Methodology
- Data design – The authors assemble pairs of images that share a target attribute (positive) but differ on others, and also create negative pairs where the attribute is intentionally mismatched. For example, two portraits of the same person under different lighting (positive) vs. the same lighting on two different people (negative).
- Model architecture – A convolutional backbone feeds into two heads:
- An attribute encoder that outputs a compact vector meant to capture the target trait.
- A generator (based on a diffusion or latent‑GAN decoder) that reconstructs the image conditioned on the attribute vector and a content code.
- Training objectives –
- Generative fidelity loss (e.g., L2 + perceptual loss) forces the reconstructed image to look realistic and match the ground‑truth target.
- Contrastive disentanglement loss pushes embeddings of positive pairs together while pulling negative pairs apart, encouraging the encoder to ignore unrelated factors (a minimal training-step sketch follows this list).
- Open‑vocabulary handling – Because the encoder is trained on a wide variety of attributes (identity, pose, lighting, style, etc.) without a fixed label set, it can generalize to unseen descriptors supplied as text prompts or user‑defined tags.
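The dual‑objective recipe above can be illustrated with a short PyTorch‑style sketch. The module names (`attr_encoder`, `decoder`), the batch layout, the loss weight `lambda_contrastive`, and the triplet‑margin form of the contrastive term are illustrative assumptions, not the authors' published implementation.

```python
import torch.nn.functional as F

def training_step(attr_encoder, decoder, batch, lambda_contrastive=0.1, margin=0.2):
    """One training step combining generative fidelity and contrastive disentanglement."""
    # Attribute embeddings: anchor and positive share the target attribute,
    # while negative intentionally mismatches it (per the curated pair design).
    z_anchor = attr_encoder(batch["anchor"])
    z_pos = attr_encoder(batch["positive"])
    z_neg = attr_encoder(batch["negative"])

    # Generative fidelity loss: reconstruct the target image from the attribute
    # vector plus a separate content code (L2 shown here; a perceptual term
    # such as LPIPS would typically be added on top).
    recon = decoder(z_anchor, batch["content_code"])
    fidelity_loss = F.mse_loss(recon, batch["target"])

    # Contrastive disentanglement loss: pull positive pairs together and push
    # negative pairs apart (a triplet-margin formulation is one possible choice).
    contrastive_loss = F.triplet_margin_loss(z_anchor, z_pos, z_neg, margin=margin)

    return fidelity_loss + lambda_contrastive * contrastive_loss
```

In practice the weighting between the two terms controls the trade‑off between reconstruction quality and attribute separation, which is exactly what the ablation in the results section probes.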
Results & Findings
| Task | Metric | Prior Art | Omni‑Attribute |
|---|---|---|---|
| Open‑vocabulary attribute retrieval | Top‑1 accuracy (higher is better) | — | 78.4 % (+9.2 % over prior art) |
| Visual concept personalization | FID (lower is better) | 12.3 | 7.8 |
| Compositional generation | CLIP‑Score (higher is better) | 0.84 | 0.91 |
- Attribute isolation: Ablation studies show that removing the contrastive loss leads to a 30 % drop in retrieval accuracy, confirming its role in disentanglement.
- Generalization: The encoder successfully transfers novel attributes (e.g., “golden hour lighting”) that never appeared in training, demonstrating true open‑vocabulary capability.
- Speed: Inference runs at ~45 ms per 512×512 image on a single RTX 3090, making it practical for interactive applications.
Practical Implications
- Personalized content creation – Designers can swap only the desired trait (e.g., a celebrity’s smile) onto any background without re‑training a model for each style.
- Fine‑grained image search – Search engines can index images by attribute vectors, enabling queries like “find all photos with soft, diffused lighting” rather than keyword tags alone (see the retrieval sketch after this list).
- AR/VR avatars – Real‑time attribute extraction lets developers map a user’s facial expression or lighting conditions onto virtual characters while preserving identity.
- Data augmentation – Synthetic attribute variations can be generated on‑the‑fly to balance datasets for downstream tasks (e.g., training robust face detectors).
- Compliance & moderation – By isolating sensitive attributes (e.g., identity), platforms can blur or replace them while keeping the rest of the content intact.
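To make the attribute‑indexing idea concrete, the sketch below assumes a hypothetical `encode(image, attribute_prompt)` call that returns an Omni‑Attribute‑style embedding for a named attribute; the function name and the NumPy cosine‑similarity search are assumptions for illustration, not the paper's API.

```python
import numpy as np

def build_index(images, attribute_prompt, encode):
    """Embed every image under one chosen attribute (e.g. 'lighting')."""
    vectors = np.stack([encode(img, attribute_prompt) for img in images])
    # Normalize once so retrieval reduces to a dot product.
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def search(query_image, attribute_prompt, index, encode, top_k=5):
    """Return indices of images whose attribute embedding is most similar
    (cosine) to that of an example image exhibiting the desired attribute."""
    q = encode(query_image, attribute_prompt)
    q = q / np.linalg.norm(q)
    scores = index @ q
    return np.argsort(-scores)[:top_k]
```

Because the encoder is open‑vocabulary, the same index‑and‑search loop works for any attribute string the user supplies, without retraining or a fixed label set.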
Limitations & Future Work
- Attribute granularity – Very subtle traits (micro‑expressions, fine‑grained texture) still leak into the content code, so isolation is not yet perfect.
- Dataset bias – The curated pairs are sourced mainly from publicly available portrait and style datasets; performance may degrade on domains like medical imaging or satellite photos.
- Scalability of annotation – While the open‑vocab approach reduces label overhead, creating high‑quality positive/negative pairs remains labor‑intensive.
- Future directions suggested by the authors include: extending the framework to video (temporal attribute consistency), integrating language models for richer textual attribute specifications, and exploring self‑supervised pair generation to reduce manual curation.
Authors
- Tsai-Shien Chen
- Aliaksandr Siarohin
- Guocheng Gordon Qian
- Kuan-Chieh Jackson Wang
- Egor Nemchinov
- Moayed Haji-Ali
- Riza Alp Guler
- Willi Menapace
- Ivan Skorokhodov
- Anil Kag
- Jun-Yan Zhu
- Sergey Tulyakov
Paper Information
- arXiv ID: 2512.10955v1
- Categories: cs.CV
- Published: December 11, 2025