[Paper] SEAL: Semantic-aware Single-image Sticker Personalization with a Large-scale Sticker-tag Dataset
Source: arXiv - 2604.26883v1
Overview
The paper SEAL: Semantic‑aware Single‑image Sticker Personalization with a Large‑scale Sticker‑tag Dataset tackles a practical problem many developers face when using diffusion models for custom sticker generation: how to reliably adapt a model to a single reference image while still allowing fine‑grained control over attributes like emotion, style, or background. The authors introduce a lightweight plug‑in module (SEAL) that can be dropped into existing text‑to‑image pipelines and a new, richly annotated sticker dataset (StickerBench) that makes systematic testing possible.
Key Contributions
- SEAL module – an architecture‑agnostic adaptation layer that mitigates visual entanglement (over‑fitting to the reference image) and structural rigidity in test‑time fine‑tuning (TTF) for single‑image personalization.
- Three novel components within SEAL:
- Semantic‑guided Spatial Attention Loss – forces the model to focus on the target object’s semantics rather than background pixels.
- Split‑merge Token Strategy – separates identity tokens from context tokens during embedding adaptation, then recombines them for generation.
- Structure‑aware Layer Restriction – keeps updates away from the diffusion layers most responsible for spatial layout, preserving controllability.
- StickerBench dataset – over 30,000 sticker images annotated with a six‑attribute schema (Appearance, Emotion, Action, Camera Composition, Style, Background), providing a standardized benchmark for identity preservation vs. contextual flexibility (a hypothetical record sketch follows this list).
- Plug‑and‑play compatibility – SEAL works with any U‑Net‑based diffusion model (e.g., Stable Diffusion, Imagen) without requiring architectural changes.
- Empirical validation – extensive experiments show consistent gains in identity preservation (↑ 12 % CLIP‑ID score) while keeping prompt‑driven attribute control intact.
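The six‑attribute schema suggests a straightforward record layout. As a concrete illustration, a StickerBench‑style annotation might look like the following; the field names and values are hypothetical, since the released dataset's exact format is not described here.

```python
# Hypothetical StickerBench-style annotation record. The six attribute
# fields follow the schema named in the paper; the JSON layout, field
# names, and example values are illustrative, not the dataset's actual format.
sticker_annotation = {
    "image": "stickers/bear_0421.png",  # illustrative file path
    "tags": {
        "appearance": "brown bear with a red scarf",
        "emotion": "joyful",
        "action": "waving",
        "camera_composition": "upper-body, frontal view",
        "style": "flat cartoon",
        "background": "plain white",
    },
}
```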
Methodology
- Embedding Adaptation – When a user supplies a single sticker image, its CLIP image embedding is extracted. SEAL injects three regularizers during the short TTF step (sketched in code after this list):
- Semantic‑guided Spatial Attention Loss computes a spatial attention map from a pretrained segmentation model and penalizes changes to background regions.
- Split‑merge Token Strategy splits the embedding into “identity” and “context” sub‑vectors, updates them separately, then merges them back, preventing the model from conflating the two.
- Structure‑aware Layer Restriction freezes diffusion layers that mainly encode global layout, only allowing lower‑level layers (responsible for texture) to adapt.
- Training / Fine‑tuning – SEAL does not require any extra training data; it operates entirely at test time, typically for 10–20 gradient steps.
- Evaluation with StickerBench – The dataset’s structured tags let the authors generate variations (e.g., same character with different emotions) and measure two metrics:
- Identity Preservation (CLIP‑ID similarity between generated and reference stickers).
- Contextual Controllability (accuracy of attribute prediction from generated images).
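To make the three regularizers concrete, here is a minimal sketch of one TTF step. It assumes a Diffusers‑style `UNet2DConditionModel` backbone; the layer partition (up blocks = texture), the identity/context split, and the foreground‑weighted loss are illustrative simplifications, not the paper's exact formulation. In particular, the paper supervises spatial attention maps via an external segmenter, which is approximated here by masking the denoising loss.

```python
# Minimal sketch of the three SEAL-style regularizers in one TTF step.
# Assumptions: a Diffusers-style UNet2DConditionModel backbone; "up_blocks"
# stand in for the texture-level layers; `fg_mask` is a foreground mask at
# latent resolution from an external pretrained segmenter. None of these
# choices are confirmed by the paper.
import torch
import torch.nn.functional as F


def restrict_layers(unet):
    # (3) Structure-aware layer restriction: freeze layers that mainly
    # encode global layout; adapt only texture-level layers (assumed here
    # to be the decoder/up blocks).
    for name, param in unet.named_parameters():
        param.requires_grad = name.startswith("up_blocks")


def seal_loss(unet, id_tokens, ctx_tokens, latents, timestep, noise,
              fg_mask, bg_weight=0.1):
    # (2) Split-merge token strategy: identity and context sub-vectors are
    # kept as separate trainable leaves and merged only for the forward
    # pass, so their updates cannot conflate the two roles.
    merged = torch.cat([id_tokens, ctx_tokens], dim=1)
    noise_pred = unet(latents, timestep, encoder_hidden_states=merged).sample

    # (1) Semantic-guided spatial supervision, simplified here as a
    # foreground-weighted denoising loss: background pixels contribute only
    # weakly, so adaptation focuses on the subject. (The paper supervises
    # spatial attention maps directly.)
    per_pixel = F.mse_loss(noise_pred, noise, reduction="none")
    weights = fg_mask + bg_weight * (1.0 - fg_mask)
    return (per_pixel * weights).mean()
```

A driver would call `restrict_layers(unet)` once, build an optimizer over `id_tokens`, `ctx_tokens`, and the unfrozen U‑Net parameters, and run the 10–20 gradient steps noted above.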
Results & Findings
| Metric | Baseline TTF | TTF + SEAL |
|---|---|---|
| CLIP‑ID similarity (higher = better) | 0.71 | 0.80 (+12 %) |
| Attribute accuracy (average across 6 tags) | 0.84 | 0.83 (≈ no loss) |
| Visual Entanglement (background leakage) | 18 % of samples | 5 % |
| Structural Rigidity (failure to follow pose prompts) | 22 % | 9 % |
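The CLIP‑ID number in the table is plausibly the cosine similarity between CLIP image embeddings of the reference and generated stickers. A minimal sketch of such a score, using the Hugging Face `transformers` CLIP implementation (the paper's exact CLIP‑ID protocol may differ):

```python
# Sketch of a CLIP-based identity score: cosine similarity between CLIP
# image embeddings of the reference and generated stickers, using the
# Hugging Face `transformers` CLIP implementation. The paper's exact
# CLIP-ID protocol (backbone, crops, aggregation) may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_id_score(reference_path: str, generated_path: str) -> float:
    images = [Image.open(reference_path), Image.open(generated_path)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize
    return float(feats[0] @ feats[1])  # cosine similarity in [-1, 1]
```

Attribute accuracy would be measured separately, e.g., by a classifier predicting the six StickerBench tags from generated images, as described in the evaluation setup above.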
Key takeaways
- Identity stays recognizable even when prompts drastically change style, background, or camera angle.
- Control over attributes is preserved; SEAL does not sacrifice the flexibility that diffusion models are prized for.
- The three components act synergistically—ablation studies show each contributes roughly 3–5 % of the total gain.
Practical Implications
- Developer‑ready personalization – SEAL can be added to existing diffusion APIs (e.g., Hugging Face Diffusers) with a few lines of code (see the sketch after this list), enabling SaaS platforms to offer “single‑image sticker creator” features without massive fine‑tuning pipelines.
- Reduced compute cost – Because SEAL only updates a subset of layers for a handful of steps, the overhead is negligible (≈ 0.2 s on an RTX 3080 per sticker).
- Better brand consistency – Companies can generate on‑brand stickers from a single logo or mascot while still allowing users to request different moods, actions, or backgrounds, keeping the core visual identity intact.
- Dataset as a benchmark – StickerBench can serve as a standard test suite for any future personalization method, encouraging reproducibility and fair comparison.
- Potential extensions – The same semantic‑aware adaptation ideas could be applied to other single‑image domains such as avatars, emojis, or UI icons.
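As an illustration of the plug‑and‑play claim, a hypothetical integration with Hugging Face Diffusers might look like the following. The pipeline calls are real Diffusers APIs; `seal_adapt` is an invented placeholder for the test‑time adaptation step, since no official SEAL API is described here.

```python
# Hypothetical integration with Hugging Face Diffusers. The pipeline calls
# are real Diffusers APIs; `seal_adapt` is an invented placeholder for the
# 10-20-step test-time adaptation, since no official SEAL API is described.
import torch
from diffusers import StableDiffusionPipeline
from PIL import Image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

reference = Image.open("my_mascot.png")  # single reference sticker

# seal_adapt(pipe.unet, pipe.text_encoder, reference)  # hypothetical hook

image = pipe("my mascot, surprised expression, beach background").images[0]
image.save("personalized_sticker.png")
```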
Limitations & Future Work
- Domain specificity – SEAL is evaluated only on sticker‑style graphics; performance on highly photorealistic or 3D objects remains untested.
- Reliance on external segmentation – The spatial attention loss depends on a pre‑trained segmentation model; errors in segmentation could propagate to the adaptation step.
- Attribute granularity – While StickerBench covers six high‑level tags, finer‑grained control (e.g., subtle facial expressions) may still be challenging.
- Future directions suggested by the authors include: extending SEAL to multi‑modal references (e.g., video clips), integrating learned attention maps instead of external segmenters, and exploring larger‑scale user studies to quantify perceived quality in real‑world sticker creation tools.
Authors
- Changhyun Roh
- Yonghyun Jeong
- Jonghyun Lee
- Chanho Eom
- Jihyong Oh
Paper Information
- arXiv ID: 2604.26883v1
- Categories: cs.CV
- Published: April 29, 2026