[Paper] NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries
Source: arXiv - 2603.05446v1
Overview
The paper introduces NaiLIA, a multimodal retrieval system that lets users find nail‑design images by typing rich, natural‑language intent descriptions and optionally selecting colors from a palette. By tightly coupling dense textual cues with precise color queries, NaiLIA bridges the gap between vague fashion‑style language and the visual specifics that nail‑art enthusiasts care about.
Key Contributions
- Dense intent description handling – a novel dataset of 10.6 K nail‑design images annotated with long, multi‑aspect textual intents (e.g., “glittery rose‑gold base with pastel floral accents”).
- Palette‑aware retrieval – integration of zero‑to‑many user‑chosen colors, enabling fine‑grained matching of subtle hue nuances.
- Relaxed confidence‑based loss – a training objective that treats unlabeled images as “potentially relevant” with a confidence score, improving alignment between language, color, and visual features.
- Benchmark & evaluation – the first public benchmark for multimodal nail‑design retrieval, complete with cross‑cultural annotations and a thorough empirical comparison against standard vision‑language models.
Methodology
- Data collection – 10,625 nail‑design photos were gathered from a globally diverse user base. Over 200 annotators wrote dense intent descriptions covering base polish, embellishments, themes, and overall impressions.
- Dual‑branch encoder
  - Vision branch: a CNN/ViT backbone extracts a visual embedding for each image.
  - Text + palette branch: a transformer encoder ingests the intent description and a set of RGB color vectors (the palette). The colors are embedded via a small MLP and concatenated with the tokenized text before feeding into the transformer.
- Cross‑modal alignment – embeddings from both branches are projected into a shared space. Retrieval is performed by nearest‑neighbor search (dot‑product similarity) between a query embedding (text + palette) and image embeddings.
- Relaxed loss – instead of a hard binary label (“relevant / not relevant”), the authors assign each unlabeled image a confidence score based on its similarity to the query’s textual keywords. The loss pulls high‑confidence images toward the query embedding while still penalizing obvious mismatches.
- Training pipeline – the system is fine‑tuned on the collected dataset, using standard contrastive learning tricks (temperature scaling, hard negative mining) plus the confidence‑weighted loss.
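The confidence‑weighted objective described above can be sketched as a contrastive loss with soft targets. The NumPy snippet below is a minimal illustration, not the paper’s implementation: the function name, the temperature value, and the choice to normalize the confidences into a target distribution are all our own assumptions.

```python
import numpy as np

def confidence_weighted_loss(query_emb, image_embs, confidences, temperature=0.07):
    """Relaxed contrastive loss with soft relevance targets (toy sketch).

    Rather than a single hard positive, each image i carries a
    confidence c_i in [0, 1] that it matches the query. The target is
    the normalized confidence vector, and the loss is the cross-entropy
    between that target and the softmax over scaled query-image
    similarities.
    """
    sims = image_embs @ query_emb / temperature      # (N,) dot-product scores
    sims = sims - sims.max()                         # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum())    # log-softmax over images
    target = confidences / confidences.sum()         # soft relevance target
    return float(-(target * log_probs).sum())
```

With a one‑hot confidence vector this reduces to the standard contrastive (InfoNCE‑style) cross‑entropy; softer confidences let partially relevant, unlabeled images contribute to training without being treated as definite negatives.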
Results & Findings
| Model | Recall@10 | Recall@50 |
|---|---|---|
| CLIP (baseline) | 31.2 % | 58.7 % |
| BLIP‑2 | 34.5 % | 61.9 % |
| NaiLIA (proposed) | 42.8 % | 70.3 % |
- Palette queries boost performance: Adding even a single color improves Recall@10 by ~5 pts over text‑only queries.
- Confidence loss matters: Ablating the relaxed loss drops Recall@10 by ~3 pts, confirming its role in handling noisy, unlabeled data.
- Cross‑cultural robustness: Performance remains stable across subsets grouped by annotator nationality, indicating the model captures universal design semantics rather than over‑fitting to a single cultural style.
Practical Implications
- E‑commerce & recommendation engines – Online nail‑salon platforms can let shoppers type “elegant matte navy with silver studs” and pick a navy shade, instantly surfacing matching products or tutorials.
- Design tools for creators – Graphic‑design software can embed NaiLIA to suggest ready‑made nail‑art assets that fit a designer’s mood board and color palette, speeding up prototyping.
- Personalized marketing – Brands can analyze aggregated intent‑palette queries to spot emerging trends (e.g., a surge in “peach‑blush gradient with holographic flakes”) and adjust inventory or ad creatives accordingly.
- Low‑code integration – Because the retrieval reduces to a similarity search over pre‑computed embeddings, developers can plug NaiLIA into existing vector‑database stacks (e.g., Pinecone, Milvus) with minimal latency overhead.
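Because retrieval reduces to a dot‑product search over pre‑computed, normalized embeddings, the integration surface is small. The brute‑force index below is a hypothetical stand‑in (the class and method names are our own, not from the paper); swapping in a vector database such as Pinecone or Milvus changes where the vectors live, not the query logic.

```python
import numpy as np

class NailDesignIndex:
    """Minimal brute-force stand-in for a vector database.

    Embeddings are L2-normalized once at index time, so the
    dot product at query time equals cosine similarity.
    """

    def __init__(self):
        self.ids = []
        self.vecs = []

    def add(self, item_id, embedding):
        v = np.asarray(embedding, dtype=float)
        self.ids.append(item_id)
        self.vecs.append(v / np.linalg.norm(v))   # normalize at index time

    def search(self, query_emb, k=10):
        q = np.asarray(query_emb, dtype=float)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.vecs) @ q            # cosine similarities
        top = np.argsort(-sims)[:k]               # indices of k best matches
        return [(self.ids[i], float(sims[i])) for i in top]
```

In a production stack, `add` would become an upsert into the vector store and `search` a top‑k query against it, with the query embedding produced by the text‑plus‑palette encoder.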
Limitations & Future Work
- Domain specificity – The model is trained solely on nail‑design images; transferring to other fashion accessories (e.g., eye‑makeup, clothing) would require additional data.
- Palette granularity – While the system accepts multiple colors, it does not yet model spatial distribution (e.g., “gradient from left to right”). Incorporating positional color cues could further improve relevance.
- User intent ambiguity – Dense textual descriptions can still be subjective; the confidence‑based loss mitigates but does not eliminate mismatches when annotator intent diverges from visual reality.
- Scalability of annotation – Building similar datasets for new domains demands large‑scale, high‑quality human annotations, which may be costly. Future work could explore weak supervision or synthetic caption generation to reduce this burden.
Bottom line: NaiLIA demonstrates that marrying rich natural‑language intent with precise color selection yields a powerful, industry‑ready retrieval system for visual design assets. For developers building next‑generation fashion‑tech experiences, the paper offers both a concrete architecture and a publicly available benchmark to kick‑start multimodal search projects.
Authors
- Kanon Amemiya
- Daichi Yashima
- Kei Katsumata
- Takumi Komatsu
- Ryosuke Korekata
- Seitaro Otsuki
- Komei Sugiura
Paper Information
- arXiv ID: 2603.05446v1
- Categories: cs.CV
- Published: March 5, 2026