[Paper] Visual Word Sense Disambiguation with CLIP through Dual-Channel Text Prompting and Image Augmentations
Source: arXiv - 2602.06799v1
Overview
This paper tackles a surprisingly practical problem: when a word has multiple meanings, can we pick the right image that matches the intended sense?
The authors build a lightweight Visual Word Sense Disambiguation (VWSD) system that sits on top of CLIP, enriches the text side with clever prompts, and applies modest image augmentations at inference time. On the SemEval‑2023 VWSD benchmark they push mean reciprocal rank (MRR) from 0.72 to 0.76 and improve Hit@1 by roughly four absolute points, all with a model that runs in real time.
Key Contributions
- Dual‑channel text prompting: combines a semantic channel (WordNet synonyms) with a photo‑style channel (phrases like “a photo of …”) to create richer CLIP‑compatible queries.
- Test‑time image augmentation pipeline: applies robust, low‑cost transforms (cropping, color jitter, flips) to each candidate image before embedding, smoothing out visual noise.
- Simple similarity‑based inference: uses cosine similarity in CLIP’s joint space to rank candidate images, avoiding any fine‑tuning of the massive CLIP backbone.
- Comprehensive ablations: show that the dual‑prompt design yields the bulk of the gain, while aggressive augmentations add only marginal improvements.
- Exploratory multilingual & definition‑based prompts: demonstrate that noisy external signals (e.g., full WordNet glosses, translations) can actually hurt performance, underscoring the value of concise, CLIP‑aligned prompts.
Methodology
- Base model – CLIP: The authors start with a pre‑trained CLIP (ViT‑B/32) that already maps text and images into a common vector space. No additional training of CLIP weights is performed.
- Text enrichment:
- Semantic channel: For an ambiguous word (e.g., “bank”), they retrieve its WordNet synonyms (e.g., “financial institution”, “river edge”).
- Photo channel: They prepend a visual cue (“a photo of …”) to each synonym, turning pure lexical items into image‑friendly phrases.
- Both channels are encoded separately; the resulting vectors are averaged to form the final text embedding.
- Image processing: Each candidate image is passed through a small set of light augmentations (random resized crop, horizontal flip, slight color jitter). The augmented views are encoded, and their embeddings are averaged, yielding a more stable image representation.
- Scoring: Cosine similarity between the enriched text vector and each image vector produces a ranking; the top‑ranked image is taken as the disambiguated sense.
- Evaluation: The system is tested on the SemEval‑2023 VWSD dataset, which provides a list of ambiguous words together with several candidate images per word. Standard VWSD metrics (MRR, Hit@1) are reported.
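The steps above can be sketched end to end. The real system uses a frozen CLIP ViT‑B/32 for both encoders; `encode_text` and `encode_image` below are hypothetical hash‑seeded stubs so the control flow runs without model weights, and the channel‑averaging, view‑averaging, and cosine ranking mirror the description in the paper.

```python
# Sketch of the inference pipeline with stand-in encoders. In the paper both
# encoders are a frozen CLIP ViT-B/32; the stubs here are deterministic
# pseudo-embeddings used only to make the pipeline runnable.
import zlib
import numpy as np

DIM = 512  # CLIP ViT-B/32 embedding width

def _unit(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

def encode_text(phrase: str) -> np.ndarray:
    # Stand-in for CLIP's text encoder (hash-seeded unit vector).
    rng = np.random.default_rng(zlib.crc32(phrase.encode()))
    return _unit(rng.standard_normal(DIM))

def encode_image(image: np.ndarray) -> np.ndarray:
    # Stand-in for CLIP's image encoder.
    rng = np.random.default_rng(zlib.crc32(image.tobytes()))
    return _unit(rng.standard_normal(DIM))

def dual_channel_text_embedding(synonyms: list[str]) -> np.ndarray:
    # Semantic channel: raw WordNet synonyms.
    sem = [encode_text(s) for s in synonyms]
    # Photo channel: the same synonyms wrapped in a visual cue.
    photo = [encode_text(f"a photo of {s}") for s in synonyms]
    # Average both channels into a single query vector.
    return _unit(np.mean(sem + photo, axis=0))

def augmented_image_embedding(image: np.ndarray, augmentations) -> np.ndarray:
    # Encode each augmented view (crop / flip / jitter in the paper), then average.
    views = [encode_image(aug(image)) for aug in augmentations]
    return _unit(np.mean(views, axis=0))

def rank_candidates(query: np.ndarray, image_vecs: list[np.ndarray]) -> np.ndarray:
    # Cosine similarity reduces to a dot product on unit vectors; best first.
    sims = np.array([query @ v for v in image_vecs])
    return np.argsort(-sims)
```

The top entry of `rank_candidates` is the disambiguated sense; swapping the stubs for real CLIP encoders leaves the rest of the code unchanged.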
Results & Findings
| Metric | Baseline (raw CLIP) | + Dual‑channel prompts | + Image augmentations | Full system |
|---|---|---|---|---|
| MRR | 0.7227 | 0.7493 | 0.7510 | 0.7590 |
| Hit@1 | 0.5810 | 0.6075 | 0.6140 | 0.6220 |
- Prompting is the star: the dual‑channel prompts alone account for most of the improvement (~2.7 points absolute MRR, 0.7227 → 0.7493).
- Augmentation is a modest booster: test‑time transforms add roughly another point of MRR on top of prompting (0.7493 → 0.7590 for the full system), confirming they help but are not the primary driver.
- Noisy signals hurt: Experiments with full WordNet definitions or multilingual synonym sets degrade performance, suggesting that CLIP prefers concise, visually grounded phrasing.
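Both reported metrics follow directly from the rank of the gold image in each instance's candidate list; a minimal implementation of the standard definitions:

```python
import numpy as np

def mrr(gold_ranks: list[int]) -> float:
    # Mean reciprocal rank: average of 1/rank of the correct image (1-based).
    return float(np.mean([1.0 / r for r in gold_ranks]))

def hit_at_1(gold_ranks: list[int]) -> float:
    # Fraction of instances where the correct image is ranked first.
    return float(np.mean([r == 1 for r in gold_ranks]))
```

For example, gold ranks `[1, 2, 4]` give MRR = (1 + 0.5 + 0.25) / 3 ≈ 0.583 and Hit@1 = 1/3.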
Practical Implications
- Search & recommendation: E‑commerce platforms can disambiguate user queries like “apple” (fruit vs. device) by matching to product images without training a custom vision model.
- Content moderation: Automated systems can better flag ambiguous text that references illicit imagery by grounding the meaning in visual candidates.
- Multimodal assistants: Voice assistants that need to fetch the “right picture” for a spoken word can plug this lightweight pipeline into existing CLIP‑based back‑ends.
- Low‑resource deployment: Because the approach only requires inference‑time operations on a frozen CLIP model, it runs on commodity GPUs or even on‑device accelerators with sub‑second latency.
Limitations & Future Work
- Dependence on CLIP’s pre‑training domain: Rare or highly specialized senses that CLIP never saw may still be mis‑ranked.
- Prompt engineering still manual: The dual‑channel prompts are hand‑crafted; an automated prompt‑generation or learned weighting could further improve robustness.
- Scalability to large candidate pools: The current setup evaluates a modest set of images per word; scaling to thousands of candidates would need efficient indexing (e.g., FAISS).
- Multilingual extension: Preliminary tests show noisy multilingual synonyms hurt performance; future work could explore language‑specific CLIP variants or cross‑lingual alignment techniques.
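On the scalability point: with candidate embeddings precomputed and L2‑normalized, exact search is a single matrix multiply, which a FAISS `IndexFlatIP` reproduces with optimized kernels (and IVF/HNSW indexes approximate for very large pools). A dependency‑free numpy sketch of the exact version:

```python
# Exact top-k cosine search over precomputed, L2-normalized image embeddings.
# A FAISS IndexFlatIP computes the same scores; approximate indexes (IVF, HNSW)
# would replace this for pools of thousands to millions of candidates.
import numpy as np

def top_k(query: np.ndarray, image_matrix: np.ndarray, k: int = 5) -> np.ndarray:
    # image_matrix: (n_candidates, dim), rows unit-normalized; query: (dim,).
    sims = image_matrix @ query                          # cosine on unit vectors
    idx = np.argpartition(-sims, min(k, len(sims) - 1))[:k]
    return idx[np.argsort(-sims[idx])]                   # exact order within top k
```

`argpartition` keeps the selection O(n) rather than the O(n log n) of a full sort, which matters once the candidate pool grows past a handful of images per word.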
Bottom line: By marrying a simple prompt‑engineering trick with test‑time image augmentations, the authors demonstrate that you can get a noticeable boost in visual word sense disambiguation without any heavy model retraining. For developers building multimodal products, this recipe offers an immediate, low‑cost way to make ambiguous language more concrete and actionable.
Authors
- Shamik Bhattacharya
- Daniel Perkins
- Yaren Dogan
- Vineeth Konjeti
- Sudarshan Srinivasan
- Edmon Begoli
Paper Information
- arXiv ID: 2602.06799v1
- Categories: cs.CL
- Published: February 6, 2026