[Paper] Seeing Through Words: Controlling Visual Retrieval Quality with Language Models
Source: arXiv - 2602.21175v1
Overview
Text‑to‑image retrieval systems have improved dramatically, but they still stumble when users type ultra‑short, vague queries like “dog” or “sunset”. Those one‑ or two‑word prompts leave the model guessing which visual details matter and give users no way to ask for higher‑quality results. The paper Seeing Through Words: Controlling Visual Retrieval Quality with Language Models proposes a simple but powerful fix: let a large language model (LLM) flesh out the short query into a richer description, and let the user steer that description toward a desired quality level.
Key Contributions
- Quality‑controllable query expansion – a generic framework that augments terse queries with fine‑grained visual attributes (pose, lighting, composition, etc.) while respecting a user‑specified quality tier.
- LLM‑driven completion conditioned on discretized quality levels – the language model receives both the original query and a “quality token” (e.g., high‑quality, medium‑quality) and generates a detailed caption that reflects that level.
- Plug‑and‑play compatibility – the method works on top of any pretrained vision‑language model (CLIP, BLIP, etc.) without retraining or architectural changes.
- Transparent, interpretable output – the enriched query is human‑readable, so users can see exactly what the system is asking the image encoder to match.
- Empirical gains – across several benchmark datasets the approach improves recall@k by up to 12 % and enables reliable quality steering, as shown by user studies and automatic aesthetic metrics.
Methodology
- Quality discretization – the authors first run two off‑the‑shelf scorers on the image corpus: a relevance model (how well an image matches the original query) and an aesthetic model (photo quality). Images are bucketed into a small set of quality levels (e.g., low, mid, high).
- Prompt construction for the LLM – given a user’s short query q and a target quality level c, they build a prompt such as: “Complete the following image description for a high‑quality photo of ‘sunset’:”. The LLM (GPT‑2/3‑style) then generates a longer, attribute‑rich sentence (e.g., “a vibrant orange‑red sunset over a calm lake, with silhouetted mountains and a golden‑hour glow”).
- Retrieval with a frozen VLM – the expanded description is encoded by the existing vision‑language model, and standard similarity search (e.g., dot‑product) retrieves the top‑k images. No fine‑tuning of the VLM is required.
- Iterative control – users can switch the quality token and re‑run the same pipeline, instantly shifting the retrieved set toward higher or lower aesthetic standards.
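The four steps above can be sketched end to end. The snippet below is a minimal illustration, not the authors’ code: the LLM and the text encoder are passed in as plain callables (in a real pipeline they would be an instruction‑following LLM and a frozen CLIP‑style encoder), and all function names are our own.

```python
# Sketch of the quality-controllable retrieval pipeline. The encoders and
# the LLM are stubbed as callables; in practice they would be a frozen
# vision-language model and any instruction-following LLM.
import numpy as np

QUALITY_LEVELS = ("low-quality", "medium-quality", "high-quality")

def bucket_quality(aesthetic_scores, n_levels=3):
    """Discretize per-image aesthetic scores into quality levels via quantiles."""
    edges = np.quantile(aesthetic_scores, np.linspace(0, 1, n_levels + 1)[1:-1])
    return np.digitize(aesthetic_scores, edges)  # 0 = low ... n_levels-1 = high

def build_prompt(query, quality_level):
    """Condition the LLM on the short query plus a discrete quality token."""
    return (f"Complete the following image description for a "
            f"{QUALITY_LEVELS[quality_level]} photo of \"{query}\":")

def expand_query(query, quality_level, llm):
    """LLM turns the terse query into an attribute-rich description."""
    return llm(build_prompt(query, quality_level))

def retrieve(expanded_query, text_encoder, image_embeddings, k=10):
    """Dot-product search against precomputed (frozen-VLM) image embeddings."""
    q = np.asarray(text_encoder(expanded_query), dtype=float)
    q = q / np.linalg.norm(q)
    scores = image_embeddings @ q
    return np.argsort(-scores)[:k]
```

Iterative control then amounts to calling `expand_query` again with a different quality level and re‑running `retrieve`; nothing in the VLM changes between runs.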
Results & Findings
| Dataset | Baseline (CLIP) R@10 | QC‑QC (proposed) R@10 | Δ (pp) |
|---|---|---|---|
| MS‑COCO (short queries) | 38.2 % | 45.9 % | +7.7 |
| Flickr30k (single‑word queries) | 31.5 % | 38.1 % | +6.6 |
- Quality steering works: When the quality token is set to high, retrieved images score 0.42 higher on the aesthetic predictor (on a 0‑1 scale) compared to the baseline; low quality tokens produce the opposite trend.
- Human evaluation: In a 200‑image user study, participants preferred the QC‑QC results 68 % of the time, citing clearer composition and better relevance to the expanded description.
- Zero‑training advantage: Because the VLM stays frozen, the method adds < 0.5 GB of extra parameters and runs in real‑time (< 30 ms per query on a single GPU).
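Recall@k, the metric in the table above, counts a query as a hit when at least one relevant image appears in the top‑k retrieved list. A minimal reference implementation (our own, for illustration):

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of queries whose top-k retrieved list contains a relevant image.

    ranked_ids:   per-query list of retrieved image ids, best first.
    relevant_ids: per-query collection of ground-truth relevant image ids.
    """
    hits = sum(1 for ranked, relevant in zip(ranked_ids, relevant_ids)
               if set(ranked[:k]) & set(relevant))
    return hits / len(ranked_ids)
```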
Practical Implications
- Search engines & e‑commerce – Shoppers typing “dress” can instantly ask for “high‑quality, front‑view, silk dress” without manually adding adjectives, leading to more satisfying product listings.
- Creative tools – Designers using text‑to‑image generators can pre‑filter results by quality, reducing the time spent sifting through low‑resolution or poorly composed outputs.
- Content moderation – Platforms can enforce a minimum aesthetic threshold for user‑generated images, helping maintain visual standards.
- Rapid prototyping – Because the approach is model‑agnostic, teams can plug it into existing CLIP‑based retrieval pipelines with a few lines of code, gaining immediate performance lifts.
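Because only the query string changes, integrating the idea into an existing pipeline really can be a few lines: wrap the current search call with a query‑rewriting step. A hypothetical wrapper (names are ours, not the released API):

```python
# Hypothetical drop-in wrapper: the only change to an existing retrieval
# pipeline is rewriting the query string before it is encoded and searched.
def quality_controlled_search(query, quality, llm, search_fn):
    """Rewrite `query` at the requested quality tier, then delegate to the
    unchanged downstream search function."""
    prompt = (f"Complete the following image description for a "
              f"{quality} photo of \"{query}\":")
    return search_fn(llm(prompt))
```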
Limitations & Future Work
- Reliance on LLM quality – The richness of the expanded query depends on the language model’s knowledge; rare or domain‑specific terms may be poorly elaborated.
- Discrete quality buckets – The current three‑level scheme may be too coarse for nuanced applications; learning a continuous quality embedding could improve granularity.
- Scalability of scoring models – The relevance and aesthetic scorers need to be run on the whole image corpus to assign quality levels, which can be costly for very large datasets.
- User study scope – The human evaluation covered a limited set of categories; broader user testing across languages and cultures is left for future work.
The authors have released their code (https://github.com/Jianglin954/QCQC), making it easy for developers to experiment with quality‑controllable retrieval in their own projects.
Authors
- Jianglin Lu
- Simon Jenni
- Kushal Kafle
- Jing Shi
- Handong Zhao
- Yun Fu
Paper Information
- arXiv ID: 2602.21175v1
- Categories: cs.CV
- Published: February 24, 2026