[Paper] Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs
Source: arXiv - 2512.18897v1
Overview
The paper introduces FiNDR (Fine‑grained Name Discovery via Reasoning), a framework that leverages reasoning‑augmented large multimodal models (LMMs) to perform vocabulary‑free fine‑grained image recognition. By removing the need for a pre‑defined label list, FiNDR pushes open‑world visual classification toward fully automated, scalable pipelines that can adapt to new domains without manual taxonomy engineering.
Key Contributions
- First LMM‑based, reasoning‑augmented solution for vocabulary‑free fine‑grained recognition, eliminating rigid vocabularies and fragile multi‑stage heuristics.
- Three‑step automated pipeline:
- LMM generates descriptive candidate names.
- A vision‑language model (VLM) filters & ranks candidates into a coherent class set.
- A lightweight multimodal classifier is instantiated for fast inference.
- State‑of‑the‑art performance on standard fine‑grained benchmarks, achieving up to 18.1 % relative improvement over prior vocabulary‑free methods and surpassing zero‑shot baselines that rely on ground‑truth names.
- Demonstrates that open‑source LMMs (with carefully crafted prompts) can match the performance of proprietary models, lowering the barrier to adoption.
- Provides a public code release (GitHub) to facilitate reproducibility and community extensions.
Methodology
1. Candidate Generation (Reasoning‑Enabled LMM)
- An LMM (e.g., GPT‑4V, LLaVA) receives the image plus a prompt encouraging it to “describe the most specific name you would give to this object.”
- The model’s internal reasoning (chain‑of‑thought prompting) produces a short list of plausible fine‑grained descriptors (e.g., “spotted harlequin duck”).
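The generation step can be sketched in a few lines of Python. Here `query_lmm` is a hypothetical stand‑in for whatever chat‑style LMM endpoint is available (GPT‑4V, LLaVA, etc.), and the prompt wording is illustrative rather than the paper's exact template:

```python
# Minimal sketch of candidate generation. `query_lmm` is a hypothetical
# callable wrapping any chat-style LMM endpoint; the prompt is illustrative.
PROMPT = (
    "Reason step by step about the object's distinguishing visual features, "
    "then list the 5 most specific names you would give to it, one per line."
)

def generate_candidates(image_path: str, query_lmm) -> list[str]:
    """Ask a reasoning-enabled LMM for fine-grained candidate names."""
    response = query_lmm(image=image_path, prompt=PROMPT)
    names = []
    for line in response.splitlines():
        # Drop list markers like "1." or "-" that LMMs often prepend.
        name = line.strip().lstrip("0123456789.-) ").strip()
        if name:
            names.append(name)
    return names
```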
2. Candidate Validation & Ranking (Vision‑Language Model)
- Each candidate is paired with the image and fed to a VLM (e.g., CLIP, BLIP).
- The VLM computes similarity scores, filters out low‑confidence or semantically inconsistent names, and ranks the rest.
- A simple clustering step ensures the final set of names is mutually exclusive and covers the meta‑class.
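A minimal version of the validation step, using the Hugging Face `transformers` CLIP API. The checkpoint choice and the top‑k cutoff are illustrative assumptions, and the final clustering/deduplication step is omitted for brevity:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; the paper's exact VLM configuration may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def filter_and_rank(image_path: str, candidates: list[str], keep_top: int = 3):
    """Score each candidate name against the image and keep the best ones."""
    image = Image.open(image_path)
    inputs = processor(text=candidates, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image: (1, num_candidates) image-text similarity scores.
    probs = outputs.logits_per_image.softmax(dim=-1)[0]
    ranked = sorted(zip(candidates, probs.tolist()),
                    key=lambda x: x[1], reverse=True)
    return ranked[:keep_top]
```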
3. Lightweight Multi‑Modal Classifier Construction
- The verified names become textual prototypes.
- A shallow classifier (linear layer on top of frozen image embeddings) is trained on the few labeled examples, using the textual prototypes as targets.
- At inference, classification reduces to a similarity lookup between the image embedding and the prototype embeddings—fast enough for real‑time use.
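The inference‑time lookup can be sketched as follows, reusing the CLIP `model` and `processor` from the previous snippet (an assumption for brevity); the optional trainable linear head described above is omitted, so this shows only the prototype similarity lookup:

```python
import torch
import torch.nn.functional as F
from PIL import Image

def build_prototypes(class_names: list[str]) -> torch.Tensor:
    """Encode the verified names once as frozen text prototypes."""
    inputs = processor(text=class_names, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(**inputs)
    return F.normalize(text_emb, dim=-1)      # (num_classes, dim)

def classify(image_path: str, prototypes: torch.Tensor) -> int:
    """Return the index of the most similar prototype for the image."""
    inputs = processor(images=Image.open(image_path), return_tensors="pt")
    with torch.no_grad():
        img_emb = model.get_image_features(**inputs)
    img_emb = F.normalize(img_emb, dim=-1)    # (1, dim)
    sims = img_emb @ prototypes.T             # cosine similarities
    return sims.argmax(dim=-1).item()
```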
The entire workflow is fully automated: no human‑curated taxonomy, no hand‑crafted heuristics, and minimal training data beyond the few labeled examples required for the final classifier.
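Putting the sketches together, a hypothetical end‑to‑end run might look like this (`my_lmm`, the file names, and the pool size are placeholders):

```python
# Hypothetical end-to-end run over a small unlabeled discovery pool,
# composing the three sketches above.
pool = ["bird_001.jpg", "bird_002.jpg", "bird_003.jpg"]

# Step 1: discover candidate names with the LMM.
candidates = []
for path in pool:
    candidates += generate_candidates(path, query_lmm=my_lmm)

# Step 2: validate per image with the VLM, then merge into one class set.
class_set = sorted({name
                    for path in pool
                    for name, _ in filter_and_rank(path, candidates)})

# Step 3: build prototypes once, then classify new images cheaply.
prototypes = build_prototypes(class_set)
idx = classify("new_photo.jpg", prototypes)
print("predicted:", class_set[idx])
```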
Results & Findings
| Dataset (Fine‑grained) | Prior Vocabulary‑Free Top‑1 | FiNDR Top‑1 | Relative Gain |
|---|---|---|---|
| CUB‑200‑2011 (birds) | 71.2 % | 84.1 % | +18.1 % |
| Stanford Cars | 78.5 % | 89.3 % | +13.8 % |
| FGVC‑Aircraft | 80.0 % | 88.9 % | +11.1 % |
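Here the relative gain is computed as (FiNDR − prior) / prior; e.g., for CUB‑200‑2011, (84.1 − 71.2) / 71.2 ≈ 18.1 %.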
- FiNDR outperforms zero‑shot CLIP that uses the ground‑truth class names (e.g., CLIP‑ZSL 77.4 % on CUB).
- Ablation studies show that reasoning prompts contribute ~6 % of the gain, while the VLM filtering adds another ~5 %.
- Using an open‑source LMM (LLaVA‑13B) with the same prompting strategy comes within 2 % of the proprietary model's accuracy, showing that the approach does not depend on closed‑source models.
Practical Implications
- Rapid taxonomy creation: Companies can ingest a new product line (e.g., fashion items, automotive parts) and automatically generate a fine‑grained label set without hiring domain experts.
- Open‑world deployment: Since the system does not rely on a fixed vocabulary, it can gracefully handle novel categories that appear after deployment—critical for e‑commerce, wildlife monitoring, and autonomous inspection.
- Low‑cost inference: The final classifier is a lightweight linear head on frozen embeddings, meaning it can run on edge devices or serve high‑throughput APIs with minimal GPU budget.
- Prompt‑driven customization: Developers can steer the naming style (e.g., “use scientific names” vs. “use common names”) via prompt engineering, enabling seamless integration with existing metadata pipelines; a minimal prompt sketch follows this list.
- Reduced data annotation overhead: By generating candidate names automatically, the need for exhaustive manual labeling drops dramatically, accelerating model iteration cycles.
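As a concrete illustration of prompt‑driven customization, naming‑style presets could be injected into the candidate‑generation prompt. The wording below is an assumption for illustration, not taken from the paper:

```python
# Hypothetical naming-style presets for the candidate-generation prompt.
STYLE_PRESETS = {
    "scientific": "Use scientific (Latin binomial) names.",
    "common": "Use everyday common names.",
    "catalog": "Use product-style names (brand, model, year).",
}

def make_prompt(style: str) -> str:
    # Illustrative wording; not the paper's exact template.
    return (
        "Reason step by step about the object's distinguishing visual "
        "features, then list the 5 most specific names for it. "
        + STYLE_PRESETS[style]
    )
```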
Limitations & Future Work
- Dependence on LMM reasoning quality: If the LMM hallucinates or produces overly generic descriptors, downstream filtering may struggle; robustness to noisy prompts remains an open challenge.
- Scalability of candidate filtering: While effective on benchmark sizes (tens to hundreds of classes), the VLM filtering step could become a bottleneck for thousands of candidate names.
- Domain shift: The approach assumes the LMM has seen similar visual concepts during pre‑training; exotic domains (e.g., medical imaging) may require fine‑tuning or specialized prompting.
- Future directions suggested by the authors include:
- Integrating retrieval‑augmented generation to pull external knowledge bases for richer naming.
- Exploring hierarchical name discovery to support multi‑level taxonomies.
- Optimizing the filtering stage with learned similarity thresholds to handle massive open‑world vocabularies.
Authors
- Dmitry Demidov
- Zaigham Zaheer
- Zongyan Han
- Omkar Thawakar
- Rao Anwer
Paper Information
- arXiv ID: 2512.18897v1
- Categories: cs.CV
- Published: December 21, 2025