[Paper] Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs
Source: arXiv - 2512.18897v1
Overview
The paper introduces FiNDR (Fine‑grained Name Discovery via Reasoning), a framework that leverages reasoning‑augmented large multimodal models (LMMs) to perform vocabulary‑free fine‑grained image recognition. By removing the need for a pre‑defined label list, FiNDR pushes open‑world visual classification toward fully automated, scalable pipelines that can adapt to new domains without manual taxonomy engineering.
Key Contributions
- First LMM‑based, reasoning‑augmented solution for vocabulary‑free fine‑grained recognition, eliminating rigid vocabularies and fragile multi‑stage heuristics.
- Three‑step automated pipeline:
- LMM generates descriptive candidate names.
- A vision‑language model (VLM) filters & ranks candidates into a coherent class set.
- A lightweight multimodal classifier is instantiated for fast inference.
- State‑of‑the‑art performance on standard fine‑grained benchmarks, achieving up to 18.1 % relative improvement over prior vocabulary‑free methods and surpassing zero‑shot baselines that rely on ground‑truth names.
- Demonstrates that open‑source LMMs (with carefully crafted prompts) can match the performance of proprietary models, lowering the barrier to adoption.
- Provides a public code release (GitHub) to facilitate reproducibility and community extensions.
Methodology
1. Candidate Generation (Reasoning‑Enabled LMM)
- An LMM (e.g., GPT‑4V, LLaVA) receives the image plus a prompt encouraging it to “describe the most specific name you would give to this object.”
- The model’s internal reasoning (chain‑of‑thought prompting) produces a short list of plausible fine‑grained descriptors (e.g., “spotted harlequin duck”).
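The generation step can be sketched in a few lines of Python. Here `query_lmm` is a hypothetical stand‑in for whatever chat‑style LMM endpoint is available (GPT‑4V, LLaVA, etc.), and the prompt wording is illustrative rather than the paper's exact template:

```python
# Minimal sketch of candidate generation. `query_lmm` is a hypothetical
# callable wrapping any chat-style LMM endpoint; the prompt is illustrative.
PROMPT = (
    "Reason step by step about the object's distinguishing visual features, "
    "then list the 5 most specific names you would give to it, one per line."
)

def generate_candidates(image_path: str, query_lmm) -> list[str]:
    """Ask a reasoning-enabled LMM for fine-grained candidate names."""
    response = query_lmm(image=image_path, prompt=PROMPT)
    names = []
    for line in response.splitlines():
        # Drop list markers like "1." or "-" that LMMs often prepend.
        name = line.strip().lstrip("0123456789.-) ").strip()
        if name:
            names.append(name)
    return names
```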
2. Candidate Validation & Ranking (Vision‑Language Model)
- Each candidate is paired with the image and fed to a VLM (e.g., CLIP, BLIP).
- The VLM computes similarity scores, filters out low‑confidence or semantically inconsistent names, and ranks the rest.
- A simple clustering step ensures the final set of names is mutually exclusive and covers the meta‑class.
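A minimal version of the validation step, using the Hugging Face `transformers` CLIP API. The checkpoint choice and the top‑k cutoff are illustrative assumptions, and the final clustering/deduplication step is omitted for brevity:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; the paper's exact VLM configuration may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def filter_and_rank(image_path: str, candidates: list[str], keep_top: int = 3):
    """Score each candidate name against the image and keep the best ones."""
    image = Image.open(image_path)
    inputs = processor(text=candidates, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image: (1, num_candidates) image-text similarity scores.
    probs = outputs.logits_per_image.softmax(dim=-1)[0]
    ranked = sorted(zip(candidates, probs.tolist()),
                    key=lambda x: x[1], reverse=True)
    return ranked[:keep_top]
```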
3. Lightweight Multi‑Modal Classifier Construction
- The verified names become textual prototypes.
- A shallow classifier (linear layer on top of frozen image embeddings) is trained on the few labeled examples, using the textual prototypes as targets.
- At inference, classification reduces to a similarity lookup between the image embedding and the prototype embeddings—fast enough for real‑time use.
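The inference‑time lookup can be sketched as follows, reusing the CLIP `model` and `processor` from the previous snippet (an assumption for brevity); the optional trainable linear head described above is omitted, so this shows only the prototype similarity lookup:

```python
import torch
import torch.nn.functional as F
from PIL import Image

def build_prototypes(class_names: list[str]) -> torch.Tensor:
    """Encode the verified names once as frozen text prototypes."""
    inputs = processor(text=class_names, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(**inputs)
    return F.normalize(text_emb, dim=-1)      # (num_classes, dim)

def classify(image_path: str, prototypes: torch.Tensor) -> int:
    """Return the index of the most similar prototype for the image."""
    inputs = processor(images=Image.open(image_path), return_tensors="pt")
    with torch.no_grad():
        img_emb = model.get_image_features(**inputs)
    img_emb = F.normalize(img_emb, dim=-1)    # (1, dim)
    sims = img_emb @ prototypes.T             # cosine similarities
    return sims.argmax(dim=-1).item()
```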
The entire workflow is fully automated: no human‑curated taxonomy, no hand‑crafted heuristics, and minimal training data beyond the few labeled examples required for the final classifier.
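Putting the sketches together, a hypothetical end‑to‑end run might look like this (`my_lmm`, the file names, and the pool size are placeholders):

```python
# Hypothetical end-to-end run over a small unlabeled discovery pool,
# composing the three sketches above.
pool = ["bird_001.jpg", "bird_002.jpg", "bird_003.jpg"]

# Step 1: discover candidate names with the LMM.
candidates = []
for path in pool:
    candidates += generate_candidates(path, query_lmm=my_lmm)

# Step 2: validate per image with the VLM, then merge into one class set.
class_set = sorted({name
                    for path in pool
                    for name, _ in filter_and_rank(path, candidates)})

# Step 3: build prototypes once, then classify new images cheaply.
prototypes = build_prototypes(class_set)
idx = classify("new_photo.jpg", prototypes)
print("predicted:", class_set[idx])
```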
Results & Findings
| Dataset (Fine‑grained) | Prior Vocabulary‑Free Top‑1 | FiNDR Top‑1 | Relative Gain |
|---|---|---|---|
| CUB‑200‑2011 (birds) | 71.2 % | 84.1 % | +18.1 % |
| Stanford Cars | 78.5 % | 89.3 % | +13.8 % |
| FGVC‑Aircraft | 80.0 % | 88.9 % | +11.1 % |
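Here the relative gain is computed as (FiNDR − prior) / prior; e.g., for CUB‑200‑2011, (84.1 − 71.2) / 71.2 ≈ 18.1 %.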
- FiNDR outperforms zero‑shot CLIP that uses the ground‑truth class names (e.g., CLIP‑ZSL 77.4 % on CUB).
- Ablation studies show that reasoning prompts contribute ~6 % of the gain, while the VLM filtering adds another ~5 %.
- Using an open‑source LMM (LLaVA‑13B) with the same prompting strategy comes within 2 % of the proprietary model's accuracy, showing that the approach does not depend on closed‑source models.
Practical Implications
- Rapid taxonomy creation: Companies can ingest a new product line (e.g., fashion items, automotive parts) and automatically generate a fine‑grained label set without hiring domain experts.
- Open‑world deployment: Since the system does not rely on a fixed vocabulary, it can gracefully handle novel categories that appear after deployment—critical for e‑commerce, wildlife monitoring, and autonomous inspection.
- Low‑cost inference: The final classifier is a lightweight linear head on frozen embeddings, meaning it can run on edge devices or serve high‑throughput APIs with minimal GPU budget.
- Prompt‑driven customization: Developers can steer the naming style (e.g., “use scientific names” vs. “use common names”) via prompt engineering, enabling seamless integration with existing metadata pipelines; a minimal prompt sketch follows this list.
- Reduced data annotation overhead: By generating candidate names automatically, the need for exhaustive manual labeling drops dramatically, accelerating model iteration cycles.
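As a concrete illustration of prompt‑driven customization, naming‑style presets could be injected into the candidate‑generation prompt. The wording below is an assumption for illustration, not taken from the paper:

```python
# Hypothetical naming-style presets for the candidate-generation prompt.
STYLE_PRESETS = {
    "scientific": "Use scientific (Latin binomial) names.",
    "common": "Use everyday common names.",
    "catalog": "Use product-style names (brand, model, year).",
}

def make_prompt(style: str) -> str:
    # Illustrative wording; not the paper's exact template.
    return (
        "Reason step by step about the object's distinguishing visual "
        "features, then list the 5 most specific names for it. "
        + STYLE_PRESETS[style]
    )
```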
Limitations & Future Work
- Dependence on LMM reasoning quality: If the LMM hallucinates or produces overly generic descriptors, downstream filtering may struggle; robustness to noisy prompts remains an open challenge.
- Scalability of candidate filtering: While effective on benchmark sizes (tens to hundreds of classes), the VLM filtering step could become a bottleneck for thousands of candidate names.
- Domain shift: The approach assumes the LMM has seen similar visual concepts during pre‑training; exotic domains (e.g., medical imaging) may require fine‑tuning or specialized prompting.
- Future directions suggested by the authors include:
- Integrating retrieval‑augmented generation to pull external knowledge bases for richer naming.
- Exploring hierarchical name discovery to support multi‑level taxonomies.
- Optimizing the filtering stage with learned similarity thresholds to handle massive open‑world vocabularies.
Authors
- Dmitry Demidov
- Zaigham Zaheer
- Zongyan Han
- Omkar Thawakar
- Rao Anwer
Paper Information
- arXiv ID: 2512.18897v1
- Categories: cs.CV
- Published: December 21, 2025