Detecting Objects in Images from Any Text Prompt (Not Fixed Classes)
Source: Dev.to
Background
Most object detection systems assume a fixed label set: you train a model on COCO, Open Images, or a custom dataset, and you’re limited to the classes you trained for.
Prompt‑Based Object Detection
I’ve been exploring a different approach: prompt‑based object detection, where the inputs are:
- an image
- a free‑form natural language prompt
and the output is a set of localized detections that match the prompt, even when the concept isn’t a single predefined object class.
The tool I built supports complex, compositional prompts, not just simple object names. These prompts can combine attributes, relations, text, and world knowledge—things that don’t map cleanly to standard detector classes.
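To make the input/output contract concrete, here is a minimal sketch of the data shapes involved. The `Detection` class and example values are hypothetical illustrations, not the actual tool's API; the score-threshold post-processing step is the standard way such raw detections are filtered:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    # Axis-aligned box in pixel coordinates: (x_min, y_min, x_max, y_max).
    box: tuple[float, float, float, float]
    score: float   # model confidence in [0, 1]
    phrase: str    # the part of the prompt this box grounds

def filter_detections(dets: list[Detection], min_score: float = 0.3) -> list[Detection]:
    """Keep only detections above a confidence threshold, highest score first."""
    kept = [d for d in dets if d.score >= min_score]
    return sorted(kept, key=lambda d: d.score, reverse=True)

# Hypothetical output for the prompt "the red mug left of the laptop":
dets = [
    Detection(box=(40, 120, 180, 260), score=0.82, phrase="red mug"),
    Detection(box=(300, 90, 640, 400), score=0.15, phrase="laptop"),
]
print([d.phrase for d in filter_detections(dets)])  # only the confident match survives
```

The key difference from a fixed-class detector is the `phrase` field: instead of a class index into a closed label set, each box is tied back to the free-form text it grounds.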
What It’s Not Designed For
- Very small objects
- Obscure, barely visible objects
- Dense real‑time detection out of the box
It performs better on concepts that require reasoning and world knowledge rather than pixel‑level precision on tiny targets.
Motivation
The main motivation so far has been creating training data for highly specific detectors. Instead of manually labeling or training a new detector for every niche concept, this approach can be used to:
- Bootstrap datasets
- Explore whether a concept is learnable
- Validate prompts before committing to full training pipelines
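As a sketch of the bootstrapping step, prompt-run detections can be exported as COCO-format annotations for training a conventional detector downstream. The schema below follows the standard COCO layout; the file name and box values are made-up examples, and the `(x1, y1, x2, y2, label)` input tuples are an assumed intermediate format:

```python
import json

def to_coco(image_id: int, file_name: str, size: tuple[int, int],
            detections: list[tuple[float, float, float, float, str]]) -> dict:
    """Convert (x_min, y_min, x_max, y_max, label) detections into a
    minimal COCO-format dataset dict."""
    labels = sorted({label for *_, label in detections})
    cat_ids = {label: i + 1 for i, label in enumerate(labels)}
    return {
        "images": [{"id": image_id, "file_name": file_name,
                    "width": size[0], "height": size[1]}],
        "categories": [{"id": cid, "name": name} for name, cid in cat_ids.items()],
        "annotations": [
            {"id": i + 1, "image_id": image_id,
             "category_id": cat_ids[label],
             # COCO boxes are [x, y, width, height], not corner pairs.
             "bbox": [x1, y1, x2 - x1, y2 - y1],
             "area": (x2 - x1) * (y2 - y1), "iscrowd": 0}
            for i, (x1, y1, x2, y2, label) in enumerate(detections)
        ],
    }

coco = to_coco(1, "kitchen.jpg", (640, 480),
               [(40, 120, 180, 260, "red mug")])
print(json.dumps(coco, indent=2))
```

Once detections for a niche concept are serialized this way, any off-the-shelf training pipeline that reads COCO annotations can consume them without custom loaders.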
Demo
I’ve made the tool publicly available as a demo:
Detect Anything – Free AI Object Detection Online
- No login required.
- Images are processed transiently and not stored.
- (Please don’t abuse it; inference is relatively expensive.)
Open Questions
I’m especially interested in:
- Good real‑world use cases people see for this
- Stress‑testing and failure modes
- Situations where this approach breaks down compared to task‑specific detectors
If you’ve worked with grounding, referring‑expression comprehension, or prompt‑based vision models, I’d love to hear your thoughts.