[Paper] Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?

Published: February 26, 2026 at 01:45 PM EST
5 min read
Source: arXiv


Overview

Open‑vocabulary segmentation (OVS) lets you ask a vision‑language model to segment any object you can describe in text, but it still falls short of fully supervised models trained on pixel‑level labels. This paper shows that adding just a handful of annotated examples—a few‑shot support set—can dramatically close that performance gap while keeping the flexibility of open‑vocabulary queries.

Key Contributions

  • Few‑shot OVS formulation: Introduces a test‑time setting where a small, user‑provided support set of pixel‑annotated images augments the textual prompt.
  • Retrieval‑augmented adapter: Proposes a lightweight per‑image classifier that fuses visual features from the support set with the text embedding of the query, learning the fusion per query rather than using fixed hand‑crafted rules.
  • Continual support expansion: The adapter can incorporate new support examples on the fly, enabling personalized or fine‑grained segmentation without retraining the whole model.
  • Strong empirical gains: Demonstrates that with as few as 1–5 support images, the method narrows the performance gap between zero‑shot OVS and fully supervised segmentation by up to 30 % on standard benchmarks.
  • Open‑vocabulary preservation: Even with the few‑shot boost, the system still accepts arbitrary text prompts, keeping the original flexibility of VLMs.

Methodology

  1. Base model: Starts from a pre‑trained vision‑language model (e.g., CLIP) that provides a text embedding for the target class and a dense visual feature map for the input image.
  2. Support set retrieval: For a given query, the system retrieves a small set of images that have pixel‑level masks for the same class (or a related class). These images are assumed to be available at test time (e.g., a user uploads a few annotated examples).
  3. Feature extraction: Visual features are pooled from the support images using the provided masks, producing a support visual prototype for the class.
  4. Learned fusion adapter: A tiny neural module (a few linear layers with a softmax) takes three inputs: the query’s visual features, the text embedding, and the support visual prototype. It learns a per‑query weighting that blends text‑only and vision‑only cues into a per‑image classifier.
  5. Segmentation head: The fused classifier is applied to the dense query feature map, yielding a pixel‑wise probability map for the target class.
  6. Continual update: Adding more support images simply updates the prototype (e.g., via averaging) and fine‑tunes the adapter with a few gradient steps—no full model retraining needed.
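The pipeline above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's exact architecture: the shapes, the global pooling of the query features, and the two‑way gating are assumptions made for clarity.

```python
# Sketch of steps 3-5: masked pooling of support features into a prototype,
# a tiny learned adapter that fuses text and support cues, and pixel-wise
# classification of the query feature map. Hypothetical shapes and names.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 512  # embedding dimension of the frozen VLM backbone (e.g. CLIP ViT-B/32)

def support_prototype(feats, masks):
    """Masked average pooling over the support set.
    feats: (S, D, H, W) dense features; masks: (S, H, W) binary masks."""
    m = masks.unsqueeze(1)                                    # (S, 1, H, W)
    pooled = (feats * m).sum(dim=(2, 3)) / m.sum(dim=(2, 3)).clamp(min=1e-6)
    return pooled.mean(dim=0)                                 # (D,) class prototype

class FusionAdapter(nn.Module):
    """Tiny module that learns a per-query blend of text-only and vision-only cues."""
    def __init__(self, dim=D):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 2))          # weights for [text, proto]

    def forward(self, query_feats, text_emb, proto):
        # query_feats: (D, H, W); text_emb, proto: (D,)
        pooled_q = query_feats.mean(dim=(1, 2))               # global query summary
        w = torch.softmax(self.gate(torch.cat([pooled_q, text_emb, proto])), dim=-1)
        classifier = w[0] * text_emb + w[1] * proto           # fused per-image classifier
        classifier = F.normalize(classifier, dim=0)
        logits = torch.einsum("d,dhw->hw", classifier,
                              F.normalize(query_feats, dim=0))
        return logits.sigmoid()                               # pixel-wise probability map
```

Step 6 then amounts to recomputing `support_prototype` over the enlarged support set and taking a few gradient steps on the adapter; the backbone stays frozen throughout.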

Results & Findings

Setting                          | mIoU (mean Intersection‑over‑Union) | Gap to fully supervised
---------------------------------|-------------------------------------|------------------------
Zero‑shot OVS (baseline)         | 38.2 %                              | 30 %
Few‑shot (1 support)             | 44.9 %                              | 23 %
Few‑shot (5 supports)            | 51.3 %                              | 16 %
Fully supervised (same backbone) | 68.2 %                              | –
  • Rapid improvement: Even a single annotated example yields a ~7 % absolute mIoU boost.
  • Diminishing returns: Gains plateau after ~5–10 examples, indicating the adapter efficiently extracts the most useful signal early on.
  • Fine‑grained tasks: On personalized segmentation (e.g., “my dog’s red collar”), the method outperforms prior zero‑shot OVS baselines by >15 % mIoU, showing it can capture subtle visual nuances.
  • Speed: The adapter adds < 5 ms inference overhead on a modern GPU, making it suitable for real‑time applications.

Practical Implications

  • Rapid prototyping: Developers can build custom segmentation tools by simply uploading a few labeled images instead of curating massive datasets.
  • Personalized AI services: SaaS platforms (e.g., photo editors, AR filters) can let users define their own segmentation classes on the fly—think “segment my favorite coffee mug” with only a couple of user‑provided masks.
  • Edge deployment: Because the adapter is tiny and operates at test time, it can run on‑device (mobile, embedded) alongside a frozen CLIP backbone, preserving privacy and reducing server load.
  • Continuous learning pipelines: Companies can continuously enrich their support pool with new examples collected from users, improving segmentation quality without costly retraining cycles.
  • Cross‑modal research: The learned fusion strategy can inspire similar few‑shot adapters for other tasks like open‑vocabulary detection, depth estimation, or video segmentation.

Limitations & Future Work

  • Support set quality: The approach assumes the few annotated masks are reasonably clean; noisy or highly inconsistent annotations can degrade performance.
  • Scalability of retrieval: While the paper uses a simple nearest‑neighbor lookup, scaling to millions of potential support images may require more sophisticated indexing.
  • Domain shift: The method is evaluated on standard benchmarks; performance on wildly different domains (e.g., medical imaging) remains an open question.
  • Extension to multi‑class queries: Current experiments focus on a single target class per inference; handling multiple simultaneous classes efficiently is left for future research.
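The nearest‑neighbor lookup mentioned above is straightforward when the support pool is small; a minimal NumPy sketch follows. The pool layout and function name are assumptions, and at the scale of millions of images this brute‑force scan is exactly what would be replaced by an approximate index (e.g. FAISS).

```python
# Minimal brute-force nearest-neighbor retrieval over a pool of support-image
# embeddings. Hypothetical setup: `pool` rows are L2-normalized global embeddings.
import numpy as np

def retrieve_support(query_emb, pool, k=5):
    """Return indices of the k pool entries most similar to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = pool @ q                     # cosine similarity (pool rows pre-normalized)
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
pool = rng.standard_normal((100, 512))
pool /= np.linalg.norm(pool, axis=1, keepdims=True)
# a slightly perturbed copy of entry 7 should retrieve entry 7 first
idx = retrieve_support(pool[7] + 0.01 * rng.standard_normal(512), pool, k=3)
```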

Bottom line: By marrying a tiny, learned fusion module with a few user‑provided masks, this work shows that open‑vocabulary segmentation can get dramatically closer to fully supervised performance—without sacrificing the flexibility that makes VLMs so powerful. For developers, it opens the door to on‑demand, personalized segmentation services that can be built and iterated quickly.

Authors

  • Tilemachos Aravanis
  • Vladan Stojnić
  • Bill Psomas
  • Nikos Komodakis
  • Giorgos Tolias

Paper Information

  • arXiv ID: 2602.23339v1
  • Categories: cs.CV
  • Published: February 26, 2026
