[Paper] Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?

Published: February 26, 2026 at 01:45 PM EST
5 min read
Source: arXiv


Overview

Open‑vocabulary segmentation (OVS) lets you ask a vision‑language model to segment any object you can describe in text, but it still falls short of fully supervised models trained on pixel‑level labels. This paper shows that adding just a handful of annotated examples—a few‑shot support set—can dramatically close that performance gap while keeping the flexibility of open‑vocabulary queries.

Key Contributions

  • Few‑shot OVS formulation: Introduces a test‑time setting where a small, user‑provided support set of pixel‑annotated images augments the textual prompt.
  • Retrieval‑augmented adapter: Proposes a lightweight per‑image classifier that fuses visual features from the support set with the text embedding of the query, learning the fusion per query rather than using fixed hand‑crafted rules.
  • Continual support expansion: The adapter can incorporate new support examples on the fly, enabling personalized or fine‑grained segmentation without retraining the whole model.
  • Strong empirical gains: Demonstrates that with as few as 1–5 support images, the method narrows the performance gap between zero‑shot OVS and fully supervised segmentation by up to 30 % on standard benchmarks.
  • Open‑vocabulary preservation: Even with the few‑shot boost, the system still accepts arbitrary text prompts, keeping the original flexibility of VLMs.

Methodology

  1. Base model: Starts from a pre‑trained vision‑language model (e.g., CLIP) that provides a text embedding for the target class and a dense visual feature map for the input image.
  2. Support set retrieval: For a given query, the system retrieves a small set of images that have pixel‑level masks for the same class (or a related class). These images are assumed to be available at test time (e.g., a user uploads a few annotated examples).
  3. Feature extraction: Visual features are pooled from the support images using the provided masks, producing a support visual prototype for the class.
  4. Learned fusion adapter: A tiny neural module (a few linear layers with a softmax) takes three inputs: the query’s visual features, the text embedding, and the support visual prototype. It learns a per‑query weighting that blends text‑only and vision‑only cues into a per‑image classifier.
  5. Segmentation head: The fused classifier is applied to the dense query feature map, yielding a pixel‑wise probability map for the target class.
  6. Continual update: Adding more support images simply updates the prototype (e.g., via averaging) and fine‑tunes the adapter with a few gradient steps—no full model retraining needed.
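The pipeline above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's exact architecture: the shapes, the global pooling of the query features, and the two‑way gating are assumptions made for clarity.

```python
# Sketch of steps 3-5: masked pooling of support features into a prototype,
# a tiny learned adapter that fuses text and support cues, and pixel-wise
# classification of the query feature map. Hypothetical shapes and names.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 512  # embedding dimension of the frozen VLM backbone (e.g. CLIP ViT-B/32)

def support_prototype(feats, masks):
    """Masked average pooling over the support set.
    feats: (S, D, H, W) dense features; masks: (S, H, W) binary masks."""
    m = masks.unsqueeze(1)                                    # (S, 1, H, W)
    pooled = (feats * m).sum(dim=(2, 3)) / m.sum(dim=(2, 3)).clamp(min=1e-6)
    return pooled.mean(dim=0)                                 # (D,) class prototype

class FusionAdapter(nn.Module):
    """Tiny module that learns a per-query blend of text-only and vision-only cues."""
    def __init__(self, dim=D):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 2))          # weights for [text, proto]

    def forward(self, query_feats, text_emb, proto):
        # query_feats: (D, H, W); text_emb, proto: (D,)
        pooled_q = query_feats.mean(dim=(1, 2))               # global query summary
        w = torch.softmax(self.gate(torch.cat([pooled_q, text_emb, proto])), dim=-1)
        classifier = w[0] * text_emb + w[1] * proto           # fused per-image classifier
        classifier = F.normalize(classifier, dim=0)
        logits = torch.einsum("d,dhw->hw", classifier,
                              F.normalize(query_feats, dim=0))
        return logits.sigmoid()                               # pixel-wise probability map
```

Step 6 then amounts to recomputing `support_prototype` over the enlarged support set and taking a few gradient steps on the adapter; the backbone stays frozen throughout.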

Results & Findings

Setting                          | mIoU (mean Intersection‑over‑Union) | Gap to fully supervised
---------------------------------|-------------------------------------|------------------------
Zero‑shot OVS (baseline)         | 38.2 %                              | 30 %
Few‑shot (1 support)             | 44.9 %                              | 23 %
Few‑shot (5 supports)            | 51.3 %                              | 16 %
Fully supervised (same backbone) | 68.2 %                              | –
  • Rapid improvement: Even a single annotated example yields a ~7 % absolute mIoU boost.
  • Diminishing returns: Gains plateau after ~5–10 examples, indicating the adapter efficiently extracts the most useful signal early on.
  • Fine‑grained tasks: On personalized segmentation (e.g., “my dog’s red collar”), the method outperforms prior zero‑shot OVS baselines by >15 % mIoU, showing it can capture subtle visual nuances.
  • Speed: The adapter adds < 5 ms inference overhead on a modern GPU, making it suitable for real‑time applications.

Practical Implications

  • Rapid prototyping: Developers can build custom segmentation tools by simply uploading a few labeled images instead of curating massive datasets.
  • Personalized AI services: SaaS platforms (e.g., photo editors, AR filters) can let users define their own segmentation classes on the fly—think “segment my favorite coffee mug” with only a couple of user‑provided masks.
  • Edge deployment: Because the adapter is tiny and operates at test time, it can run on‑device (mobile, embedded) alongside a frozen CLIP backbone, preserving privacy and reducing server load.
  • Continuous learning pipelines: Companies can continuously enrich their support pool with new examples collected from users, improving segmentation quality without costly retraining cycles.
  • Cross‑modal research: The learned fusion strategy can inspire similar few‑shot adapters for other tasks like open‑vocabulary detection, depth estimation, or video segmentation.

Limitations & Future Work

  • Support set quality: The approach assumes the few annotated masks are reasonably clean; noisy or highly inconsistent annotations can degrade performance.
  • Scalability of retrieval: While the paper uses a simple nearest‑neighbor lookup, scaling to millions of potential support images may require more sophisticated indexing.
  • Domain shift: The method is evaluated on standard benchmarks; performance on wildly different domains (e.g., medical imaging) remains an open question.
  • Extension to multi‑class queries: Current experiments focus on a single target class per inference; handling multiple simultaneous classes efficiently is left for future research.
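The nearest‑neighbor lookup mentioned above is straightforward when the support pool is small; a minimal NumPy sketch follows. The pool layout and function name are assumptions, and at the scale of millions of images this brute‑force scan is exactly what would be replaced by an approximate index (e.g. FAISS).

```python
# Minimal brute-force nearest-neighbor retrieval over a pool of support-image
# embeddings. Hypothetical setup: `pool` rows are L2-normalized global embeddings.
import numpy as np

def retrieve_support(query_emb, pool, k=5):
    """Return indices of the k pool entries most similar to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = pool @ q                     # cosine similarity (pool rows pre-normalized)
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
pool = rng.standard_normal((100, 512))
pool /= np.linalg.norm(pool, axis=1, keepdims=True)
# a slightly perturbed copy of entry 7 should retrieve entry 7 first
idx = retrieve_support(pool[7] + 0.01 * rng.standard_normal(512), pool, k=3)
```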

Bottom line: By marrying a tiny, learned fusion module with a few user‑provided masks, this work shows that open‑vocabulary segmentation can get dramatically closer to fully supervised performance—without sacrificing the flexibility that makes VLMs so powerful. For developers, it opens the door to on‑demand, personalized segmentation services that can be built and iterated quickly.

Authors

  • Tilemachos Aravanis
  • Vladan Stojnić
  • Bill Psomas
  • Nikos Komodakis
  • Giorgos Tolias

Paper Information

  • arXiv ID: 2602.23339v1
  • Categories: cs.CV
  • Published: February 26, 2026
