[Paper] Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision
Source: arXiv - 2602.13195v1
Overview
The paper introduces Conversational Image Segmentation (CIS), a new task that pushes image‑segmentation models beyond simple object naming to understand abstract, intent‑driven concepts such as “where can I safely store the knife?” or “the surface you can write on”. To benchmark this capability, the authors release ConverSeg, a large dataset covering entities, spatial relations, intent, affordances, functions, safety, and physical reasoning, and they propose ConverSeg‑Net, a model that combines strong visual priors with language understanding. The work shows that existing language‑guided segmentation systems stumble on these richer queries, while the new pipeline makes substantial gains without any human‑annotated mask data.
Key Contributions
- Conversational Image Segmentation (CIS) task: formalizes grounding of abstract, intent‑based language into pixel‑accurate masks.
- ConverSeg benchmark: >200k automatically generated prompt‑mask pairs spanning 7 semantic dimensions (entities, spatial relations, intent, affordances, functions, safety, physical reasoning).
- AI‑powered data engine: synthesizes high‑quality supervision by leveraging large‑language models (LLMs) and pretrained segmentation priors, eliminating costly human labeling.
- ConverSeg‑Net architecture: fuses a frozen, high‑capacity segmentation backbone (e.g., SAM) with a transformer‑based language encoder, enabling joint reasoning over visual and textual cues.
- Comprehensive evaluation: demonstrates that state‑of‑the‑art language‑guided segmentation models (e.g., CLIP‑Seg, LSeg) achieve <30% IoU on CIS, whereas ConverSeg‑Net reaches >55% IoU while retaining strong performance on traditional referring‑expression segmentation benchmarks.
Methodology
Data Generation
- A large‑scale pipeline feeds scene graphs extracted from COCO‑Stuff and ADE20K into a large language model (e.g., GPT‑4).
- The LLM produces natural‑language prompts that encode abstract concepts (e.g., “a surface you can sit on”).
- Simultaneously, a segmentation prior (Segment Anything Model, SAM) provides candidate masks for every region.
- A lightweight verifier matches each prompt to the most semantically appropriate mask using similarity scores from CLIP embeddings, producing a prompt‑mask pair without human input.
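The verifier step above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes prompt and mask-region embeddings have already been computed (the paper uses CLIP; here they are plain vectors), and simply picks the candidate mask whose embedding is most cosine-similar to the prompt.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_prompt_to_mask(prompt_emb: np.ndarray, mask_embs: list) -> int:
    """Return the index of the candidate mask whose embedding is most
    similar to the prompt embedding (argmax over cosine scores).
    In the paper's pipeline the embeddings would come from CLIP; here
    they are placeholder vectors for illustration."""
    scores = [cosine_similarity(prompt_emb, m) for m in mask_embs]
    return int(np.argmax(scores))
```

In practice a similarity threshold would likely also be applied so that prompts with no well-matching mask are discarded rather than paired with the least-bad candidate.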
Model Architecture (ConverSeg‑Net)
- Visual encoder: a frozen SAM encoder that yields dense pixel embeddings and a set of high‑quality region proposals.
- Language encoder: a transformer (e.g., BERT or RoBERTa) that converts the conversational query into a contextual vector.
- Fusion module: cross‑attention layers align language tokens with visual embeddings, allowing the model to attend to the region that best satisfies the abstract intent.
- Mask decoder: predicts a refined binary mask by combining the fused representation with the SAM mask proposals.
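The fusion module's core operation, cross-attention from language tokens onto visual embeddings, can be sketched in a few lines. This is a single-head toy version with hypothetical projection matrices, not the paper's multi-layer module; it only shows the shape of the computation.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(lang_tokens: np.ndarray, vis_embeds: np.ndarray,
                    Wq: np.ndarray, Wk: np.ndarray, Wv: np.ndarray) -> np.ndarray:
    """Single-head cross-attention: queries come from the language tokens,
    keys and values from the visual embeddings, so each language token
    gathers a weighted mix of visual features. Wq/Wk/Wv are learned
    projections (random placeholders in this sketch)."""
    Q = lang_tokens @ Wq              # (n_tokens, d)
    K = vis_embeds @ Wk               # (n_regions, d)
    V = vis_embeds @ Wv               # (n_regions, d)
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (n_tokens, n_regions)
    return attn @ V                   # (n_tokens, d)
```

In the full model this runs over dense pixel embeddings from the frozen SAM encoder, across several layers and heads, before the mask decoder consumes the fused representation.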
Training
- The model is trained end‑to‑end on the synthetic ConverSeg data using a standard binary cross‑entropy loss on the predicted mask.
- A small amount of human‑annotated data (≈5k examples) is used for fine‑tuning to close the domain gap and improve safety‑critical reasoning.
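The binary cross-entropy objective on the predicted mask is standard; a minimal per-pixel version (assuming the decoder outputs probabilities in [0, 1], with clipping for numerical stability) looks like this:

```python
import numpy as np

def bce_mask_loss(pred_probs: np.ndarray, target_mask: np.ndarray,
                  eps: float = 1e-7) -> float:
    """Per-pixel binary cross-entropy between a predicted probability map
    and a binary ground-truth mask, averaged over all pixels. Probabilities
    are clipped to (eps, 1-eps) to avoid log(0)."""
    p = np.clip(pred_probs, eps, 1.0 - eps)
    losses = -(target_mask * np.log(p) + (1.0 - target_mask) * np.log(1.0 - p))
    return float(losses.mean())
```

A real training setup would compute this on logits with a fused sigmoid for stability (e.g. PyTorch's `BCEWithLogitsLoss`), but the quantity being minimized is the same.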
Results & Findings
| Benchmark | mIoU (baseline) | mIoU (ConverSeg‑Net) |
|---|---|---|
| ConverSeg (abstract queries) | 28.4 % | 55.9 % |
| RefCOCO (categorical referring) | 71.2 % | 70.8 % |
| RefCOCO+ (spatial + attribute) | 68.5 % | 68.1 % |
- Significant lift on abstract queries: ConverSeg‑Net nearly doubles mIoU on the new CIS task (28.4 % → 55.9 %), confirming that the fusion of strong visual priors with language reasoning is essential.
- No regression on classic tasks: Performance on existing referring‑expression datasets remains on par, indicating that the model does not sacrifice categorical grounding for abstract reasoning.
- Ablation studies: Removing the cross‑attention fusion drops CIS mIoU to ~38 %, while using a randomly initialized visual encoder collapses performance to baseline levels, underscoring the importance of both components.
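The mIoU numbers above are averages of per-example mask IoU; for reference, a minimal IoU computation over binary masks (not the authors' evaluation code) is:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks.
    Returns 1.0 by convention when both masks are empty."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection) / union if union > 0 else 1.0
```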
Practical Implications
- Human‑robot interaction: Robots can now interpret commands like “hand me the container you can safely store the knife in” and produce precise region masks for manipulation.
- Assistive technologies: Vision‑based aids for visually impaired users could answer “where can I sit safely?” by highlighting suitable surfaces in real time.
- Content creation & AR: Designers can issue intent‑driven prompts (“highlight the area you can write on”) to automatically generate masks for compositing or interactive overlays.
- Safety‑critical inspection: Systems can flag unsafe zones (e.g., “areas where a hot object could cause a burn”) directly on images or video streams, supporting compliance checks in manufacturing or construction.
Limitations & Future Work
- Synthetic supervision bias: Although the data engine removes manual labeling costs, it inherits biases from the underlying LLM and SAM proposals, which may miss rare affordances or cultural nuances.
- Generalization to unseen domains: Performance drops modestly on out‑of‑distribution scenes (e.g., medical imagery) where the visual priors lack relevant concepts.
- Real‑time constraints: The cross‑attention fusion adds latency; optimizing for edge devices remains an open challenge.
- Future directions: The authors plan to incorporate multimodal feedback loops (e.g., interactive clarification dialogs) and to explore few‑shot adaptation techniques that can quickly specialize the model to niche domains such as autonomous driving or industrial inspection.
Authors
- Aadarsh Sahoo
- Georgia Gkioxari
Paper Information
- arXiv ID: 2602.13195v1
- Categories: cs.CV
- Published: February 13, 2026