[Paper] GREx: Generalized Referring Expression Segmentation, Comprehension, and Generation

Published: January 8, 2026 at 01:59 PM EST
4 min read
Source: arXiv - 2601.05244v1

Overview

The paper GREx expands the classic “referring expression” problem—where a textual phrase points to a single object in an image—so that a single expression can refer to any number of objects, including zero. By introducing new benchmarks (GRES, GREC, GREG) and a large‑scale dataset (gRefCOCO) that contain multi‑target, no‑target, and single‑target cases, the authors expose a gap in current models and propose a new baseline, ReLA, that sets the state‑of‑the‑art on these generalized tasks.

Key Contributions

  • Generalized task definition (GREx) that unifies segmentation, detection, and generation for expressions describing arbitrary sets of objects.
  • gRefCOCO dataset: the first large‑scale collection of multi‑target, no‑target, and single‑target referring expressions, designed to remain backward‑compatible with existing RES/REC/REG benchmarks.
  • ReLA baseline: a region‑level attention architecture that (1) splits an image into adaptive sub‑instance regions, (2) models region‑to‑region relationships, and (3) aligns these with language cues.
  • Comprehensive evaluation: extensive experiments showing a sizable performance drop of existing RES/REC/REG models on the generalized tasks, and ReLA’s superior results.
  • Open resources: code, data, and pretrained models released publicly for reproducibility and further research.

Methodology

Dataset Construction

  • Started from the popular RefCOCO/RefCOCO+ images.
  • Crowdsourced new expressions that either (a) refer to multiple objects of the same class, (b) refer to no object (e.g., “the unicorn in the picture”), or (c) keep the traditional single‑object format.
  • Each expression is paired with pixel‑level masks (for segmentation) and bounding boxes (for detection).
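To make the three expression types concrete, the sketch below shows how such annotations might be organized as records; the field names and values are illustrative placeholders, not the released gRefCOCO schema.

```python
# Hypothetical sketch of one gRefCOCO-style record per expression type.
# Field names and values are illustrative placeholders, not the dataset's actual schema.

multi_target = {
    "image_id": 42,
    "expression": "the two dogs next to each other",
    "instance_ids": [3, 7],                      # all referred COCO instances
    "boxes": [[12.0, 40.0, 88.0, 64.0],          # one (x, y, w, h) box per instance
              [110.0, 38.0, 90.0, 66.0]],
}

no_target = {
    "image_id": 42,
    "expression": "the unicorn in the picture",
    "instance_ids": [],                          # empty target set
    "boxes": [],
}

single_target = {
    "image_id": 42,
    "expression": "the dog on the left",
    "instance_ids": [3],                         # classic RefCOCO-style case
    "boxes": [[12.0, 40.0, 88.0, 64.0]],
}
```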

Problem Formalization

  • GRES: Given an image and an expression, output a binary mask that covers all mentioned objects (or an empty mask if none).
  • GREC: Same input, but output a set of bounding boxes.
  • GREG: Given an image and a target set (mask/boxes), generate a natural language expression that accurately describes the set.
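The following is a minimal sketch of the three task interfaces, assuming NumPy arrays for images and masks; the function names and signatures are illustrative, not the paper's API.

```python
from typing import List, Tuple
import numpy as np

def gres(image: np.ndarray, expression: str) -> np.ndarray:
    """Return a binary mask of shape (H, W) covering every referred object; all zeros if none."""
    raise NotImplementedError

def grec(image: np.ndarray, expression: str) -> List[Tuple[float, float, float, float]]:
    """Return a possibly empty list of (x, y, w, h) boxes for the referred objects."""
    raise NotImplementedError

def greg(image: np.ndarray, target_masks: List[np.ndarray]) -> str:
    """Return a natural-language expression describing the given target set."""
    raise NotImplementedError
```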

ReLA Architecture

Region Proposal Layer

The image is partitioned into a flexible grid of sub‑instance regions using a lightweight CNN + adaptive pooling.
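As a rough illustration of this step, the sketch below pools backbone features into a fixed grid of region tokens; ReLA's actual splitter is adaptive and learned, so the stand‑in backbone, grid size, and dimensions here are assumptions.

```python
# A minimal sketch of turning an image into a grid of region features via a
# small CNN + adaptive pooling. Illustrative only; not ReLA's actual splitter.
import torch
import torch.nn as nn

class RegionProposer(nn.Module):
    def __init__(self, dim: int = 256, grid: int = 8):
        super().__init__()
        self.backbone = nn.Sequential(                 # stand-in for a real backbone
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )
        self.pool = nn.AdaptiveAvgPool2d(grid)         # grid x grid sub-instance regions

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)                  # (B, dim, H', W')
        regions = self.pool(feats)                     # (B, dim, grid, grid)
        return regions.flatten(2).transpose(1, 2)      # (B, grid*grid, dim) region tokens

regions = RegionProposer()(torch.randn(1, 3, 224, 224))   # -> (1, 64, 256)
```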

Region‑Region Interaction

A graph‑style transformer updates each region’s representation by attending to all other regions, capturing spatial and semantic relationships (e.g., “the two dogs next to each other”).
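A minimal way to realize this kind of region‑to‑region reasoning is self‑attention over the region tokens, as sketched below with a stock transformer encoder; ReLA's exact block may differ.

```python
# Self-attention over region tokens: every region attends to every other,
# capturing spatial/semantic relations between sub-instance regions.
import torch
import torch.nn as nn

regions = torch.randn(1, 64, 256)            # stand-in for the RegionProposer output above

region_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=2,
)
refined_regions = region_encoder(regions)    # (1, 64, 256)
```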

Region‑Language Fusion

The textual embedding (BERT‑style) attends over the refined region features, producing a joint representation that highlights the regions mentioned in the expression.
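This fusion step can be sketched as cross‑attention in which the word embeddings query the refined region tokens; the use of nn.MultiheadAttention and the dimensions below are assumptions, not ReLA's implementation.

```python
# Cross-attention: word embeddings (query) attend over region tokens (key/value).
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
word_feats = torch.randn(1, 12, 256)         # stand-in for BERT-style token embeddings
region_feats = torch.randn(1, 64, 256)       # stand-in for refined region tokens

# The attention weights indicate which regions are (softly) selected by each word.
fused, attn_weights = cross_attn(query=word_feats, key=region_feats, value=region_feats)
# fused: (1, 12, 256); attn_weights: (1, 12, 64)
```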

Task Heads

  • Segmentation head → upsampled mask per region, merged into final mask.
  • Detection head → bounding‑box regression per region, filtered by confidence.
  • Generation head → decoder that conditions on the selected region set to produce a fluent expression.
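A toy version of the segmentation and detection heads is sketched below; the head designs and the confidence threshold are illustrative, not the paper's, and the generation head (a text decoder over the selected regions) is only noted in a comment.

```python
# Illustrative per-region heads and a simple merge step; not ReLA's actual heads.
import torch
import torch.nn as nn

dim, n_regions = 256, 64
region_feats = torch.randn(1, n_regions, dim)          # stand-in for fused region features

mask_head = nn.Linear(dim, 16 * 16)                    # coarse per-region mask logits
box_head = nn.Linear(dim, 4)                           # (x, y, w, h) per region
score_head = nn.Linear(dim, 1)                         # per-region confidence

masks = mask_head(region_feats).view(1, n_regions, 16, 16)
boxes = box_head(region_feats)                         # (1, 64, 4)
scores = score_head(region_feats).sigmoid()            # (1, 64, 1)

# Merge: keep confident regions, union their masks, collect their boxes.
# A generation head would instead run a text decoder conditioned on the kept regions.
keep = scores.squeeze(-1) > 0.5                        # (1, 64) bool
final_mask = (masks.sigmoid() * keep[..., None, None]).amax(dim=1)   # (1, 16, 16)
final_boxes = boxes[keep]                              # (K, 4)
```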

Training & Evaluation

  • Multi‑task loss combining segmentation Dice, detection IoU, and language cross‑entropy.
  • Standard metrics (mIoU, AP@0.5, BLEU/ROUGE) computed separately for single‑, multi‑, and no‑target subsets.
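A minimal sketch of such a weighted multi‑task objective is shown below; the L1 box term stands in for the IoU‑based detection loss, and the weights and exact terms are assumptions rather than the paper's loss.

```python
# Illustrative multi-task loss: Dice (segmentation) + box regression (detection)
# + cross-entropy (generation), combined with scalar weights.
import torch
import torch.nn.functional as F

def dice_loss(pred_mask: torch.Tensor, gt_mask: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Soft Dice loss on mask logits vs. binary ground truth, per batch element."""
    pred = pred_mask.sigmoid().flatten(1)
    gt = gt_mask.flatten(1)
    inter = (pred * gt).sum(-1)
    return 1 - (2 * inter + eps) / (pred.sum(-1) + gt.sum(-1) + eps)

def grex_loss(pred_mask, gt_mask, pred_boxes, gt_boxes, word_logits, gt_words,
              w_seg=1.0, w_det=1.0, w_gen=1.0):
    """Weighted sum of segmentation, detection, and generation terms (illustrative)."""
    seg = dice_loss(pred_mask, gt_mask).mean()
    det = F.l1_loss(pred_boxes, gt_boxes)              # stand-in for an IoU-style box loss
    gen = F.cross_entropy(word_logits.flatten(0, 1), gt_words.flatten())
    return w_seg * seg + w_det * det + w_gen * gen
```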

Results & Findings

Task           | Baseline (old RES/REC/REG), single → multi | ReLA (proposed), single → multi
GRES (mIoU)    | 38.2 % → 21.5 %                            | 48.7 % → 35.9 %
GREC (AP@0.5)  | 44.1 % → 26.3 %                            | 57.4 % → 41.2 %
GREG (BLEU‑4)  | 22.8 % → 12.1 %                            | 30.5 % → 19.8 %
  • Existing models suffer a 30‑40 % relative drop when moving from single‑target to multi‑target/no‑target cases.
  • ReLA narrows the gap dramatically, confirming that explicit region‑region reasoning is crucial for generalized referring tasks.
  • Ablation studies show that removing the region‑region transformer reduces performance by ~7 % absolute, highlighting its importance.

Practical Implications

  • Human‑Robot Interaction: Robots can interpret commands like “pick up all the red cups” and gracefully handle requests for objects that are not in the scene (e.g., a missing screwdriver) instead of failing.
  • Image Editing & Annotation Tools: Users can select multiple objects with a single natural‑language phrase (e.g., “highlight all the trees”) and get accurate masks instantly.
  • Content Moderation: Systems can verify claims such as “no prohibited items in this image” by checking that the expression grounds to zero objects, reducing false positives.
  • Assistive Technologies: Screen‑readers for visually impaired users can generate concise descriptions for groups of objects (“three people sitting at a table”) rather than enumerating each individually.
  • Data Augmentation: The multi‑target/no‑target paradigm enables richer synthetic training data for downstream vision‑language models, improving robustness.

Limitations & Future Work

  • Dataset Bias: gRefCOCO inherits the object distribution of COCO; rare categories remain under‑represented, which may limit generalization to niche domains.
  • Scalability of Region Proposals: The adaptive region splitter works well for moderate‑resolution images but may become computationally heavy for ultra‑high‑resolution inputs.
  • Language Diversity: All expressions are English; extending to multilingual or code‑mixed settings is an open challenge.
  • Beyond Visual Grounding: The current framework focuses on static images; applying the same principles to video (temporal referring expressions) is a promising direction.

The authors release the dataset, code, and pretrained ReLA models, inviting the community to build on this more realistic, “generalized” view of referring expressions.

Authors

  • Henghui Ding
  • Chang Liu
  • Shuting He
  • Xudong Jiang
  • Yu‑Gang Jiang

Paper Information

  • arXiv ID: 2601.05244v1
  • Categories: cs.CV
  • Published: January 8, 2026