[Paper] GREx: Generalized Referring Expression Segmentation, Comprehension, and Generation

Published: January 8, 2026 at 01:59 PM EST
4 min read
Source: arXiv - 2601.05244v1

Overview

The paper GREx expands the classic “referring expression” problem—where a textual phrase points to a single object in an image—so that a single expression can refer to any number of objects, including zero. By introducing new benchmarks (GRES, GREC, GREG) and a large‑scale dataset (gRefCOCO) that contain multi‑target, no‑target, and single‑target cases, the authors expose a gap in current models and propose a new baseline, ReLA, that sets the state‑of‑the‑art on these generalized tasks.

Key Contributions

  • Generalized task definition (GREx) that unifies segmentation, detection, and generation for expressions describing arbitrary sets of objects.
  • gRefCOCO dataset: the first large‑scale collection of multi‑target, no‑target, and single‑target referring expressions, designed to remain backward‑compatible with existing RES/REC/REG benchmarks.
  • ReLA baseline: a region‑level attention architecture that (1) splits an image into adaptive sub‑instance regions, (2) models region‑to‑region relationships, and (3) aligns these with language cues.
  • Comprehensive evaluation: extensive experiments showing a sizable performance drop of existing RES/REC/REG models on the generalized tasks, and ReLA’s superior results.
  • Open resources: code, data, and pretrained models released publicly for reproducibility and further research.

Methodology

Dataset Construction

  • Started from the popular RefCOCO/RefCOCO+ images.
  • Crowdsourced new expressions that either (a) refer to multiple objects of the same class, (b) refer to no object (e.g., “the unicorn in the picture”), or (c) keep the traditional single‑object format.
  • Each expression is paired with pixel‑level masks (for segmentation) and bounding boxes (for detection).
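To make the three expression types concrete, the sketch below shows how such annotations might be organized as records; the field names and values are illustrative placeholders, not the released gRefCOCO schema.

```python
# Hypothetical sketch of one gRefCOCO-style record per expression type.
# Field names and values are illustrative placeholders, not the dataset's actual schema.

multi_target = {
    "image_id": 42,
    "expression": "the two dogs next to each other",
    "instance_ids": [3, 7],                      # all referred COCO instances
    "boxes": [[12.0, 40.0, 88.0, 64.0],          # one (x, y, w, h) box per instance
              [110.0, 38.0, 90.0, 66.0]],
}

no_target = {
    "image_id": 42,
    "expression": "the unicorn in the picture",
    "instance_ids": [],                          # empty target set
    "boxes": [],
}

single_target = {
    "image_id": 42,
    "expression": "the dog on the left",
    "instance_ids": [3],                         # classic RefCOCO-style case
    "boxes": [[12.0, 40.0, 88.0, 64.0]],
}
```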

Problem Formalization

  • GRES: Given an image and an expression, output a binary mask that covers all mentioned objects (or an empty mask if none).
  • GREC: Same input, but output a set of bounding boxes.
  • GREG: Given an image and a target set (mask/boxes), generate a natural language expression that accurately describes the set.
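The following is a minimal sketch of the three task interfaces, assuming NumPy arrays for images and masks; the function names and signatures are illustrative, not the paper's API.

```python
from typing import List, Tuple
import numpy as np

def gres(image: np.ndarray, expression: str) -> np.ndarray:
    """Return a binary mask of shape (H, W) covering every referred object; all zeros if none."""
    raise NotImplementedError

def grec(image: np.ndarray, expression: str) -> List[Tuple[float, float, float, float]]:
    """Return a possibly empty list of (x, y, w, h) boxes for the referred objects."""
    raise NotImplementedError

def greg(image: np.ndarray, target_masks: List[np.ndarray]) -> str:
    """Return a natural-language expression describing the given target set."""
    raise NotImplementedError
```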

ReLA Architecture

Region Proposal Layer

The image is partitioned into a flexible grid of sub‑instance regions using a lightweight CNN + adaptive pooling.
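As a rough illustration of this step, the sketch below pools backbone features into a fixed grid of region tokens; ReLA's actual splitter is adaptive and learned, so the stand‑in backbone, grid size, and dimensions here are assumptions.

```python
# A minimal sketch of turning an image into a grid of region features via a
# small CNN + adaptive pooling. Illustrative only; not ReLA's actual splitter.
import torch
import torch.nn as nn

class RegionProposer(nn.Module):
    def __init__(self, dim: int = 256, grid: int = 8):
        super().__init__()
        self.backbone = nn.Sequential(                 # stand-in for a real backbone
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )
        self.pool = nn.AdaptiveAvgPool2d(grid)         # grid x grid sub-instance regions

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)                  # (B, dim, H', W')
        regions = self.pool(feats)                     # (B, dim, grid, grid)
        return regions.flatten(2).transpose(1, 2)      # (B, grid*grid, dim) region tokens

regions = RegionProposer()(torch.randn(1, 3, 224, 224))   # -> (1, 64, 256)
```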

Region‑Region Interaction

A graph‑style transformer updates each region’s representation by attending to all other regions, capturing spatial and semantic relationships (e.g., “the two dogs next to each other”).
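A minimal way to realize this kind of region‑to‑region reasoning is self‑attention over the region tokens, as sketched below with a stock transformer encoder; ReLA's exact block may differ.

```python
# Self-attention over region tokens: every region attends to every other,
# capturing spatial/semantic relations between sub-instance regions.
import torch
import torch.nn as nn

regions = torch.randn(1, 64, 256)            # stand-in for the RegionProposer output above

region_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=2,
)
refined_regions = region_encoder(regions)    # (1, 64, 256)
```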

Region‑Language Fusion

The textual embedding (BERT‑style) attends over the refined region features, producing a joint representation that highlights the regions mentioned in the expression.
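This fusion step can be sketched as cross‑attention in which the word embeddings query the refined region tokens; the use of nn.MultiheadAttention and the dimensions below are assumptions, not ReLA's implementation.

```python
# Cross-attention: word embeddings (query) attend over region tokens (key/value).
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
word_feats = torch.randn(1, 12, 256)         # stand-in for BERT-style token embeddings
region_feats = torch.randn(1, 64, 256)       # stand-in for refined region tokens

# The attention weights indicate which regions are (softly) selected by each word.
fused, attn_weights = cross_attn(query=word_feats, key=region_feats, value=region_feats)
# fused: (1, 12, 256); attn_weights: (1, 12, 64)
```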

Task Heads

  • Segmentation head → upsampled mask per region, merged into final mask.
  • Detection head → bounding‑box regression per region, filtered by confidence.
  • Generation head → decoder that conditions on the selected region set to produce a fluent expression.
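A toy version of the segmentation and detection heads is sketched below; the head designs and the confidence threshold are illustrative, not the paper's, and the generation head (a text decoder over the selected regions) is only noted in a comment.

```python
# Illustrative per-region heads and a simple merge step; not ReLA's actual heads.
import torch
import torch.nn as nn

dim, n_regions = 256, 64
region_feats = torch.randn(1, n_regions, dim)          # stand-in for fused region features

mask_head = nn.Linear(dim, 16 * 16)                    # coarse per-region mask logits
box_head = nn.Linear(dim, 4)                           # (x, y, w, h) per region
score_head = nn.Linear(dim, 1)                         # per-region confidence

masks = mask_head(region_feats).view(1, n_regions, 16, 16)
boxes = box_head(region_feats)                         # (1, 64, 4)
scores = score_head(region_feats).sigmoid()            # (1, 64, 1)

# Merge: keep confident regions, union their masks, collect their boxes.
# A generation head would instead run a text decoder conditioned on the kept regions.
keep = scores.squeeze(-1) > 0.5                        # (1, 64) bool
final_mask = (masks.sigmoid() * keep[..., None, None]).amax(dim=1)   # (1, 16, 16)
final_boxes = boxes[keep]                              # (K, 4)
```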

Training & Evaluation

  • Multi‑task loss combining segmentation Dice, detection IoU, and language cross‑entropy.
  • Standard metrics (mIoU, AP@0.5, BLEU/ROUGE) computed separately for single‑, multi‑, and no‑target subsets.
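A minimal sketch of such a weighted multi‑task objective is shown below; the L1 box term stands in for the IoU‑based detection loss, and the weights and exact terms are assumptions rather than the paper's loss.

```python
# Illustrative multi-task loss: Dice (segmentation) + box regression (detection)
# + cross-entropy (generation), combined with scalar weights.
import torch
import torch.nn.functional as F

def dice_loss(pred_mask: torch.Tensor, gt_mask: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Soft Dice loss on mask logits vs. binary ground truth, per batch element."""
    pred = pred_mask.sigmoid().flatten(1)
    gt = gt_mask.flatten(1)
    inter = (pred * gt).sum(-1)
    return 1 - (2 * inter + eps) / (pred.sum(-1) + gt.sum(-1) + eps)

def grex_loss(pred_mask, gt_mask, pred_boxes, gt_boxes, word_logits, gt_words,
              w_seg=1.0, w_det=1.0, w_gen=1.0):
    """Weighted sum of segmentation, detection, and generation terms (illustrative)."""
    seg = dice_loss(pred_mask, gt_mask).mean()
    det = F.l1_loss(pred_boxes, gt_boxes)              # stand-in for an IoU-style box loss
    gen = F.cross_entropy(word_logits.flatten(0, 1), gt_words.flatten())
    return w_seg * seg + w_det * det + w_gen * gen
```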

Results & Findings

Task           | Baseline (old RES/REC/REG), single → multi | ReLA (proposed), single → multi
GRES (mIoU)    | 38.2 % → 21.5 %                            | 48.7 % → 35.9 %
GREC (AP@0.5)  | 44.1 % → 26.3 %                            | 57.4 % → 41.2 %
GREG (BLEU‑4)  | 22.8 % → 12.1 %                            | 30.5 % → 19.8 %
  • Existing models suffer a 30‑40 % relative drop when moving from single‑target to multi‑target/no‑target cases.
  • ReLA narrows the gap dramatically, confirming that explicit region‑region reasoning is crucial for generalized referring tasks.
  • Ablation studies show that removing the region‑region transformer reduces performance by ~7 % absolute, highlighting its importance.

Practical Implications

  • Human‑Robot Interaction: Robots can interpret commands like “pick up all the red cups” and gracefully handle requests for objects that are not in the scene (e.g., a missing screwdriver) instead of failing.
  • Image Editing & Annotation Tools: Users can select multiple objects with a single natural‑language phrase (e.g., “highlight all the trees”) and get accurate masks instantly.
  • Content Moderation: Systems can verify claims such as “no prohibited items in this image” by checking that the expression grounds to zero objects, reducing false positives.
  • Assistive Technologies: Screen‑readers for visually impaired users can generate concise descriptions for groups of objects (“three people sitting at a table”) rather than enumerating each individually.
  • Data Augmentation: The multi‑target/no‑target paradigm enables richer synthetic training data for downstream vision‑language models, improving robustness.

Limitations & Future Work

  • Dataset Bias: gRefCOCO inherits the object distribution of COCO; rare categories remain under‑represented, which may limit generalization to niche domains.
  • Scalability of Region Proposals: The adaptive region splitter works well for moderate‑resolution images but may become computationally heavy for ultra‑high‑resolution inputs.
  • Language Diversity: All expressions are English; extending to multilingual or code‑mixed settings is an open challenge.
  • Beyond Visual Grounding: The current framework focuses on static images; applying the same principles to video (temporal referring expressions) is a promising direction.

The authors release the dataset, code, and pretrained ReLA models, inviting the community to build on this more realistic, “generalized” view of referring expressions.

Authors

  • Henghui Ding
  • Chang Liu
  • Shuting He
  • Xudong Jiang
  • Yu‑Gang Jiang

Paper Information

  • arXiv ID: 2601.05244v1
  • Categories: cs.CV
  • Published: January 8, 2026