[Paper] Relational Visual Similarity

Published: December 8, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.07833v1

Overview

The paper “Relational Visual Similarity” tackles a blind spot in today’s computer‑vision toolbox: existing similarity metrics (LPIPS, CLIP, DINO, etc.) compare images only by their surface appearance, ignoring the relational structure that humans effortlessly perceive (e.g., the Earth’s crust, mantle, and core map onto a peach’s skin, flesh, and pit). By defining and measuring relational similarity, the authors open a new avenue for connecting images through the logic of their parts rather than their colors or textures.

Key Contributions

  • Formal definition of relational visual similarity – two images are relationally similar when the internal functional relationships among their visual elements correspond, regardless of visual attributes (a toy illustration follows this list).
  • Large‑scale relational caption dataset – 114 k image–caption pairs where captions describe relations (e.g., “outer layer surrounds inner core”) while deliberately anonymizing concrete objects.
  • Fine‑tuned Vision‑Language model (RelSim‑VL) – built on a pre‑trained CLIP backbone, trained to embed images such that relationally similar pairs are close in representation space.
  • Comprehensive evaluation – benchmarked against LPIPS, CLIP, DINO, and human judgments on a new relational similarity test set, showing a 30‑40 % improvement in correlation with human relational judgments.
  • Demonstration of downstream utility – applied RelSim‑VL to tasks like analogical image retrieval, scene‑graph generation, and zero‑shot reasoning, achieving measurable gains over baselines.
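
To make the definition above concrete, here is a toy sketch (not the paper’s formalism) that represents each image as a handful of (part, relation, part) triples and compares only the anonymized relational structure; the triples and the helper name are illustrative assumptions.

```python
# Toy illustration (not the paper's formalism): describe each image as a set of
# (part, relation, part) triples and compare only the relational structure,
# ignoring what the parts actually are.

def relational_signature(triples):
    """Replace concrete part names with anonymous role IDs, keeping only the relations."""
    roles = {}
    signature = set()
    for a, rel, b in triples:
        ra = roles.setdefault(a, f"role_{len(roles)}")
        rb = roles.setdefault(b, f"role_{len(roles)}")
        signature.add((ra, rel, rb))
    return frozenset(signature)

earth = [("crust", "surrounds", "mantle"), ("mantle", "surrounds", "core")]
peach = [("skin", "surrounds", "flesh"), ("flesh", "surrounds", "pit")]
grid  = [("row", "adjacent_to", "row"), ("column", "crosses", "row")]

print(relational_signature(earth) == relational_signature(peach))  # True: same relational structure
print(relational_signature(earth) == relational_signature(grid))   # False: different relations
```

A full treatment would need matching up to arbitrary relabeling of roles (a graph‑isomorphism problem); this toy only works when the two images’ triples are listed in corresponding order.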

Methodology

  1. Dataset construction

    • Started from diverse image collections (COCO, Open Images, etc.).
    • Human annotators wrote relational captions that abstract away from concrete nouns (“A round outer shell encloses a soft interior”) and focus on the role each visual element plays.
    • Captions were “anonymized” (no object names) to force models to learn relational patterns rather than lexical shortcuts.
  2. Model architecture

    • Base: CLIP’s ViT‑B/32 image encoder + a transformer text encoder.
    • Added a Relation Projection Head that maps the image embedding into a relational subspace.
    • Training objective: contrastive loss that pulls together image pairs whose captions share the same relational template and pushes apart mismatched pairs (a minimal sketch of such an objective follows this list).
  3. Evaluation protocol

    • Relational Similarity Test (RST): 5‑way multiple‑choice where humans pick the image that shares the same relational logic as a query.
    • Correlation with human scores (Spearman’s ρ) and retrieval metrics (Recall@K).
    • Ablation studies on caption anonymization, projection head size, and amount of relational data.
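
As a rough sketch of how the projection head and contrastive objective described above might look, assuming a frozen 512‑dimensional CLIP ViT‑B/32 image embedding and integer labels identifying each caption’s relational template (the class and function names here are illustrative, not the authors’ released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationProjectionHead(nn.Module):
    """Maps a CLIP image embedding into a (hypothetical) relational subspace."""
    def __init__(self, clip_dim=512, rel_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, clip_dim),
            nn.GELU(),
            nn.Linear(clip_dim, rel_dim),
        )

    def forward(self, clip_embeddings):
        # L2-normalize so dot products are cosine similarities
        return F.normalize(self.proj(clip_embeddings), dim=-1)

def relational_contrastive_loss(rel_embeddings, template_ids, temperature=0.07):
    """Supervised-contrastive stand-in: pull together images whose captions share a
    relational template, push apart everything else."""
    sim = rel_embeddings @ rel_embeddings.t() / temperature                 # [B, B] similarity logits
    not_self = ~torch.eye(len(template_ids), dtype=torch.bool,
                          device=rel_embeddings.device)                     # exclude self-pairs
    same_template = (template_ids.unsqueeze(0) == template_ids.unsqueeze(1)) & not_self
    # log-softmax over all non-self candidates for each anchor
    log_prob = sim - torch.logsumexp(sim.masked_fill(~not_self, float("-inf")), dim=1, keepdim=True)
    pos_counts = same_template.sum(dim=1).clamp(min=1)
    per_anchor = -(log_prob * same_template).sum(dim=1) / pos_counts
    return per_anchor[same_template.any(dim=1)].mean()

# Dummy usage: random features standing in for frozen CLIP ViT-B/32 image embeddings
head = RelationProjectionHead()
clip_feats = torch.randn(8, 512)
template_ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])   # relational template of each caption
loss = relational_contrastive_loss(head(clip_feats), template_ids)
loss.backward()
```

The key design choice is that positives are defined by shared relational templates rather than shared objects, which is what pushes the embedding toward relational rather than appearance similarity.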

Results & Findings

Model                   Spearman ρ (RST)   Recall@10 (analogical retrieval)
LPIPS                   0.31               12 %
CLIP (raw)              0.38               18 %
DINO                    0.35               15 %
RelSim‑VL (proposed)    0.57               31 %

  • Human‑aligned relational similarity: RelSim‑VL’s embeddings correlate far more strongly with human judgments than any prior metric.
  • Generalization: Even when presented with completely novel object categories (e.g., “a metallic shell around a liquid core”), the model correctly groups images by relational pattern.
  • Ablation: Removing caption anonymization drops ρ by ~0.08, confirming that the model truly learns relational abstractions rather than memorizing object names.
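
For reference, the two metrics reported above can be computed along these lines (a minimal sketch on placeholder scores using scipy and numpy, not the authors’ evaluation code):

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder data: one entry per query-candidate pair
human_scores = np.array([4.5, 1.0, 3.2, 2.8, 0.5])       # human relational similarity ratings
model_scores = np.array([0.81, 0.12, 0.64, 0.55, 0.20])  # cosine similarities from the model

rho, _ = spearmanr(human_scores, model_scores)
print(f"Spearman rho: {rho:.2f}")

def recall_at_k(query_embs, gallery_embs, correct_idx, k=10):
    """Fraction of queries whose ground-truth analog appears in the top-k neighbours.
    Rows of both matrices are assumed L2-normalized so dot products are cosine similarities."""
    sims = query_embs @ gallery_embs.T            # [num_queries, num_gallery]
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [correct_idx[i] in topk[i] for i in range(len(correct_idx))]
    return float(np.mean(hits))
```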

Practical Implications

  • Content‑based image search – Retrieve images that share the same structural logic (e.g., “layered architecture”) even if they look different. Example: a designer looking for “nested packaging” concepts finds photos of onions, Russian dolls, and geological cross‑sections.
  • Robotics & scene understanding – Reason about affordances and manipulation steps by matching relational patterns rather than exact objects. Example: a robot trained on “grasp the outer shell to expose the inner component” can transfer the skill from a fruit to a mechanical device.
  • Creative AI (storyboarding, game design) – Generate or retrieve assets that satisfy a narrative relational constraint (e.g., “hero’s shield protects the vulnerable core”). Example: automated asset recommendation for a level designer building a “protect‑the‑core” puzzle.
  • Education & analogical reasoning tools – Provide visual analogies that reinforce relational thinking (e.g., Earth–peach, solar system–atom). Example: an interactive app that shows students pairs of images linked by relational similarity, fostering deeper conceptual links.
  • Medical imaging – Detect similar pathological structures across modalities (e.g., “central lesion surrounded by edema”) regardless of tissue contrast. Example: aid radiologists in finding analogous cases across CT, MRI, and ultrasound.

By exposing a relational similarity signal, developers can build systems that think about images the way humans do—by the roles and functions of parts, not just by pixel‑level similarity.
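
In practice, such a system could be a simple nearest‑neighbour search over precomputed relational embeddings. The sketch below assumes those embeddings already exist (e.g., produced by a head like the one sketched under Methodology); the function names are illustrative.

```python
import numpy as np

def build_relational_index(embeddings):
    """L2-normalize per-image relational embeddings so dot products become cosine similarities."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-8, None)

def relational_search(query_embedding, index, k=5):
    """Return indices and scores of the k images whose relational structure best matches the query."""
    q = query_embedding / max(np.linalg.norm(query_embedding), 1e-8)
    scores = index @ q
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# A query photo of a peach cross-section could then surface Earth cutaways,
# nested packaging, or Russian dolls, provided their relational embeddings are close.
```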

Limitations & Future Work

  • Dataset bias: The relational captions are limited to the visual concepts present in the source image pools; rare or highly abstract relations may be under‑represented.
  • Dependence on language supervision: The model inherits CLIP’s reliance on large‑scale text data; purely visual relational learning (e.g., self‑supervised graph extraction) remains unexplored.
  • Scalability of fine‑tuning: Training the Relation Projection Head requires a substantial GPU budget; lighter‑weight adapters could make the approach more accessible.
  • Evaluation scope: Current benchmarks focus on static images; extending relational similarity to video (temporal relations) or 3‑D scenes is an open direction.

Future research could explore self‑supervised relational graph learning, cross‑modal relational reasoning (e.g., linking text narratives to visual structures), and real‑time relational retrieval pipelines for large‑scale image databases.

Authors

  • Thao Nguyen
  • Sicheng Mo
  • Krishna Kumar Singh
  • Yilin Wang
  • Jing Shi
  • Nicholas Kolkin
  • Eli Shechtman
  • Yong Jae Lee
  • Yuheng Li

Paper Information

  • arXiv ID: 2512.07833v1
  • Categories: cs.CV, cs.AI, cs.LG
  • Published: December 8, 2025