[Paper] Relational Visual Similarity

Published: December 8, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.07833v1

Overview

The paper “Relational Visual Similarity” tackles a blind spot in today’s computer‑vision toolbox: existing similarity metrics (LPIPS, CLIP, DINO, etc.) compare images only by their surface appearance, ignoring the relational structure that humans effortlessly perceive (e.g., the Earth’s crust, mantle, and core map onto a peach’s skin, flesh, and pit). By defining and measuring relational similarity, the authors open a new avenue for connecting images through the logic of their parts rather than their colors or textures.

Key Contributions

  • Formal definition of relational visual similarity – two images are relationally similar when the internal functional relationships among their visual elements correspond, regardless of visual attributes (a toy illustration follows this list).
  • Large‑scale relational caption dataset – 114 k image–caption pairs where captions describe relations (e.g., “outer layer surrounds inner core”) while deliberately anonymizing concrete objects.
  • Fine‑tuned Vision‑Language model (RelSim‑VL) – built on a pre‑trained CLIP backbone, trained to embed images such that relationally similar pairs are close in representation space.
  • Comprehensive evaluation – benchmarked against LPIPS, CLIP, DINO, and human judgments on a new relational similarity test set, showing a 30‑40 % improvement in correlation with human relational judgments.
  • Demonstration of downstream utility – applied RelSim‑VL to tasks like analogical image retrieval, scene‑graph generation, and zero‑shot reasoning, achieving measurable gains over baselines.
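
To make the definition above concrete, here is a toy sketch (not the paper’s formalism) that represents each image as a handful of (part, relation, part) triples and compares only the anonymized relational structure; the triples and the helper name are illustrative assumptions.

```python
# Toy illustration (not the paper's formalism): describe each image as a set of
# (part, relation, part) triples and compare only the relational structure,
# ignoring what the parts actually are.

def relational_signature(triples):
    """Replace concrete part names with anonymous role IDs, keeping only the relations."""
    roles = {}
    signature = set()
    for a, rel, b in triples:
        ra = roles.setdefault(a, f"role_{len(roles)}")
        rb = roles.setdefault(b, f"role_{len(roles)}")
        signature.add((ra, rel, rb))
    return frozenset(signature)

earth = [("crust", "surrounds", "mantle"), ("mantle", "surrounds", "core")]
peach = [("skin", "surrounds", "flesh"), ("flesh", "surrounds", "pit")]
grid  = [("row", "adjacent_to", "row"), ("column", "crosses", "row")]

print(relational_signature(earth) == relational_signature(peach))  # True: same relational structure
print(relational_signature(earth) == relational_signature(grid))   # False: different relations
```

A full treatment would need matching up to arbitrary relabeling of roles (a graph‑isomorphism problem); this toy only works when the two images’ triples are listed in corresponding order.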

Methodology

  1. Dataset construction

    • Started from diverse image collections (COCO, Open Images, etc.).
    • Human annotators wrote relational captions that abstract away from concrete nouns (“A round outer shell encloses a soft interior”) and focus on the role each visual element plays.
    • Captions were “anonymized” (no object names) to force models to learn relational patterns rather than lexical shortcuts.
  2. Model architecture

    • Base: CLIP’s ViT‑B/32 image encoder + a transformer text encoder.
    • Added a Relation Projection Head that maps the image embedding into a relational subspace.
    • Training objective: contrastive loss that pulls together image pairs whose captions share the same relational template and pushes apart mismatched pairs (a minimal sketch of such an objective follows this list).
  3. Evaluation protocol

    • Relational Similarity Test (RST): 5‑way multiple‑choice where humans pick the image that shares the same relational logic as a query.
    • Correlation with human scores (Spearman’s ρ) and retrieval metrics (Recall@K).
    • Ablation studies on caption anonymization, projection head size, and amount of relational data.
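
As a rough sketch of how the projection head and contrastive objective described above might look, assuming a frozen 512‑dimensional CLIP ViT‑B/32 image embedding and integer labels identifying each caption’s relational template (the class and function names here are illustrative, not the authors’ released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationProjectionHead(nn.Module):
    """Maps a CLIP image embedding into a (hypothetical) relational subspace."""
    def __init__(self, clip_dim=512, rel_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, clip_dim),
            nn.GELU(),
            nn.Linear(clip_dim, rel_dim),
        )

    def forward(self, clip_embeddings):
        # L2-normalize so dot products are cosine similarities
        return F.normalize(self.proj(clip_embeddings), dim=-1)

def relational_contrastive_loss(rel_embeddings, template_ids, temperature=0.07):
    """Supervised-contrastive stand-in: pull together images whose captions share a
    relational template, push apart everything else."""
    sim = rel_embeddings @ rel_embeddings.t() / temperature                 # [B, B] similarity logits
    not_self = ~torch.eye(len(template_ids), dtype=torch.bool,
                          device=rel_embeddings.device)                     # exclude self-pairs
    same_template = (template_ids.unsqueeze(0) == template_ids.unsqueeze(1)) & not_self
    # log-softmax over all non-self candidates for each anchor
    log_prob = sim - torch.logsumexp(sim.masked_fill(~not_self, float("-inf")), dim=1, keepdim=True)
    pos_counts = same_template.sum(dim=1).clamp(min=1)
    per_anchor = -(log_prob * same_template).sum(dim=1) / pos_counts
    return per_anchor[same_template.any(dim=1)].mean()

# Dummy usage: random features standing in for frozen CLIP ViT-B/32 image embeddings
head = RelationProjectionHead()
clip_feats = torch.randn(8, 512)
template_ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])   # relational template of each caption
loss = relational_contrastive_loss(head(clip_feats), template_ids)
loss.backward()
```

The key design choice is that positives are defined by shared relational templates rather than shared objects, which is what pushes the embedding toward relational rather than appearance similarity.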

Results & Findings

Model                   Spearman ρ (RST)   Recall@10 (analogical retrieval)
LPIPS                   0.31               12 %
CLIP (raw)              0.38               18 %
DINO                    0.35               15 %
RelSim‑VL (proposed)    0.57               31 %

  • Human‑aligned relational similarity: RelSim‑VL’s embeddings correlate far more strongly with human judgments than any prior metric.
  • Generalization: Even when presented with completely novel object categories (e.g., “a metallic shell around a liquid core”), the model correctly groups images by relational pattern.
  • Ablation: Removing caption anonymization drops ρ by ~0.08, confirming that the model truly learns relational abstractions rather than memorizing object names.
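
For reference, the two metrics reported above can be computed along these lines (a minimal sketch on placeholder scores using scipy and numpy, not the authors’ evaluation code):

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder data: one entry per query-candidate pair
human_scores = np.array([4.5, 1.0, 3.2, 2.8, 0.5])       # human relational similarity ratings
model_scores = np.array([0.81, 0.12, 0.64, 0.55, 0.20])  # cosine similarities from the model

rho, _ = spearmanr(human_scores, model_scores)
print(f"Spearman rho: {rho:.2f}")

def recall_at_k(query_embs, gallery_embs, correct_idx, k=10):
    """Fraction of queries whose ground-truth analog appears in the top-k neighbours.
    Rows of both matrices are assumed L2-normalized so dot products are cosine similarities."""
    sims = query_embs @ gallery_embs.T            # [num_queries, num_gallery]
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [correct_idx[i] in topk[i] for i in range(len(correct_idx))]
    return float(np.mean(hits))
```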

Practical Implications

  • Content‑based image search – Retrieve images that share the same structural logic (e.g., “layered architecture”) even if they look different. Example: a designer looking for “nested packaging” concepts finds photos of onions, Russian dolls, and geological cross‑sections.
  • Robotics & scene understanding – Reason about affordances and manipulation steps by matching relational patterns rather than exact objects. Example: a robot trained on “grasp the outer shell to expose the inner component” can transfer the skill from a fruit to a mechanical device.
  • Creative AI (storyboarding, game design) – Generate or retrieve assets that satisfy a narrative relational constraint (e.g., “hero’s shield protects the vulnerable core”). Example: automated asset recommendation for a level designer building a “protect‑the‑core” puzzle.
  • Education & analogical reasoning tools – Provide visual analogies that reinforce relational thinking (e.g., Earth–peach, solar system–atom). Example: an interactive app that shows students pairs of images linked by relational similarity, fostering deeper conceptual links.
  • Medical imaging – Detect similar pathological structures across modalities (e.g., “central lesion surrounded by edema”) regardless of tissue contrast. Example: aid radiologists in finding analogous cases across CT, MRI, and ultrasound.

By exposing a relational similarity signal, developers can build systems that think about images the way humans do—by the roles and functions of parts, not just by pixel‑level similarity.
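
In practice, such a system could be a simple nearest‑neighbour search over precomputed relational embeddings. The sketch below assumes those embeddings already exist (e.g., produced by a head like the one sketched under Methodology); the function names are illustrative.

```python
import numpy as np

def build_relational_index(embeddings):
    """L2-normalize per-image relational embeddings so dot products become cosine similarities."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-8, None)

def relational_search(query_embedding, index, k=5):
    """Return indices and scores of the k images whose relational structure best matches the query."""
    q = query_embedding / max(np.linalg.norm(query_embedding), 1e-8)
    scores = index @ q
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# A query photo of a peach cross-section could then surface Earth cutaways,
# nested packaging, or Russian dolls, provided their relational embeddings are close.
```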

Limitations & Future Work

  • Dataset bias: The relational captions are limited to the visual concepts present in the source image pools; rare or highly abstract relations may be under‑represented.
  • Dependence on language supervision: The model inherits CLIP’s reliance on large‑scale text data; purely visual relational learning (e.g., self‑supervised graph extraction) remains unexplored.
  • Scalability of fine‑tuning: Training the Relation Projection Head requires a substantial GPU budget; lighter‑weight adapters could make the approach more accessible.
  • Evaluation scope: Current benchmarks focus on static images; extending relational similarity to video (temporal relations) or 3‑D scenes is an open direction.

Future research could explore self‑supervised relational graph learning, cross‑modal relational reasoning (e.g., linking text narratives to visual structures), and real‑time relational retrieval pipelines for large‑scale image databases.

Authors

  • Thao Nguyen
  • Sicheng Mo
  • Krishna Kumar Singh
  • Yilin Wang
  • Jing Shi
  • Nicholas Kolkin
  • Eli Shechtman
  • Yong Jae Lee
  • Yuheng Li

Paper Information

  • arXiv ID: 2512.07833v1
  • Categories: cs.CV, cs.AI, cs.LG
  • Published: December 8, 2025