[Paper] Bridging the Modality Gap in Forensic Image Retrieval

Published: June 10, 2026 at 12:32 PM EDT

Source: arXiv - 2606.12294v1

Overview

Automated image retrieval plays an increasingly critical role in modern forensic analysis, supporting investigative workflows that rely on efficient comparison of visual evidence. While prior work has focused primarily on developing and optimizing multimodal retrieval systems, limited attention has been paid to evaluating the forensic applicability of these technologies across diverse real-world scenarios.

In this study, we present a unified retrieval framework adapted to four key forensic tasks:

(1) tattoo image retrieval given a tattoo query image;
(2) tattoo retrieval guided by human-expert textual descriptions, modelling the common situation where a witness verbally describes a tattoo;
(3) tattoo retrieval from hand-drawn sketches; and
(4) face retrieval from forensic face sketches.

Our system leverages a multimodal large language model (MLLM) to automatically generate structured textual descriptions for all queries and gallery images, followed by sentence-transformer embedding for text-based comparison. We evaluate retrieval using visual-only embeddings, text-only embeddings and a multimodal fusion strategy that combines text- and image-based similarity scores derived from state-of-the-art visual feature extractors relevant to each task.

The fusion of modalities consistently improves retrieval precision and robustness, especially in scenarios where visual information is limited or noisy (e.g., sketches, partial tattoos, or fragmented witness statements). This work highlights the forensic value of a unified multimodal retrieval pipeline and demonstrates how modern MLLMs can operationalize challenging forensic tasks that traditionally rely on manual expert analysis. Our results position multimodal retrieval as a promising tool for supporting investigative workflows involving tattoos, facial composites, and witness descriptions.
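
The fusion of text- and image-based similarity scores can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, so the code below assumes a weighted sum of min-max-normalized cosine similarities. The function names, the weight `alpha`, and the toy vectors are hypothetical stand-ins, not the authors' implementation. In a real system the image vectors would come from a task-specific visual feature extractor and the text vectors from a sentence transformer applied to MLLM-generated descriptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def min_max(scores):
    """Rescale scores to [0, 1] so both modalities share one range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fused_ranking(q_img, q_txt, gallery_img, gallery_txt, alpha=0.5):
    """Rank gallery indices by a weighted sum of image and text similarity.

    alpha is an assumed weight: 1.0 means visual-only, 0.0 means text-only.
    """
    img = min_max([cosine(q_img, g) for g in gallery_img])
    txt = min_max([cosine(q_txt, g) for g in gallery_txt])
    fused = [alpha * i + (1 - alpha) * t for i, t in zip(img, txt)]
    # Highest fused score first; ties keep gallery order.
    return sorted(range(len(fused)), key=lambda k: fused[k], reverse=True)


# Toy embeddings (made up for illustration): item 2 is a moderate match in
# both modalities, while items 0 and 1 each match in only one.
gallery_img = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gallery_txt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranking = fused_ranking([1.0, 0.0], [0.0, 1.0], gallery_img, gallery_txt)
```

Setting `alpha` to 1.0 or 0.0 recovers the visual-only and text-only settings, which makes the same function usable for all three evaluation modes mentioned in the overview.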

Key Contributions

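Based on the abstract, the paper's main contributions are:

A unified retrieval framework adapted to four forensic tasks: tattoo-to-tattoo retrieval, tattoo retrieval from expert textual descriptions, tattoo retrieval from hand-drawn sketches, and face retrieval from forensic face sketches.
Automatic generation of structured textual descriptions for all queries and gallery images with an MLLM, embedded with a sentence transformer for text-based comparison.
A comparison of visual-only, text-only, and fused text-and-image retrieval, in which fusion is reported to improve precision and robustness when visual information is limited or noisy.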

Methodology

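The abstract describes the pipeline only at a high level: an MLLM generates a structured textual description for every query and gallery image, a sentence transformer embeds those descriptions, and the resulting text similarity is evaluated both alone and fused with image similarity from task-specific visual feature extractors. Please refer to the full paper for the detailed methodology.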

Practical Implications

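The authors position multimodal retrieval as a promising tool for supporting investigative workflows involving tattoos, facial composites, and witness descriptions, which are tasks that traditionally rely on manual expert analysis.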

Authors

Ricardo González-Gazapo
Annette Morales-González
Yoanna Martínez-Díaz
Heydi Méndez-Vázquez
Milton García-Borroto

Paper Information

arXiv ID: 2606.12294v1
Categories: cs.CV, eess.IV
Published: June 10, 2026
