[Paper] Cross-modal Retrieval Models for Stripped Binary Analysis
Source: arXiv - 2512.10393v1
Overview
The paper presents BinSeek, a novel two‑stage cross‑modal retrieval system that lets analysts search massive collections of stripped binary functions using natural‑language queries. By bridging the gap between raw binary code (which lacks symbols and comments) and human‑readable descriptions, BinSeek makes large‑scale binary analysis far more interactive and practical for security‑focused workflows.
Key Contributions
- First cross‑modal retrieval framework for stripped binaries – introduces a pipeline that directly maps binary code to natural‑language semantics without relying on source‑level information.
- BinSeekEmbedding model – trained on a massive, synthetically generated dataset to learn joint embeddings of binary snippets and textual descriptions.
- BinSeek‑Reranker – a second‑stage model that refines the top‑k candidates using context augmentation, dramatically improving relevance judgments.
- LLM‑driven data synthesis pipeline – automatically creates high‑quality binary‑text pairs at scale, eliminating the need for costly manual annotation.
- New benchmark for stripped binary retrieval – provides a standardized dataset and evaluation metrics for future research in this niche.
- State‑of‑the‑art performance – outperforms same‑scale baselines by 31.42 % in Recall@3 and 27.17 % in MRR@3, and even surpasses a general‑purpose model with 16× as many parameters.
Methodology
- Data Generation – An LLM is prompted to generate natural‑language descriptions for a large corpus of stripped, symbol‑free compiled functions. The pipeline also injects realistic variation (e.g., different compiler flags and optimization levels) to improve robustness; a prompt‑construction sketch follows this list.
- Embedding Stage (BinSeekEmbedding) – A transformer‑based encoder processes the raw binary bytes (treated as a token sequence) and the textual description in parallel, learning a shared latent space where semantically related pairs lie close together. A contrastive loss drives the alignment (see the loss sketch below).
- Candidate Retrieval – At query time, the description is encoded, and a fast approximate nearest‑neighbor search (e.g., FAISS) returns the top‑k binary functions (see the retrieval sketch below).
- Reranking Stage (BinSeek‑Reranker) – The top‑k candidates are fed into a second transformer that incorporates context augmentation (e.g., surrounding functions, control‑flow graphs) to produce a refined relevance score; this refined ranking is what the user sees (see the reranking sketch below).
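Below is a minimal sketch of the description‑synthesis step. The paper's exact prompts and choice of LLM are not reproduced here, so the prompt text and the `call_llm` hook are illustrative assumptions; any hosted or local model wrapper can fill the hook.

```python
# Hypothetical prompt for turning one stripped function into a description.
PROMPT_TEMPLATE = """You are a reverse engineer. In one sentence, describe
what the following stripped binary function does. Focus on observable
behavior, not register-level details.

Disassembly:
{disassembly}
"""

def build_prompt(disassembly: str) -> str:
    """Fill the synthesis prompt for one stripped function."""
    return PROMPT_TEMPLATE.format(disassembly=disassembly)

def synthesize_pairs(functions, call_llm):
    """Produce (binary, description) training pairs.

    `call_llm` is any callable mapping a prompt string to generated text,
    e.g. a thin wrapper around a hosted or local model API.
    """
    return [(fn, call_llm(build_prompt(fn))) for fn in functions]
```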
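A minimal sketch of the contrastive alignment objective, assuming a standard symmetric InfoNCE formulation in PyTorch (the paper's exact loss and hyperparameters may differ):

```python
import torch
import torch.nn.functional as F

def info_nce(binary_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of (binary, description) pairs.

    binary_emb, text_emb: (B, D) outputs of the two encoders; row i of each
    tensor comes from the same pair, so the (B, B) similarity matrix has its
    positives on the diagonal and in-batch negatives everywhere else.
    """
    b = F.normalize(binary_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = b @ t.T / temperature
    targets = torch.arange(b.size(0), device=b.device)
    # Average the binary->text and text->binary retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```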
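A sketch of candidate retrieval with FAISS. `IndexFlatIP` performs exact inner‑product search; the paper's actual index type is unspecified, and an approximate index (e.g., IVF or HNSW) would be the usual swap at million‑function scale:

```python
import faiss
import numpy as np

def build_index(function_embeddings: np.ndarray) -> faiss.Index:
    """Index L2-normalized embeddings so inner product equals cosine similarity."""
    xb = np.ascontiguousarray(function_embeddings, dtype="float32")
    faiss.normalize_L2(xb)
    index = faiss.IndexFlatIP(xb.shape[1])
    index.add(xb)
    return index

def retrieve_top_k(index: faiss.Index, query_embedding: np.ndarray, k: int = 50):
    """Return (scores, ids) of the k most similar binary functions for one query."""
    xq = np.ascontiguousarray(query_embedding.reshape(1, -1), dtype="float32")
    faiss.normalize_L2(xq)
    scores, ids = index.search(xq, k)
    return scores[0], ids[0]
```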
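And a sketch of the reranking stage, with the cross‑encoder abstracted behind a `score_fn` callable, since neither the model nor its context‑augmentation format is reproduced here:

```python
def rerank(query, candidates, score_fn, k=10):
    """Re-score first-stage candidates with a (query, context) cross-encoder.

    candidates : list of (function_id, context) pairs, where context bundles
                 the candidate's code plus augmentation (e.g., neighboring
                 functions or a control-flow summary).
    score_fn   : callable (query, context) -> float relevance score.
    """
    scored = sorted(((score_fn(query, ctx), fid) for fid, ctx in candidates),
                    reverse=True)
    return [(fid, score) for score, fid in scored[:k]]
```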
The whole pipeline runs end‑to‑end on commodity GPUs and can be integrated into existing LLM‑agent security tools.
Results & Findings
| Metric | BinSeek | Same‑scale baseline | 16× larger general model |
|---|---|---|---|
| Recall@3 | 0.84 | 0.64 | 0.71 |
| MRR@3 | 0.78 | 0.61 | 0.68 |
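For reference, the reported metrics follow their standard definitions (nothing paper‑specific is assumed): with rank_q the position of query q's ground‑truth function in the returned list,

$$
\mathrm{Recall@}k = \frac{1}{|Q|}\sum_{q\in Q}\mathbb{1}\!\left[\mathrm{rank}_q \le k\right],
\qquad
\mathrm{MRR@}k = \frac{1}{|Q|}\sum_{q\in Q}\frac{\mathbb{1}\!\left[\mathrm{rank}_q \le k\right]}{\mathrm{rank}_q}.
$$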
- 31.42 % relative gain in Recall@3 and 27.17 % in MRR@3 over the strongest same‑scale baseline.
- The reranker contributes roughly 12 % of the total improvement, confirming that context matters even when the embedding already captures semantic similarity.
- Ablation studies show that synthetic data quality (LLM prompt engineering) directly correlates with downstream retrieval performance.
Practical Implications
- Security tooling – Integrate BinSeek into vulnerability scanners or malware analysis platforms to let analysts type “function that decrypts network traffic” and instantly retrieve matching stripped binaries.
- LLM‑agent workflows – Agents can now fetch concrete code snippets as evidence for their reasoning, improving explainability and reducing false positives.
- Reverse‑engineering automation – Large codebases (e.g., firmware images) can be indexed once, then queried repeatedly without re‑disassembly, saving hours of manual work.
- Cross‑team collaboration – Developers can share natural‑language tags for binary components, enabling a “search‑by‑description” experience similar to code search in source‑level repositories.
Limitations & Future Work
- Synthetic bias – The training data is fully generated by LLMs, which may miss rare or highly obfuscated patterns found in real malware.
- Binary diversity – Current experiments focus on x86/ARM binaries compiled with common toolchains; exotic architectures or custom packers remain untested.
- Scalability of reranking – While the first stage scales to millions of functions, the reranker still requires GPU inference for each query’s top‑k, which could be a bottleneck in ultra‑large repositories.
- Future directions suggested by the authors include incorporating dynamic analysis traces, expanding to multi‑modal inputs (e.g., decompiled pseudo‑code), and fine‑tuning on real‑world annotated datasets to close the synthetic‑real gap.
Authors
- Guoqiang Chen
- Lingyun Ying
- Ziyang Song
- Daguang Liu
- Qiang Wang
- Zhiqi Wang
- Li Hu
- Shaoyin Cheng
- Weiming Zhang
- Nenghai Yu
Paper Information
- arXiv ID: 2512.10393v1
- Categories: cs.SE, cs.AI
- Published: December 11, 2025
- PDF: https://arxiv.org/pdf/2512.10393v1