[Paper] Cross-modal Retrieval Models for Stripped Binary Analysis
Source: arXiv - 2512.10393v1
Overview
The paper presents BinSeek, a novel two‑stage cross‑modal retrieval system that lets analysts search massive collections of stripped binary functions using natural‑language queries. By bridging the gap between raw binary code (which lacks symbols and comments) and human‑readable descriptions, BinSeek makes large‑scale binary analysis far more interactive and practical for security‑focused workflows.
Key Contributions
- First cross‑modal retrieval framework for stripped binaries – introduces a pipeline that directly maps binary code to natural‑language semantics without relying on source‑level information.
- BinSeekEmbedding model – trained on a massive, synthetically generated dataset to learn joint embeddings of binary snippets and textual descriptions.
- BinSeek‑Reranker – a second‑stage model that refines the top‑k candidates using context augmentation, dramatically improving relevance judgments.
- LLM‑driven data synthesis pipeline – automatically creates high‑quality binary‑text pairs at scale, eliminating the need for costly manual annotation.
- New benchmark for stripped binary retrieval – provides a standardized dataset and evaluation metrics for future research in this niche.
- State‑of‑the‑art performance – outperforms same‑scale baselines by 31.42 % in Recall@3 and 27.17 % in MRR@3, and even surpasses a general‑purpose model with 16× as many parameters.
Methodology
- Data Generation – An LLM is prompted to generate natural‑language descriptions for a large corpus of stripped, symbol‑free compiled functions. The pipeline also injects realistic variation (e.g., different compiler flags and optimization levels) to improve robustness; a prompt‑construction sketch follows this list.
- Embedding Stage (BinSeekEmbedding) – A transformer‑based encoder processes the raw binary bytes (treated as a token sequence) and the textual description in parallel, learning a shared latent space where semantically related pairs lie close together. A contrastive loss drives the alignment (see the loss sketch below).
- Candidate Retrieval – At query time, the description is encoded, and a fast approximate nearest‑neighbor search (e.g., FAISS) returns the top‑k binary functions (see the retrieval sketch below).
- Reranking Stage (BinSeek‑Reranker) – The top‑k candidates are fed into a second transformer that incorporates context augmentation (e.g., surrounding functions, control‑flow graphs) to produce a refined relevance score; this refined ranking is what the user sees (see the reranking sketch below).
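Below is a minimal sketch of the description‑synthesis step. The paper's exact prompts and choice of LLM are not reproduced here, so the prompt text and the `call_llm` hook are illustrative assumptions; any hosted or local model wrapper can fill the hook.

```python
# Hypothetical prompt for turning one stripped function into a description.
PROMPT_TEMPLATE = """You are a reverse engineer. In one sentence, describe
what the following stripped binary function does. Focus on observable
behavior, not register-level details.

Disassembly:
{disassembly}
"""

def build_prompt(disassembly: str) -> str:
    """Fill the synthesis prompt for one stripped function."""
    return PROMPT_TEMPLATE.format(disassembly=disassembly)

def synthesize_pairs(functions, call_llm):
    """Produce (binary, description) training pairs.

    `call_llm` is any callable mapping a prompt string to generated text,
    e.g. a thin wrapper around a hosted or local model API.
    """
    return [(fn, call_llm(build_prompt(fn))) for fn in functions]
```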
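A minimal sketch of the contrastive alignment objective, assuming a standard symmetric InfoNCE formulation in PyTorch (the paper's exact loss and hyperparameters may differ):

```python
import torch
import torch.nn.functional as F

def info_nce(binary_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of (binary, description) pairs.

    binary_emb, text_emb: (B, D) outputs of the two encoders; row i of each
    tensor comes from the same pair, so the (B, B) similarity matrix has its
    positives on the diagonal and in-batch negatives everywhere else.
    """
    b = F.normalize(binary_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = b @ t.T / temperature
    targets = torch.arange(b.size(0), device=b.device)
    # Average the binary->text and text->binary retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```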
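A sketch of candidate retrieval with FAISS. `IndexFlatIP` performs exact inner‑product search; the paper's actual index type is unspecified, and an approximate index (e.g., IVF or HNSW) would be the usual swap at million‑function scale:

```python
import faiss
import numpy as np

def build_index(function_embeddings: np.ndarray) -> faiss.Index:
    """Index L2-normalized embeddings so inner product equals cosine similarity."""
    xb = np.ascontiguousarray(function_embeddings, dtype="float32")
    faiss.normalize_L2(xb)
    index = faiss.IndexFlatIP(xb.shape[1])
    index.add(xb)
    return index

def retrieve_top_k(index: faiss.Index, query_embedding: np.ndarray, k: int = 50):
    """Return (scores, ids) of the k most similar binary functions for one query."""
    xq = np.ascontiguousarray(query_embedding.reshape(1, -1), dtype="float32")
    faiss.normalize_L2(xq)
    scores, ids = index.search(xq, k)
    return scores[0], ids[0]
```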
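And a sketch of the reranking stage, with the cross‑encoder abstracted behind a `score_fn` callable, since neither the model nor its context‑augmentation format is reproduced here:

```python
def rerank(query, candidates, score_fn, k=10):
    """Re-score first-stage candidates with a (query, context) cross-encoder.

    candidates : list of (function_id, context) pairs, where context bundles
                 the candidate's code plus augmentation (e.g., neighboring
                 functions or a control-flow summary).
    score_fn   : callable (query, context) -> float relevance score.
    """
    scored = sorted(((score_fn(query, ctx), fid) for fid, ctx in candidates),
                    reverse=True)
    return [(fid, score) for score, fid in scored[:k]]
```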
The whole pipeline runs end‑to‑end on commodity GPUs and can be integrated into existing LLM‑agent security tools.
Results & Findings
| Metric | BinSeek | Same‑scale baseline | 16× larger general model |
|---|---|---|---|
| Recall@3 | 0.84 | 0.64 | 0.71 |
| MRR@3 | 0.78 | 0.61 | 0.68 |
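For reference, the reported metrics follow their standard definitions (nothing paper‑specific is assumed): with rank_q the position of query q's ground‑truth function in the returned list,

$$
\mathrm{Recall@}k = \frac{1}{|Q|}\sum_{q\in Q}\mathbb{1}\!\left[\mathrm{rank}_q \le k\right],
\qquad
\mathrm{MRR@}k = \frac{1}{|Q|}\sum_{q\in Q}\frac{\mathbb{1}\!\left[\mathrm{rank}_q \le k\right]}{\mathrm{rank}_q}.
$$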
- 31.42 % relative gain in Recall@3 and 27.17 % in MRR@3 over the strongest same‑scale baseline.
- The reranker contributes roughly 12 % of the total improvement, confirming that context matters even when the embedding already captures semantic similarity.
- Ablation studies show that synthetic data quality (LLM prompt engineering) directly correlates with downstream retrieval performance.
Practical Implications
- Security tooling – Integrate BinSeek into vulnerability scanners or malware analysis platforms to let analysts type “function that decrypts network traffic” and instantly retrieve matching stripped binaries.
- LLM‑agent workflows – Agents can now fetch concrete code snippets as evidence for their reasoning, improving explainability and reducing false positives.
- Reverse‑engineering automation – Large codebases (e.g., firmware images) can be indexed once, then queried repeatedly without re‑disassembly, saving hours of manual work.
- Cross‑team collaboration – Developers can share natural‑language tags for binary components, enabling a “search‑by‑description” experience similar to code search in source‑level repositories.
Limitations & Future Work
- Synthetic bias – The training data is fully generated by LLMs, which may miss rare or highly obfuscated patterns found in real malware.
- Binary diversity – Current experiments focus on x86/ARM binaries compiled with common toolchains; exotic architectures or custom packers remain untested.
- Scalability of reranking – While the first stage scales to millions of functions, the reranker still requires GPU inference for each query’s top‑k, which could be a bottleneck in ultra‑large repositories.
- Future directions suggested by the authors include incorporating dynamic analysis traces, expanding to multi‑modal inputs (e.g., decompiled pseudo‑code), and fine‑tuning on real‑world annotated datasets to close the synthetic‑real gap.
Authors
- Guoqiang Chen
- Lingyun Ying
- Ziyang Song
- Daguang Liu
- Qiang Wang
- Zhiqi Wang
- Li Hu
- Shaoyin Cheng
- Weiming Zhang
- Nenghai Yu
Paper Information
- arXiv ID: 2512.10393v1
- Categories: cs.SE, cs.AI
- Published: December 11, 2025
- PDF: https://arxiv.org/pdf/2512.10393v1