[Paper] Identifying Adversary Tactics and Techniques in Malware Binaries with an LLM Agent

Published: February 5, 2026 at 09:42 PM EST

Source: arXiv - 2602.06325v1

Overview

The paper introduces TTPDetect, which the authors present as the first large‑language‑model (LLM) agent that automatically identifies tactics, techniques, and procedures (TTPs) in stripped malware binaries. It combines dense code retrieval with on‑demand LLM reasoning to connect raw binary analysis to actionable threat intelligence, reaching 87.37 % precision on real‑world samples and recovering 85.7 % of expert‑written TTPs.

Key Contributions

LLM‑based malware TTP agent – First end‑to‑end system that uses an LLM as an “analysis assistant” to map decompiled functions to ATT&CK‑style TTPs.
Hybrid retrieval pipeline – Combines traditional dense vector retrieval with LLM‑guided neural retrieval to efficiently locate promising entry‑point functions in massive, symbol‑less binaries.
Context Explorer – A function‑level agent that incrementally pulls in surrounding code (call‑graph, data‑flow, control‑flow) only when needed, keeping the LLM prompt size manageable.
TTP‑Specific Reasoning Guideline – A set of inference‑time prompts that steer the LLM toward ATT&CK‑aligned decision logic, reducing hallucinations.
New labeled dataset – Over 30 k decompiled functions from diverse malware families (Windows, Linux, Android) annotated with ATT&CK TTPs, released for reproducibility.
Strong empirical results – Over 93 % precision and recall on function‑level TTP detection and 87.37 % precision on full‑sample evaluation, exceeding the static‑analysis baseline by 10.38 % in precision and 18.78 % in recall.
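
The Context Explorer idea listed above (pull in surrounding code only when the model asks for it) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the call graph, the function names, the one-line bodies, and the character budget are hypothetical, and the paper's agent also uses data-flow and control-flow context that this toy version ignores.

```python
from collections import deque

# Hypothetical call graph: function name -> callees (not taken from the paper).
CALL_GRAPH = {
    "sub_401000": ["sub_401200", "sub_401300"],
    "sub_401200": ["sub_401400"],
    "sub_401300": [],
    "sub_401400": [],
}

# Hypothetical decompiled bodies, shortened to one line each.
BODIES = {
    "sub_401000": "handle = sub_401200(pid); sub_401300(handle);",
    "sub_401200": "return sub_401400(pid, 0x1F0FFF);",
    "sub_401300": "write_dump(handle);",
    "sub_401400": "return OpenProcess(access, 0, pid);",
}


class ContextExplorer:
    """Adds one function at a time to the prompt context, within a size budget."""

    def __init__(self, call_graph, bodies, max_chars):
        self.call_graph = call_graph
        self.bodies = bodies
        self.max_chars = max_chars
        self.included = []        # functions already placed in the prompt
        self.frontier = deque()   # functions that could be added next
        self.used_chars = 0

    def start(self, function):
        """Begin with the candidate function only."""
        self.frontier.append(function)
        return self.expand()

    def expand(self):
        """Add the next pending function; return False if nothing was added."""
        while self.frontier:
            name = self.frontier.popleft()
            if name in self.included:
                continue
            cost = len(self.bodies[name])
            if self.used_chars + cost > self.max_chars:
                self.frontier.appendleft(name)  # keep it pending, budget is full
                return False
            self.included.append(name)
            self.used_chars += cost
            self.frontier.extend(self.call_graph.get(name, []))
            return True
        return False

    def prompt_context(self):
        """Render the included functions as text for the model prompt."""
        return "\n".join(f"// {n}\n{self.bodies[n]}" for n in self.included)


explorer = ContextExplorer(CALL_GRAPH, BODIES, max_chars=1000)
explorer.start("sub_401000")   # prompt holds only the candidate
explorer.expand()              # the model asked for more: one callee is added
print(explorer.prompt_context())
```

Calling expand() only in response to a model request is what keeps the prompt small; a character count stands in for a real token count here.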

Methodology

Pre‑processing & Decompilation – Raw binaries are stripped of symbols, then decompiled (e.g., using Ghidra/IDA) into a function‑level intermediate representation.
Dense Retrieval – Each function is embedded with a code‑specific encoder (e.g., CodeBERT). A nearest‑neighbor search quickly narrows the candidate set to the top‑k functions that look “malicious.”
Neural Retrieval with LLM – The LLM receives the query (“Find functions that implement credential dumping”) and the top‑k candidates, re‑ranking them based on its internal code understanding.
Context Explorer Agent – For a selected candidate, the agent lazily expands the context: it pulls the caller/callee functions, relevant data structures, and control‑flow snippets only when the LLM asks for more information. This keeps prompts short while still providing the full reasoning picture.
TTP‑Specific Reasoning Guideline – A prompt template encodes ATT&CK definitions, typical code patterns, and decision thresholds. The LLM follows this guideline to output a TTP label (or “none”).
Iterative Refinement – The system repeats the neural retrieval, context exploration, and guided reasoning steps for each high‑scoring function, then aggregates the resulting TTP labels at the binary level.
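
The flow of the steps above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the three-dimensional vectors stand in for real code embeddings, stub_rerank and stub_classify are hypothetical placeholders for the LLM re-ranking and the guided LLM verdict, and the label T1003 (OS Credential Dumping in ATT&CK) is assigned by hand.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dense_top_k(query_vec, function_vecs, k):
    """Dense retrieval: rank functions by similarity to the query, keep k."""
    ranked = sorted(
        function_vecs,
        key=lambda name: cosine(query_vec, function_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


def analyze_binary(query_vec, function_vecs, k, rerank, classify):
    """Retrieve, re-rank, label each candidate, and aggregate per binary."""
    candidates = rerank(dense_top_k(query_vec, function_vecs, k))
    ttps = defaultdict(list)
    for name in candidates:
        label = classify(name)   # placeholder for context exploration + LLM call
        if label is not None:    # None stands for the "none" verdict
            ttps[label].append(name)
    return dict(ttps)


# Toy embeddings (assumed values, for illustration only).
FUNCTION_VECS = {
    "sub_401000": [0.9, 0.1, 0.0],
    "sub_402000": [0.1, 0.9, 0.0],
    "sub_403000": [0.8, 0.2, 0.1],
    "sub_404000": [0.0, 0.1, 0.9],
}
QUERY_VEC = [1.0, 0.0, 0.0]  # stands in for an embedded "credential dumping" query


def stub_rerank(names):
    """Hypothetical re-ranker: keeps the dense order unchanged."""
    return list(names)


def stub_classify(name):
    """Hypothetical verdicts: two functions labeled T1003, the rest 'none'."""
    return {"sub_401000": "T1003", "sub_403000": "T1003"}.get(name)


report = analyze_binary(QUERY_VEC, FUNCTION_VECS, 3, stub_rerank, stub_classify)
print(report)
```

Keeping the evidence functions next to each label means the binary-level report can still point an analyst back to the code that triggered it.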

Results & Findings

Metric	Function‑level (test set)	Full‑sample (real malware)
Precision	93.25 %	87.37 %
Recall	93.81 %	—
F1	93.53 %	—
Gain over static‑analysis baseline	+10.38 % precision, +18.78 % recall	—
Recovery of expert‑written TTPs	—	85.7 %
New TTPs discovered per sample	—	10.5 (average)

Takeaway: On real malware, TTPDetect recovers 85.7 % of the TTPs documented in expert‑written reports and surfaces an average of 10.5 additional TTPs per sample, which supports its use for threat‑intel enrichment.
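
As a quick consistency check on the function-level numbers reported above, F1 is the harmonic mean of precision and recall. The helper name f1_score is just a local function defined for this check.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Function-level precision and recall as reported in the table (in percent).
f1 = f1_score(93.25, 93.81)
print(f"{f1:.2f}")  # prints 93.53
```

The result agrees with the reported function-level F1.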

Practical Implications

Automated Threat Intel Generation – Security teams can feed newly captured binaries into TTPDetect and receive ATT&CK‑aligned TTP reports automatically, reducing the manual reverse‑engineering effort needed per sample.
Prioritization of Incident Response – By surfacing high‑impact techniques (e.g., credential dumping, lateral movement), analysts can triage alerts more effectively.
Integration with SIEM/EDR – The function‑level TTP tags can be exported as structured indicators (STIX/TAXII), feeding downstream detection rules and behavioral analytics.
Malware Family Attribution – Consistent TTP fingerprints across samples help cluster unknown binaries into existing campaigns, aiding attribution and proactive defense.
Open‑source Research Catalyst – The released dataset and retrieval pipeline provide a baseline for future work on LLM‑driven binary analysis, encouraging community extensions (e.g., multi‑modal models that ingest raw bytes).
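
To make the SIEM/EDR integration point above more concrete, here is a rough sketch of wrapping TTP labels in STIX 2.1 attack-pattern objects using only the standard library. The summary does not describe an export format for TTPDetect, so to_stix_bundle, the input dictionary, and the use of the description field for evidence functions are all assumptions. A production export would more likely reference the existing ATT&CK STIX objects than mint new identifiers.

```python
import json
import uuid
from datetime import datetime, timezone

# ATT&CK technique names for the IDs used in this example.
TECHNIQUE_NAMES = {
    "T1003": "OS Credential Dumping",
    "T1021": "Remote Services",
}


def to_stix_bundle(ttp_to_functions, names):
    """Wrap TTP labels in a STIX 2.1 bundle of attack-pattern objects."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")
    objects = []
    for technique_id, functions in sorted(ttp_to_functions.items()):
        objects.append({
            "type": "attack-pattern",
            "spec_version": "2.1",
            "id": f"attack-pattern--{uuid.uuid4()}",
            "created": now,
            "modified": now,
            "name": names.get(technique_id, technique_id),
            # Assumption: evidence functions are recorded in the description.
            "description": "Evidence functions: " + ", ".join(functions),
            "external_references": [
                {"source_name": "mitre-attack", "external_id": technique_id}
            ],
        })
    return {"type": "bundle", "id": f"bundle--{uuid.uuid4()}", "objects": objects}


# Hypothetical function-level tags for one sample.
tags = {"T1003": ["sub_401000", "sub_403000"], "T1021": ["sub_405000"]}
bundle = to_stix_bundle(tags, TECHNIQUE_NAMES)
print(json.dumps(bundle, indent=2))
```

The resulting JSON could then be pushed to a TAXII server or loaded by any tool that accepts STIX bundles.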

Limitations & Future Work

Dependence on Decompilation Quality – Stripped binaries with heavy obfuscation may yield inaccurate function boundaries, limiting the agent’s recall.
Prompt Length Constraints – Although the Context Explorer mitigates this, extremely large call graphs can still exceed model context windows.
LLM Hallucination Risk – Despite the reasoning guideline, occasional false TTP assignments occur, especially for novel or hybrid techniques not seen during training.
Platform Coverage – The current evaluation focuses on Windows, Linux, and Android; extending to IoT firmware or macOS binaries remains open.
Dynamic Behavior Fusion – Future versions could combine static LLM reasoning with dynamic execution traces (e.g., sandbox logs) for richer TTP inference.

Bottom line: TTPDetect showcases how LLM agents, when paired with smart retrieval and domain‑specific prompting, can transform raw, symbol‑less malware binaries into actionable threat intelligence, an advancement that promises to accelerate defensive workflows across the security industry.

Authors

Zhou Xuan
Xiangzhe Xu
Mingwei Zheng
Louis Zheng-Hua Tan
Jinyao Guo
Tiantai Zhang
Le Yu
Chengpeng Wang
Xiangyu Zhang

Paper Information

arXiv ID: 2602.06325v1
Categories: cs.CR, cs.SE
Published: February 6, 2026
