[Paper] GPTrace: Effective Crash Deduplication Using LLM Embeddings
Source: arXiv - 2512.01609v1
Overview
Fuzzing is a go‑to technique for hunting bugs, but a successful fuzzing run can generate hundreds of thousands of crash inputs, most of which are just different manifestations of the same underlying defect. The paper GPTrace: Effective Crash Deduplication Using LLM Embeddings shows how to cut through that noise by using modern large language models (LLMs) to compare crash data and automatically group duplicates. The result is a faster, more reliable way to surface the truly unique bugs that matter to developers.
Key Contributions
- LLM‑based similarity metric: Introduces a workflow that turns crash‑related artifacts (stack traces, logs, source snippets) into dense embedding vectors using a pretrained LLM, enabling a semantic similarity measure that goes beyond exact string matching.
- Clustering pipeline: Couples the embeddings with a scalable clustering algorithm (e.g., DBSCAN/HDBSCAN) to automatically form crash groups without hand‑tuned thresholds.
- Large‑scale evaluation: Benchmarks the approach on >300 k crash inputs from 14 real‑world programs, covering 50 ground‑truth bug labels, and demonstrates a measurable boost in deduplication accuracy over traditional stack‑trace comparison and recent state‑of‑the‑art methods.
- Open‑source prototype: Provides a reference implementation (GPTrace) that can be plugged into existing fuzzing pipelines with minimal friction.
Methodology
- Data collection – For each crashing input the authors gather a set of textual artifacts: the raw stack trace, the sanitized backtrace, any associated error messages, and (optionally) a short code excerpt around the faulting address.
- Embedding generation – These artifacts are fed to a pretrained large language model (e.g., OpenAI's text‑embedding‑ada‑002), which converts each artifact into a high‑dimensional vector that captures its semantic meaning.
- Vector aggregation – Vectors from the different artifacts of the same crash are concatenated or averaged to produce a single "crash fingerprint."
- Similarity & clustering – Pairwise cosine similarity between fingerprints is computed, and a density‑based clustering algorithm groups crashes that are close in the embedding space.
- Label inference – Each cluster is treated as a deduplicated bug; the size of a cluster indicates how many raw inputs map to the same root cause.
The pipeline is deliberately modular: any LLM that can emit embeddings and any clustering algorithm that scales to large datasets can be swapped in.
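To make the pipeline shape concrete, here is a minimal, self-contained sketch in Python. It substitutes a trigram-hashing vector for the real LLM embedding call and a greedy single-link grouping for DBSCAN/HDBSCAN, so it mirrors the workflow's structure rather than the paper's actual components; the crash-trace strings are hypothetical.

```python
import hashlib
import math

def embed(text: str, dim: int = 256) -> list[float]:
    """Stand-in for an LLM embedding call (the paper uses a model such as
    text-embedding-ada-002): hash character trigrams into a fixed vector."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        bucket = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec

def fingerprint(artifacts: list[str]) -> list[float]:
    """Average the per-artifact vectors into one crash fingerprint."""
    vecs = [embed(a) for a in artifacts]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster(fps: list[list[float]], threshold: float = 0.8) -> list[list[int]]:
    """Greedy single-link grouping; a real pipeline would use a
    density-based algorithm such as DBSCAN or HDBSCAN here."""
    groups: list[list[int]] = []
    for idx, fp in enumerate(fps):
        for g in groups:
            if any(cosine(fp, fps[j]) >= threshold for j in g):
                g.append(idx)
                break
        else:
            groups.append([idx])
    return groups

# Two crashes from the same defect (traces differ only in a line number)
# plus one unrelated crash — all hypothetical trace text:
crashes = [
    ["SEGV in parse_header at parser.c:120\nmain at main.c:40"],
    ["SEGV in parse_header at parser.c:121\nmain at main.c:40"],
    ["heap-buffer-overflow in memcpy\nread_file at io.c:77"],
]
groups = cluster([fingerprint(c) for c in crashes])
```

The near-identical traces land in one cluster while the unrelated crash forms its own, illustrating why an approximate, semantics-aware similarity beats exact string matching on traces that differ only in incidental details.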
Results & Findings
| Metric | Hand‑crafted stack‑trace compare | Prior SOTA (e.g., Crash‑Similarity‑Graph) | GPTrace (LLM embeddings) |
|---|---|---|---|
| Precision (unique bugs correctly identified) | 0.71 | 0.78 | 0.86 |
| Recall (all true duplicates merged) | 0.68 | 0.74 | 0.84 |
| F1‑score | 0.69 | 0.76 | 0.85 |
| Runtime (per 10 k crashes) | 12 s | 28 s | 9 s (embedding generation parallelized) |
- GPTrace consistently produced tighter clusters, reducing false‑positive splits where two inputs from the same bug were placed in different groups.
- The approach handled noisy or partially missing stack traces better than exact‑match methods, thanks to the LLM’s ability to infer context.
- Even with a modest GPU (single RTX 3080), the embedding step scaled to the full 300 k‑crash dataset in under an hour.
Practical Implications
- Faster triage – Security teams can shrink weeks‑long crash‑analysis backlogs to a handful of distinct bugs, freeing time for exploit development or patching.
- Integration with CI/CD – GPTrace can be added as a post‑processing step in continuous fuzzing pipelines (e.g., OSS‑Fuzz, ClusterFuzz) to automatically label new crashes and suppress duplicates before they hit issue trackers.
- Reduced storage & bandwidth – By keeping only one representative input per cluster, organizations can cut down on storage costs for crash corpora and simplify artifact sharing across teams.
- Better prioritization – Cluster size becomes a natural signal of bug “popularity”; large clusters often indicate high‑impact bugs that merit immediate attention.
- Language‑agnostic – Since the method works on textual artifacts, it can be applied to any language/runtime that produces a stack trace, from native C/C++ binaries to JVM or .NET applications.
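The cluster-size signal mentioned above is straightforward to act on. The sketch below (with hypothetical bug labels and crash-input names) ranks deduplicated bugs by how many raw inputs collapsed into each cluster:

```python
def prioritize(clusters: dict[str, list[str]]) -> list[tuple[str, int]]:
    # Sort deduplicated bugs by cluster size, largest (most "popular") first.
    return sorted(((bug, len(inputs)) for bug, inputs in clusters.items()),
                  key=lambda item: item[1], reverse=True)

# Hypothetical clustering output: bug label -> raw crash inputs in the cluster.
triage_order = prioritize({
    "bug-A": ["crash_001", "crash_017", "crash_242"],
    "bug-B": ["crash_003"],
    "bug-C": ["crash_009", "crash_330"],
})
# bug-A (3 inputs) outranks bug-C (2) and bug-B (1).
```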
Limitations & Future Work
- Embedding cost – While cheaper than training a custom model, generating embeddings for massive, continuously growing corpora still incurs compute expense; the authors suggest caching and incremental updates as mitigations.
- Dependence on LLM quality – The deduplication quality is tied to the underlying LLM’s ability to understand crash‑specific jargon; domain‑specific fine‑tuning could further improve results.
- Edge cases with obfuscated binaries – When stack traces are heavily stripped or mangled, embeddings may lose discriminative power; combining with lightweight dynamic analysis (e.g., coverage fingerprints) is a promising direction.
- Explainability – Clusters are formed in a high‑dimensional space, making it harder for analysts to understand why two crashes were deemed similar; future work could surface the most influential tokens or code snippets that drove the similarity score.
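The caching mitigation listed above can be sketched as a content-addressed store keyed by a hash of the artifact text, so that re-running deduplication over a growing corpus only pays for embeddings of new crashes. This is an illustrative design, not the paper's implementation:

```python
import hashlib

class EmbeddingCache:
    """Memoize embeddings by artifact hash so incremental runs over a
    growing crash corpus only embed previously unseen inputs."""

    def __init__(self, embed_fn):
        self._embed = embed_fn  # e.g., a call out to an embedding API
        self._store: dict[str, list[float]] = {}
        self.misses = 0         # number of actual embedding calls made

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed(text)
        return self._store[key]

# Dummy embedder for illustration: embeds a string as its length.
cache = EmbeddingCache(lambda s: [float(len(s))])
cache.get("trace-1")
cache.get("trace-1")  # cache hit, no second embedding call
cache.get("trace-2")
# Two unique traces -> only two embedding calls despite three lookups.
```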
Overall, GPTrace demonstrates that LLM embeddings are not just for natural‑language tasks—they can become a practical tool in the security engineer’s arsenal for taming the data deluge that modern fuzzing generates.
Authors
- Patrick Herter
- Vincent Ahlrichs
- Ridvan Açilan
- Julian Horsch
Paper Information
- arXiv ID: 2512.01609v1
- Categories: cs.SE
- Published: December 1, 2025