[Paper] CrossCommitVuln-Bench: A Dataset of Multi-Commit Python Vulnerabilities Invisible to Per-Commit Static Analysis
Source: arXiv - 2604.21917v1
Overview
A new benchmark called CrossCommitVuln‑Bench shines a light on a blind spot in today’s static analysis tools: many Python security bugs only become exploitable after a series of seemingly harmless commits. The dataset captures 15 real‑world CVEs where each individual change would pass per‑commit scans, but the combined code introduces a serious vulnerability.
Key Contributions
- Curated multi‑commit vulnerability dataset – 15 real Python CVEs annotated with the exact commit chain that creates the bug.
- Structured annotation schema – each commit is labeled with why it evades per‑commit SAST (e.g., missing data flow, benign API usage).
- Baseline evaluation – runs of popular tools (Semgrep, Bandit) in both per‑commit and cumulative (full‑repo) modes.
- Empirical finding – a per‑commit Cross‑Commit Detection Rate (CCDR) of only 13 %; even cumulative scanning catches just 27 %.
- Open‑source release – dataset, annotations, and evaluation scripts are publicly available for reproducibility and further research.
Methodology
- Vulnerability selection – The authors mined public Python CVEs and filtered for cases where the exploitable condition was introduced across ≥ 2 commits.
- Manual annotation – For each CVE, they recorded:
  - the full commit chain,
  - a concise rationale for why each commit looks safe to a static analyzer, and
  - the point in the chain where the vulnerability becomes observable.
- Tool configuration – Semgrep and Bandit were run with their default rule sets in two modes:
  - Per‑commit – each commit scanned in isolation, mimicking CI pipelines that analyze only the diff.
  - Cumulative – the whole repository snapshot after the final commit, representing a traditional whole‑codebase scan.
- Metrics – Detection was counted if any tool flagged the CVE’s root cause. The Cross‑Commit Detection Rate (CCDR) is the proportion of CVEs caught in per‑commit mode.
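The CCDR metric is straightforward to express in code. Below is a minimal Python sketch using hypothetical per-CVE detection records (the record layout and values are illustrative, not the paper's actual data):

```python
# Minimal sketch of the CCDR computation described above.
# The per-CVE records below are hypothetical, not the paper's data.

def detection_rate(records, mode):
    """Fraction of CVEs flagged by any tool in the given scanning mode."""
    hits = sum(1 for r in records if r[mode])
    return hits / len(records)

# One record per CVE: was it flagged in each scanning mode?
records = [
    {"cve": f"CVE-EXAMPLE-{i}", "per_commit": i < 2, "cumulative": i < 4}
    for i in range(15)
]

ccdr = detection_rate(records, "per_commit")        # 2/15
cumulative = detection_rate(records, "cumulative")  # 4/15
print(f"CCDR: {ccdr:.0%}, cumulative: {cumulative:.0%}")
```

With 2 of 15 and 4 of 15 detections, this reproduces the paper's reported 13 % and 27 % rates.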
Results & Findings
| Evaluation mode | CVEs detected | Detection rate |
|---|---|---|
| Per‑commit (any tool) | 2 / 15 | 13 % |
| Cumulative (any tool) | 4 / 15 | 27 % |
- The two per‑commit detections were qualitatively weak: one was a false positive that developers later suppressed; the other caught only a minor hard‑coded key while missing the main flaw (200+ exposed API endpoints).
- Even when the full codebase was available, the tools missed 73 % of the multi‑commit bugs, confirming that snapshot‑based SAST struggles with vulnerabilities that rely on a temporal combination of changes.
- The dataset exposes concrete patterns (e.g., incremental permission widening, staged configuration exposure) that current rule sets do not capture.
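To see why per-commit diff scanning misses such staged patterns, consider a toy source-and-sink rule. This is purely illustrative, not the paper's tooling or an actual Semgrep/Bandit rule: it flags code only when an untrusted source and a dangerous sink appear in the same scan unit.

```python
# Toy illustration of why per-commit diff scanning misses staged
# vulnerabilities: the rule fires only when a taint source and a
# dangerous sink are both visible in the text being scanned.

def rule_flags(text):
    return "request.args" in text and "yaml.load(" in text

# Hypothetical two-commit chain: each diff looks harmless on its own.
commit_1 = "raw = request.args.get('config')\n"  # adds a taint source
commit_2 = "cfg = yaml.load(raw)\n"              # a later commit adds the sink

per_commit = [rule_flags(commit_1), rule_flags(commit_2)]
cumulative = rule_flags(commit_1 + commit_2)

print(per_commit)   # [False, False] -- each diff passes in isolation
print(cumulative)   # True -- only the combined snapshot is flagged
```

Each commit in isolation lacks half of the pattern, so a diff-only scan stays silent; only the cumulative view contains both the source and the sink.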
Practical Implications
- CI pipelines need more than per‑commit diff scans. Teams should incorporate cumulative analyses or incremental data‑flow tracking that retains context across commits.
- Rule authors can use the annotated commit chains to craft new patterns that detect staged privilege escalations or progressive configuration leaks.
- Security‑focused code review: reviewers should be aware that a series of “harmless” changes can collectively create a risk, prompting checklist items for cross‑commit impact.
- Tool vendors gain a ready‑made benchmark against which to measure and improve their engines for temporal vulnerability detection, potentially leading to next‑generation SAST that reasons over commit histories.
- Developers can adopt defensive coding practices (e.g., explicit security reviews when expanding API surfaces) to mitigate the risk of unintentionally building up a vulnerability over time.
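One way to make API-surface expansion an explicit, reviewable step is a registration guard. The sketch below is a hypothetical illustration of that practice (`EXPOSED_ENDPOINTS`, `register_endpoint`, and the route names are invented for this example, not from the paper):

```python
# Hedged sketch of one defensive practice: require an explicit,
# reviewed allowlist entry before a new endpoint can be exposed.

EXPOSED_ENDPOINTS = {"/health", "/login"}  # updated only via security review


class UnreviewedEndpointError(Exception):
    pass


def register_endpoint(path, routes):
    """Refuse to register any route not on the reviewed allowlist."""
    if path not in EXPOSED_ENDPOINTS:
        raise UnreviewedEndpointError(f"{path} has not passed security review")
    routes[path] = True


routes = {}
register_endpoint("/health", routes)  # on the allowlist: registered
try:
    register_endpoint("/admin/debug", routes)  # a later commit adds this
except UnreviewedEndpointError as exc:
    print(exc)
```

The guard turns a silent, incremental widening of the API surface into a loud failure that forces a review before the new route ships.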
Limitations & Future Work
- Dataset size – only 15 CVEs; while carefully selected, broader coverage (more languages, larger sample) would strengthen generalizability.
- Tool selection – evaluation focused on Semgrep and Bandit; other SAST tools, dynamic analysis, or hybrid approaches might behave differently.
- Annotation subjectivity – the rationale for why a commit evades detection is manually crafted and could vary between annotators.
- Future directions suggested by the authors include: expanding the benchmark to other ecosystems (JavaScript, Go), integrating version‑control‑aware analysis techniques, and developing automated methods to infer cross‑commit vulnerability patterns from commit metadata.
Authors
- Arunabh Majumdar
Paper Information
- arXiv ID: 2604.21917v1
- Categories: cs.CR, cs.SE
- Published: April 23, 2026