[Paper] CrossCommitVuln-Bench: A Dataset of Multi-Commit Python Vulnerabilities Invisible to Per-Commit Static Analysis
Source: arXiv - 2604.21917v1
Overview
A new benchmark called CrossCommitVuln‑Bench shines a light on a blind spot in today’s static analysis tools: many Python security bugs only become exploitable after a series of seemingly harmless commits. The dataset captures 15 real‑world CVEs where each individual change would pass per‑commit scans, but the combined code introduces a serious vulnerability.
Key Contributions
- Curated multi‑commit vulnerability dataset – 15 real Python CVEs annotated with the exact commit chain that creates the bug.
- Structured annotation schema – each commit is labeled with why it evades per‑commit SAST (e.g., missing data flow, benign API usage).
- Baseline evaluation – runs of popular tools (Semgrep, Bandit) in both per‑commit and cumulative (full‑repo) modes.
- Empirical finding – a per‑commit Cross‑Commit Detection Rate (CCDR) of only 13 %; even cumulative scanning catches just 27 %.
- Open‑source release – dataset, annotations, and evaluation scripts are publicly available for reproducibility and further research.
Methodology
- Vulnerability selection – The authors mined public Python CVEs and filtered for cases where the exploitable condition was introduced across ≥ 2 commits.
- Manual annotation – For each CVE, they recorded:
  - the full commit chain,
  - a concise rationale for why each commit looks safe to a static analyzer, and
  - the point in the chain where the vulnerability becomes observable.
- Tool configuration – Semgrep and Bandit were run with their default rule sets in two modes:
  - Per‑commit – each commit scanned in isolation, mimicking CI pipelines that analyze only the diff.
  - Cumulative – the whole repository snapshot after the final commit, representing a traditional whole‑codebase scan.
- Metrics – Detection was counted if any tool flagged the CVE’s root cause. The Cross‑Commit Detection Rate (CCDR) is the proportion of CVEs caught in per‑commit mode.
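The CCDR metric is straightforward to express in code. Below is a minimal Python sketch using hypothetical per-CVE detection records (the record layout and values are illustrative, not the paper's actual data):

```python
# Minimal sketch of the CCDR computation described above.
# The per-CVE records below are hypothetical, not the paper's data.

def detection_rate(records, mode):
    """Fraction of CVEs flagged by any tool in the given scanning mode."""
    hits = sum(1 for r in records if r[mode])
    return hits / len(records)

# One record per CVE: was it flagged in each scanning mode?
records = [
    {"cve": f"CVE-EXAMPLE-{i}", "per_commit": i < 2, "cumulative": i < 4}
    for i in range(15)
]

ccdr = detection_rate(records, "per_commit")        # 2/15
cumulative = detection_rate(records, "cumulative")  # 4/15
print(f"CCDR: {ccdr:.0%}, cumulative: {cumulative:.0%}")
```

With 2 of 15 and 4 of 15 detections, this reproduces the paper's reported 13 % and 27 % rates.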
Results & Findings
| Evaluation mode | CVEs detected | Detection rate |
|---|---|---|
| Per‑commit (any tool) | 2 / 15 | 13 % |
| Cumulative (any tool) | 4 / 15 | 27 % |
- The two per‑commit detections were qualitatively weak: one was a false positive that developers later suppressed; the other caught only a minor hard‑coded key while missing the main flaw (200+ exposed API endpoints).
- Even when the full codebase was available, the tools missed 73 % of the multi‑commit bugs, confirming that snapshot‑based SAST struggles with vulnerabilities that rely on a temporal combination of changes.
- The dataset exposes concrete patterns (e.g., incremental permission widening, staged configuration exposure) that current rule sets do not capture.
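To see why per-commit diff scanning misses such staged patterns, consider a toy source-and-sink rule. This is purely illustrative, not the paper's tooling or an actual Semgrep/Bandit rule: it flags code only when an untrusted source and a dangerous sink appear in the same scan unit.

```python
# Toy illustration of why per-commit diff scanning misses staged
# vulnerabilities: the rule fires only when a taint source and a
# dangerous sink are both visible in the text being scanned.

def rule_flags(text):
    return "request.args" in text and "yaml.load(" in text

# Hypothetical two-commit chain: each diff looks harmless on its own.
commit_1 = "raw = request.args.get('config')\n"  # adds a taint source
commit_2 = "cfg = yaml.load(raw)\n"              # a later commit adds the sink

per_commit = [rule_flags(commit_1), rule_flags(commit_2)]
cumulative = rule_flags(commit_1 + commit_2)

print(per_commit)   # [False, False] -- each diff passes in isolation
print(cumulative)   # True -- only the combined snapshot is flagged
```

Each commit in isolation lacks half of the pattern, so a diff-only scan stays silent; only the cumulative view contains both the source and the sink.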
Practical Implications
- CI pipelines need more than per‑commit diff scans. Teams should incorporate cumulative analyses or incremental data‑flow tracking that retains context across commits.
- Rule authors can use the annotated commit chains to craft new patterns that detect staged privilege escalations or progressive configuration leaks.
- Security‑focused code review: reviewers should be aware that a series of “harmless” changes can collectively create a risk, prompting checklist items for cross‑commit impact.
- Tool vendors gain a ready‑made benchmark against which to measure and improve their engines for temporal vulnerability detection, potentially leading to next‑generation SAST that reasons over commit histories.
- Developers can adopt defensive coding practices (e.g., explicit security reviews when expanding API surfaces) to mitigate the risk of unintentionally building up a vulnerability over time.
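One way to make API-surface expansion an explicit, reviewable step is a registration guard. The sketch below is a hypothetical illustration of that practice (`EXPOSED_ENDPOINTS`, `register_endpoint`, and the route names are invented for this example, not from the paper):

```python
# Hedged sketch of one defensive practice: require an explicit,
# reviewed allowlist entry before a new endpoint can be exposed.

EXPOSED_ENDPOINTS = {"/health", "/login"}  # updated only via security review


class UnreviewedEndpointError(Exception):
    pass


def register_endpoint(path, routes):
    """Refuse to register any route not on the reviewed allowlist."""
    if path not in EXPOSED_ENDPOINTS:
        raise UnreviewedEndpointError(f"{path} has not passed security review")
    routes[path] = True


routes = {}
register_endpoint("/health", routes)  # on the allowlist: registered
try:
    register_endpoint("/admin/debug", routes)  # a later commit adds this
except UnreviewedEndpointError as exc:
    print(exc)
```

The guard turns a silent, incremental widening of the API surface into a loud failure that forces a review before the new route ships.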
Limitations & Future Work
- Dataset size – only 15 CVEs; while carefully selected, broader coverage (more languages, larger sample) would strengthen generalizability.
- Tool selection – evaluation focused on Semgrep and Bandit; other SAST tools, dynamic analysis, or hybrid approaches might behave differently.
- Annotation subjectivity – the rationale for why a commit evades detection is manually crafted and could vary between annotators.
- Future directions suggested by the authors include: expanding the benchmark to other ecosystems (JavaScript, Go), integrating version‑control‑aware analysis techniques, and developing automated methods to infer cross‑commit vulnerability patterns from commit metadata.
Authors
- Arunabh Majumdar
Paper Information
- arXiv ID: 2604.21917v1
- Categories: cs.CR, cs.SE
- Published: April 23, 2026