[Paper] CrossCommitVuln-Bench: A Dataset of Multi-Commit Python Vulnerabilities Invisible to Per-Commit Static Analysis

Published: April 23, 2026 at 01:57 PM EDT
Source: arXiv - 2604.21917v1

Overview

A new benchmark called CrossCommitVuln‑Bench shines a light on a blind spot in today’s static analysis tools: many Python security bugs only become exploitable after a series of seemingly harmless commits. The dataset captures 15 real‑world CVEs where each individual change would pass per‑commit scans, but the combined code introduces a serious vulnerability.

Key Contributions

  • Curated multi‑commit vulnerability dataset – 15 real Python CVEs annotated with the exact commit chain that creates the bug.
  • Structured annotation schema – each commit is labeled with why it evades per‑commit SAST (e.g., missing data flow, benign API usage).
  • Baseline evaluation – runs of popular tools (Semgrep, Bandit) in both per‑commit and cumulative (full‑repo) modes.
  • Empirical finding – the Cross‑Commit Detection Rate (CCDR), i.e. the share of CVEs caught in per‑commit mode, is only 13 %, and cumulative detection reaches just 27 %.
  • Open‑source release – dataset, annotations, and evaluation scripts are publicly available for reproducibility and further research.
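To make the annotation schema concrete, here is a minimal sketch of what a single record might look like. The class and field names (`CveRecord`, `evasion_reason`, `observable_at`) and the sample values are assumptions for illustration only; the released dataset may use a different format.

```python
from dataclasses import dataclass, field


@dataclass
class CommitAnnotation:
    """One commit in a vulnerability chain (hypothetical field names)."""
    sha: str
    evasion_reason: str  # why per-commit SAST sees nothing wrong here


@dataclass
class CveRecord:
    """One benchmark entry: a CVE plus its ordered commit chain."""
    cve_id: str
    chain: list[CommitAnnotation] = field(default_factory=list)
    observable_at: int = -1  # index in `chain` where the flaw becomes visible

    def is_multi_commit(self) -> bool:
        # The benchmark keeps only CVEs introduced across two or more commits.
        return len(self.chain) >= 2


# Placeholder values, not a real dataset entry.
example = CveRecord(
    cve_id="CVE-XXXX-YYYY",
    chain=[
        CommitAnnotation("aaaa111", "missing data flow"),
        CommitAnnotation("bbbb222", "benign API usage"),
    ],
    observable_at=1,
)
print(example.is_multi_commit(), example.chain[example.observable_at].sha)
```

Keeping the rationale on each commit, rather than on the CVE as a whole, mirrors the idea that every step in the chain looks safe for its own reason.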

Methodology

  1. Vulnerability selection – The authors mined public Python CVEs and filtered for cases where the exploitable condition was introduced across ≥ 2 commits.
  2. Manual annotation – For each CVE, they recorded:
    • the full commit chain,
    • a concise rationale for why each commit looks safe to a static analyzer, and
    • the point in the chain where the vulnerability becomes observable.
  3. Tool configuration – Semgrep and Bandit were run with their default rule sets in two modes:
    • Per‑commit – each commit scanned in isolation, mimicking CI pipelines that only analyze the diff.
    • Cumulative – the whole repository snapshot after the final commit, representing traditional “scan‑the‑code‑base” scans.
  4. Metrics – Detection was counted if any tool flagged the CVE’s root cause. The Cross‑Commit Detection Rate (CCDR) is the proportion of CVEs caught in per‑commit mode.
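The metric in step 4 can be sketched in a few lines. This is an illustrative reading of the definition, not the authors' evaluation script; the CVE labels and the assignment of hits to particular tools are made up so that the totals match the reported 2 of 15.

```python
def detection_rate(findings: dict[str, dict[str, bool]]) -> float:
    """Share of CVEs whose root cause was flagged by at least one tool.

    `findings` maps a CVE id to {tool_name: flagged_root_cause}.
    """
    if not findings:
        raise ValueError("no CVEs to score")
    detected = sum(1 for tools in findings.values() if any(tools.values()))
    return detected / len(findings)


# Hypothetical per-commit outcomes shaped to give 2 detections out of 15.
per_commit = {
    f"cve-{i:02d}": {"semgrep": False, "bandit": False} for i in range(15)
}
per_commit["cve-00"]["bandit"] = True
per_commit["cve-01"]["semgrep"] = True

ccdr = detection_rate(per_commit)  # CCDR is the rate in per-commit mode
print(f"CCDR = {ccdr:.0%}")
```

Running the same function over cumulative-mode findings gives the second rate, so the two modes stay directly comparable.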

Results & Findings

Evaluation mode          CVEs detected   Detection rate
Per‑commit (any tool)    2 / 15          13 %
Cumulative (any tool)    4 / 15          27 %
  • The two per‑commit detections were qualitatively weak: one was a false positive that developers later suppressed, and the other caught only a minor hard‑coded key while missing the main flaw (more than 200 exposed API endpoints).
  • Even with the full codebase available, the tools missed 11 of the 15 multi‑commit bugs (73 %), confirming that snapshot‑based SAST struggles with vulnerabilities that depend on a temporal combination of changes.
  • The dataset exposes concrete patterns (e.g., incremental permission widening, staged configuration exposure) that current rule sets do not capture.
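A toy example may help show why a diff‑only scan can miss a staged change. The sketch below is not taken from the dataset and does not reflect how Semgrep or Bandit work internally: the two "commits" and the co‑occurrence rule are invented, and the rule is deliberately naive. As the results above show, real tools often miss these bugs even in cumulative mode.

```python
SOURCE = "request.args"  # toy marker for user-controlled input
SINK = "open("           # toy marker for file access


def toy_rule(text: str) -> bool:
    """Flag text only when a source and a sink appear together."""
    return SOURCE in text and SINK in text


# Commit 1 adds a helper that looks harmless on its own.
commit_1 = "def load(path):\n    return open(path).read()\n"
# Commit 2 wires user input into that helper, with no sink in the diff.
commit_2 = 'def view(request):\n    return load(request.args["f"])\n'

per_commit_hits = [toy_rule(diff) for diff in (commit_1, commit_2)]
cumulative_hit = toy_rule(commit_1 + commit_2)

print(per_commit_hits, cumulative_hit)
```

Each diff contains only half of the pattern, so the rule fires only once both changes are viewed together.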

Practical Implications

  • CI pipelines need more than per‑commit diff scans. Teams should incorporate cumulative analyses or incremental data‑flow tracking that retains context across commits.
  • Rule authors can use the annotated commit chains to craft new patterns that detect staged privilege escalations or progressive configuration leaks.
  • Security‑focused code review – reviewers should be aware that a series of “harmless” changes can collectively create a risk; a checklist item for cross‑commit impact helps prompt that check.
  • Tool vendors have a ready‑made benchmark for measuring and improving how their engines detect temporal vulnerabilities, potentially leading to next‑generation SAST that reasons over commit histories.
  • Developers can adopt defensive coding practices (e.g., explicit security reviews when expanding API surfaces) to mitigate the risk of unintentionally building up a vulnerability over time.
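As a rough sketch of the first point, a CI job could pair its diff‑only scan with a full‑snapshot scan of the same commit. The helper names (`changed_files_argv`, `bandit_argv`, `scan_plan`), the repository path, and the commit id below are hypothetical; the functions only build command lines for `subprocess.run` and do not execute anything. Bandit is used because it appears in the paper's evaluation, but the flags are a plausible setup, not the authors' exact configuration.

```python
def changed_files_argv(repo: str, sha: str) -> list[str]:
    """git command listing the paths touched by a single commit."""
    return ["git", "-C", repo, "diff-tree", "--no-commit-id",
            "--name-only", "-r", sha]


def bandit_argv(targets: list[str], recursive: bool = False) -> list[str]:
    """Bandit command with JSON output; `-r` walks directories."""
    if not targets:
        raise ValueError("bandit needs at least one target")
    flags = ["-f", "json"] + (["-r"] if recursive else [])
    return ["bandit", *flags, *targets]


def scan_plan(repo: str, sha: str, changed_py: list[str]) -> dict[str, list[list[str]]]:
    """Commands for a diff-only scan and a full-snapshot scan of `sha`."""
    checkout = ["git", "-C", repo, "checkout", "--detach", sha]
    plan = {"cumulative": [checkout, bandit_argv([repo], recursive=True)]}
    # Paths reported by git are relative to the repository root, so the
    # per-commit Bandit command is meant to run with cwd=repo.
    if changed_py:
        plan["per_commit"] = [checkout, bandit_argv(changed_py)]
    else:
        plan["per_commit"] = [checkout]  # nothing to scan in this commit
    return plan


plan = scan_plan("/path/to/repo", "abc1234", ["app/views.py"])
for mode, commands in plan.items():
    print(mode, commands)
```

Given the 27 % cumulative detection rate reported above, a full‑snapshot scan is a complement to review and new rules rather than a fix by itself.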

Limitations & Future Work

  • Dataset size – only 15 CVEs; the set is carefully selected, but broader coverage (more languages, a larger sample) would strengthen generalizability.
  • Tool selection – evaluation focused on Semgrep and Bandit; other SAST, dynamic analysis, or hybrid tools might behave differently.
  • Annotation subjectivity – the rationale for why a commit evades detection is manually crafted and could vary between annotators.
  • Future directions suggested by the authors include: expanding the benchmark to other ecosystems (JavaScript, Go), integrating version‑control‑aware analysis techniques, and developing automated methods to infer cross‑commit vulnerability patterns from commit metadata.

Authors

  • Arunabh Majumdar

Paper Information

  • arXiv ID: 2604.21917v1
  • Categories: cs.CR, cs.SE
  • Published: April 23, 2026