[Paper] Beyond Function-Level Analysis: Context-Aware Reasoning for Inter-Procedural Vulnerability Detection

Published: February 6, 2026 at 09:49 AM EST
3 min read
Source: arXiv - 2602.06751v1

Overview

The paper introduces CPRVul, a new framework that moves vulnerability detection beyond the traditional “single‑function” view. By intelligently pulling in and reasoning over the surrounding code context, CPRVul achieves markedly higher detection accuracy on several real‑world vulnerability datasets.

Key Contributions

  • Context‑aware pipeline that profiles, scores, and selects only the most relevant inter‑procedural code snippets for analysis.
  • Structured reasoning using large language models (LLMs) that generate step‑by‑step security traces instead of a single binary prediction.
  • Code Property Graph (CPG) integration to capture data‑, control‑, and call‑graph relationships, enabling precise context extraction.
  • Empirical gains: a 12.6‑point absolute (22.9 % relative) improvement on the PrimeVul benchmark (67.78 % vs. 55.17 % accuracy) and consistent lifts on TitanVul and CleanVul.
  • Ablation study showing that raw context hurts performance, while the combination of curated context + reasoning yields the boost.
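The "profile, score, select" contribution can be sketched as a greedy top‑k selection under a token budget. This is an illustrative reconstruction, not the paper's implementation; `ContextCandidate`, `select_context`, and `TOKEN_BUDGET` are hypothetical names, and word count stands in for a real tokenizer.

```python
# Hypothetical sketch of CPRVul-style context selection: rank candidate
# snippets by an LLM-assigned relevance score and greedily keep those
# that fit a token budget. All names here are illustrative assumptions.
from dataclasses import dataclass

TOKEN_BUDGET = 4096  # assumed share of the LLM window reserved for context


@dataclass
class ContextCandidate:
    name: str         # e.g. a caller or callee function name
    code: str         # the snippet text
    relevance: float  # security-relevance score from the profiling step


def select_context(candidates: list[ContextCandidate],
                   budget: int = TOKEN_BUDGET) -> list[ContextCandidate]:
    """Greedily keep the highest-scoring snippets that fit the budget."""
    selected, used = [], 0
    for cand in sorted(candidates, key=lambda c: c.relevance, reverse=True):
        cost = len(cand.code.split())  # crude token estimate for the sketch
        if used + cost <= budget:
            selected.append(cand)
            used += cost
    return selected
```

A real pipeline would use the serving model's tokenizer for `cost` and could fall back to truncating individual snippets rather than dropping them outright.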

Methodology

  1. Context Profiling & Selection

    • Build a Code Property Graph for the whole project, linking functions, variables, and control flow.
    • Generate a pool of candidate context nodes (e.g., callers, callees, shared globals).
    • Prompt an LLM with a security‑focused prompt to produce a profile for each candidate (e.g., “does this function handle user input?”).
    • Assign a relevance score; keep only the top‑k items that fit within the LLM’s token window.
  2. Structured Reasoning

    • Assemble a prompt that concatenates:
      • the target function,
      • the selected high‑impact context snippets,
      • auxiliary metadata (CWE IDs, known vulnerable patterns).
    • Ask the LLM to trace its reasoning (e.g., “Step 1: data flows from read() to strcpy(); Step 2: missing bounds check”).
    • Collect these reasoning traces as training data and fine‑tune the LLM to output a final “Vulnerable / Not Vulnerable” label.
  3. Training & Evaluation

    • Fine‑tune on three curated datasets (PrimeVul, TitanVul, CleanVul) that already filter out noisy commits and label errors.
    • Compare against strong baselines like UniXcoder and other function‑only detectors.
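The prompt assembly and verdict extraction in step 2 can be sketched as follows. The template wording, section markers, and the `Verdict:` convention are assumptions for illustration; the paper's exact prompt format is not reproduced here.

```python
# Hypothetical sketch of the structured-reasoning prompt: concatenate the
# target function, selected context, and CWE hints, then ask for numbered
# reasoning steps ending in an explicit verdict line. Illustrative only.
def build_reasoning_prompt(target_fn: str, context_snippets: list[str],
                           cwe_hints: list[str]) -> str:
    parts = [
        "You are a security auditor. Analyse the target function step by step.",
        "## Target function\n" + target_fn,
        "## Related context\n" + "\n\n".join(context_snippets),
        "## CWE patterns to consider: " + ", ".join(cwe_hints),
        "Write numbered reasoning steps (data flow, missing checks), then a "
        "final line: 'Verdict: Vulnerable' or 'Verdict: Not Vulnerable'.",
    ]
    return "\n\n".join(parts)


def parse_verdict(llm_output: str) -> bool:
    """Extract the binary label from the model's reasoning trace."""
    verdicts = [ln for ln in llm_output.splitlines()
                if ln.startswith("Verdict:")]
    return bool(verdicts) and "Not Vulnerable" not in verdicts[-1]
```

Keeping the verdict on a fixed final line makes the free‑form reasoning trace easy to log for fine‑tuning while the label remains machine‑parseable.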

Results & Findings

Dataset     Function‑only baseline (UniXcoder)   CPRVul     Gain
PrimeVul    55.17 %                              67.78 %    +12.6 pts (+22.9 % relative)
TitanVul    56.65 %                              64.94 %    +8.3 pts
CleanVul    63.68 %                              73.76 %    +10.1 pts
  • Raw context hurts: feeding the entire call‑graph to the model degrades accuracy.
  • Processed context alone isn’t enough: selecting snippets without reasoning yields marginal gains.
  • Synergy matters: the biggest jump appears when curated context is paired with the LLM’s step‑wise reasoning trace.
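The gains in the table can be recomputed directly from the reported accuracies, which also clarifies the distinction between absolute percentage points and relative improvement:

```python
# Recompute absolute-point and relative gains from the reported accuracies (%).
results = {
    "PrimeVul": (55.17, 67.78),
    "TitanVul": (56.65, 64.94),
    "CleanVul": (63.68, 73.76),
}

for name, (baseline, cprvul) in results.items():
    abs_pts = cprvul - baseline              # absolute percentage points
    rel = 100 * (cprvul - baseline) / baseline  # relative improvement
    print(f"{name}: +{abs_pts:.1f} pts, +{rel:.1f}% relative")
```

For PrimeVul this gives +12.6 points, i.e. the 22.9 % relative improvement highlighted in the abstract.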

Practical Implications

  • More reliable static analysis tools: Integrating CPRVul‑style reasoning can reduce false positives/negatives that plague current linters and SAST products.
  • Developer‑centric alerts: The generated reasoning trace can be surfaced directly in IDEs, giving engineers a clear “why” behind a vulnerability flag.
  • Scalable code review pipelines: Because CPRVul selects only a handful of high‑impact snippets, it stays within token limits of commercial LLM APIs, making it feasible for CI/CD integration.
  • Cross‑language potential: The CPG abstraction is language‑agnostic, so the approach could be adapted to Java, JavaScript, or Rust with modest effort.
  • Security‑oriented code assistants: Future AI pair‑programmers can leverage the same profiling + reasoning loop to suggest safe refactorings on the fly.

Limitations & Future Work

  • Context window dependency: The selection step is tuned to fit current LLM token limits; larger models or future architectures may require re‑balancing.
  • Dataset bias: Evaluation is limited to three high‑quality, but still C‑centric, vulnerability corpora; performance on other languages or low‑resource projects remains untested.
  • LLM reliance: The quality of the security profile and reasoning trace hinges on the underlying LLM’s knowledge; updates or model drift could affect consistency.
  • Future directions suggested by the authors include: extending the pipeline to handle multi‑module projects, exploring automated feedback loops where the model suggests context‑pruning strategies, and integrating dynamic analysis signals (e.g., runtime taint) to complement the static CPG.

Authors

  • Yikun Li
  • Ting Zhang
  • Jieke Shi
  • Chengran Yang
  • Junda He
  • Xin Zhou
  • Jinfeng Jiang
  • Huihui Huang
  • Wen Bin Leow
  • Yide Yin
  • Eng Lieh Ouh
  • Lwin Khin Shar
  • David Lo

Paper Information

  • arXiv ID: 2602.06751v1
  • Categories: cs.CR, cs.SE
  • Published: February 6, 2026