[Paper] Beyond Function-Level Analysis: Context-Aware Reasoning for Inter-Procedural Vulnerability Detection

Published: February 6, 2026 at 09:49 AM EST
3 min read
Source: arXiv - 2602.06751v1

Overview

The paper introduces CPRVul, a new framework that moves vulnerability detection beyond the traditional “single‑function” view. By intelligently pulling in and reasoning over the surrounding code context, CPRVul achieves markedly higher detection accuracy on several real‑world vulnerability datasets.

Key Contributions

  • Context‑aware pipeline that profiles, scores, and selects only the most relevant inter‑procedural code snippets for analysis.
  • Structured reasoning using large language models (LLMs) that generate step‑by‑step security traces instead of a single binary prediction.
  • Code Property Graph (CPG) integration to capture data‑, control‑, and call‑graph relationships, enabling precise context extraction.
  • Empirical gains: a 12.6‑point absolute (22.9 % relative) improvement on the PrimeVul benchmark (67.78 % vs. 55.17 % accuracy) and consistent lifts on TitanVul and CleanVul.
  • Ablation study showing that raw context hurts performance, while the combination of curated context + reasoning yields the boost.
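The "profile, score, select" contribution can be sketched as a greedy top‑k selection under a token budget. This is an illustrative reconstruction, not the paper's implementation; `ContextCandidate`, `select_context`, and `TOKEN_BUDGET` are hypothetical names, and word count stands in for a real tokenizer.

```python
# Hypothetical sketch of CPRVul-style context selection: rank candidate
# snippets by an LLM-assigned relevance score and greedily keep those
# that fit a token budget. All names here are illustrative assumptions.
from dataclasses import dataclass

TOKEN_BUDGET = 4096  # assumed share of the LLM window reserved for context


@dataclass
class ContextCandidate:
    name: str         # e.g. a caller or callee function name
    code: str         # the snippet text
    relevance: float  # security-relevance score from the profiling step


def select_context(candidates: list[ContextCandidate],
                   budget: int = TOKEN_BUDGET) -> list[ContextCandidate]:
    """Greedily keep the highest-scoring snippets that fit the budget."""
    selected, used = [], 0
    for cand in sorted(candidates, key=lambda c: c.relevance, reverse=True):
        cost = len(cand.code.split())  # crude token estimate for the sketch
        if used + cost <= budget:
            selected.append(cand)
            used += cost
    return selected
```

A real pipeline would use the serving model's tokenizer for `cost` and could fall back to truncating individual snippets rather than dropping them outright.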

Methodology

  1. Context Profiling & Selection

    • Build a Code Property Graph for the whole project, linking functions, variables, and control flow.
    • Generate a pool of candidate context nodes (e.g., callers, callees, shared globals).
    • Prompt an LLM with a security‑focused prompt to produce a profile for each candidate (e.g., “does this function handle user input?”).
    • Assign a relevance score; keep only the top‑k items that fit within the LLM’s token window.
  2. Structured Reasoning

    • Assemble a prompt that concatenates:
      • the target function,
      • the selected high‑impact context snippets,
      • auxiliary metadata (CWE IDs, known vulnerable patterns).
    • Ask the LLM to trace its reasoning (e.g., “Step 1: data flows from read() to strcpy(); Step 2: missing bounds check”).
    • Collect these reasoning traces as training data and fine‑tune the LLM to output a final “Vulnerable / Not Vulnerable” label.
  3. Training & Evaluation

    • Fine‑tune on three curated datasets (PrimeVul, TitanVul, CleanVul) that already filter out noisy commits and label errors.
    • Compare against strong baselines like UniXcoder and other function‑only detectors.
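The prompt assembly and verdict extraction in step 2 can be sketched as follows. The template wording, section markers, and the `Verdict:` convention are assumptions for illustration; the paper's exact prompt format is not reproduced here.

```python
# Hypothetical sketch of the structured-reasoning prompt: concatenate the
# target function, selected context, and CWE hints, then ask for numbered
# reasoning steps ending in an explicit verdict line. Illustrative only.
def build_reasoning_prompt(target_fn: str, context_snippets: list[str],
                           cwe_hints: list[str]) -> str:
    parts = [
        "You are a security auditor. Analyse the target function step by step.",
        "## Target function\n" + target_fn,
        "## Related context\n" + "\n\n".join(context_snippets),
        "## CWE patterns to consider: " + ", ".join(cwe_hints),
        "Write numbered reasoning steps (data flow, missing checks), then a "
        "final line: 'Verdict: Vulnerable' or 'Verdict: Not Vulnerable'.",
    ]
    return "\n\n".join(parts)


def parse_verdict(llm_output: str) -> bool:
    """Extract the binary label from the model's reasoning trace."""
    verdicts = [ln for ln in llm_output.splitlines()
                if ln.startswith("Verdict:")]
    return bool(verdicts) and "Not Vulnerable" not in verdicts[-1]
```

Keeping the verdict on a fixed final line makes the free‑form reasoning trace easy to log for fine‑tuning while the label remains machine‑parseable.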

Results & Findings

Dataset     Function‑only baseline (UniXcoder)   CPRVul     Gain
PrimeVul    55.17 %                              67.78 %    +12.6 pts (+22.9 % relative)
TitanVul    56.65 %                              64.94 %    +8.3 pts
CleanVul    63.68 %                              73.76 %    +10.1 pts
  • Raw context hurts: feeding the entire call‑graph to the model degrades accuracy.
  • Processed context alone isn’t enough: selecting snippets without reasoning yields marginal gains.
  • Synergy matters: the biggest jump appears when curated context is paired with the LLM’s step‑wise reasoning trace.
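The gains in the table can be recomputed directly from the reported accuracies, which also clarifies the distinction between absolute percentage points and relative improvement:

```python
# Recompute absolute-point and relative gains from the reported accuracies (%).
results = {
    "PrimeVul": (55.17, 67.78),
    "TitanVul": (56.65, 64.94),
    "CleanVul": (63.68, 73.76),
}

for name, (baseline, cprvul) in results.items():
    abs_pts = cprvul - baseline              # absolute percentage points
    rel = 100 * (cprvul - baseline) / baseline  # relative improvement
    print(f"{name}: +{abs_pts:.1f} pts, +{rel:.1f}% relative")
```

For PrimeVul this gives +12.6 points, i.e. the 22.9 % relative improvement highlighted in the abstract.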

Practical Implications

  • More reliable static analysis tools: Integrating CPRVul‑style reasoning can reduce false positives/negatives that plague current linters and SAST products.
  • Developer‑centric alerts: The generated reasoning trace can be surfaced directly in IDEs, giving engineers a clear “why” behind a vulnerability flag.
  • Scalable code review pipelines: Because CPRVul selects only a handful of high‑impact snippets, it stays within token limits of commercial LLM APIs, making it feasible for CI/CD integration.
  • Cross‑language potential: The CPG abstraction is language‑agnostic, so the approach could be adapted to Java, JavaScript, or Rust with modest effort.
  • Security‑oriented code assistants: Future AI pair‑programmers can leverage the same profiling + reasoning loop to suggest safe refactorings on the fly.

Limitations & Future Work

  • Context window dependency: The selection step is tuned to fit current LLM token limits; larger models or future architectures may require re‑balancing.
  • Dataset bias: Evaluation is limited to three high‑quality, but still C‑centric, vulnerability corpora; performance on other languages or low‑resource projects remains untested.
  • LLM reliance: The quality of the security profile and reasoning trace hinges on the underlying LLM’s knowledge; updates or model drift could affect consistency.
  • Future directions suggested by the authors include: extending the pipeline to handle multi‑module projects, exploring automated feedback loops where the model suggests context‑pruning strategies, and integrating dynamic analysis signals (e.g., runtime taint) to complement the static CPG.

Authors

  • Yikun Li
  • Ting Zhang
  • Jieke Shi
  • Chengran Yang
  • Junda He
  • Xin Zhou
  • Jinfeng Jiang
  • Huihui Huang
  • Wen Bin Leow
  • Yide Yin
  • Eng Lieh Ouh
  • Lwin Khin Shar
  • David Lo

Paper Information

  • arXiv ID: 2602.06751v1
  • Categories: cs.CR, cs.SE
  • Published: February 6, 2026