[Paper] Assessing the Software Security Comprehension of Large Language Models

Published: December 24, 2025 at 10:29 AM EST
4 min read

Source: arXiv - 2512.21238v1

Overview

Large language models (LLMs) are now everyday assistants for developers, from auto‑completing code to suggesting security fixes. But how well do they actually understand software security? The paper Assessing the Software Security Comprehension of Large Language Models systematically measures the security knowledge of five state‑of‑the‑art LLMs (GPT‑4o‑Mini, GPT‑5‑Mini, Gemini‑2.5‑Flash, Llama‑3.1, and Qwen‑2.5) using Bloom’s Taxonomy as a lens for cognitive depth.

Key Contributions

  • Taxonomy‑driven benchmark: Introduces a multi‑level evaluation framework (Remember, Understand, Apply, Analyze, Evaluate, Create) tailored to software security.
  • Diverse data sources: Combines curated MCQs, the SALLM vulnerable‑code suite, university course assessments, real‑world case studies (XBOW), and open‑ended project tasks.
  • Knowledge‑boundary metric: Defines the software security knowledge boundary – the highest Bloom level at which a model remains consistently reliable (≥80 % consistency in this study).
  • Misconception catalog: Identifies 51 recurring error patterns (e.g., “confusing input validation with output encoding”) across models and Bloom levels.
  • Comprehensive comparative analysis: Benchmarks five leading LLMs, revealing systematic strengths and blind spots.

Methodology

  1. Bloom‑based task design – Each security concept is probed at six cognitive depths (illustrated in the sketch after this list).
    • Remember: factual recall (e.g., “What is SQL injection?”).
    • Understand: explain a concept in one’s own words.
    • Apply: locate a vulnerability in a snippet.
    • Analyze: compare two architectural designs for security.
    • Evaluate: critique a security policy or mitigation.
    • Create: synthesize a secure design or write a remediation plan.
  2. Dataset assembly
    • Multiple‑choice questions (≈2,000 items) covering the OWASP Top 10, cryptography basics, etc.
    • SALLM: a curated set of vulnerable code fragments with ground‑truth fixes.
    • Course assessments: mid‑term and final exams from an Intro‑to‑Software‑Security class.
    • XBOW case studies: real incidents (e.g., Log4Shell) requiring root‑cause analysis.
    • Project‑creation tasks: prompts asking the model to design a secure API or threat model.
  3. Prompting & evaluation
    • Uniform zero‑shot prompts for recall tasks; few‑shot examples for higher‑order tasks to mimic realistic developer interactions.
    • Automatic scoring for MCQs; human expert review (2‑person consensus) for open‑ended answers.
  4. Aggregating results – Accuracy per Bloom level, plus the knowledge boundary (the highest level with ≥80 % consistency).
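
To make the aggregation step concrete, here is a minimal sketch in Python of how per‑level accuracy and the knowledge boundary could be computed. It is not the paper’s released harness: the prompt templates are only illustrative, and `ask_llm` and `grade` are hypothetical placeholders for the model API call and the MCQ/expert scoring step.

```python
# Minimal sketch (not the paper's harness): probe each Bloom level, score the
# answers, and derive the knowledge boundary as the highest level that still
# meets the >=80 % consistency threshold described above.
# ask_llm() and grade() are hypothetical placeholders.

BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]
BOUNDARY_THRESHOLD = 0.80  # >= 80 % consistency, per the aggregation step

# Illustrative prompt templates, not the paper's exact wording.
PROMPT_TEMPLATES = {
    "Remember":   "What is {concept}?",
    "Understand": "Explain {concept} in your own words.",
    "Apply":      "Locate the vulnerability in this snippet:\n{snippet}",
    "Analyze":    "Compare the security of these two designs:\n{design_a}\n{design_b}",
    "Evaluate":   "Critique this mitigation:\n{mitigation}",
    "Create":     "Write a remediation plan for: {scenario}",
}


def ask_llm(prompt: str) -> str:
    """Hypothetical model call; swap in a real client for GPT, Gemini, Llama, etc."""
    raise NotImplementedError


def grade(answer: str, ground_truth: str) -> bool:
    """Hypothetical scorer: exact match stands in for MCQ keys / expert review."""
    return answer.strip().lower() == ground_truth.strip().lower()


def accuracy_per_level(tasks: list[dict]) -> dict[str, float]:
    """Accuracy per Bloom level. Each task is {'level', 'prompt', 'answer'}."""
    correct = {lvl: 0 for lvl in BLOOM_LEVELS}
    total = {lvl: 0 for lvl in BLOOM_LEVELS}
    for task in tasks:
        lvl = task["level"]
        total[lvl] += 1
        if grade(ask_llm(task["prompt"]), task["answer"]):
            correct[lvl] += 1
    return {lvl: correct[lvl] / total[lvl] for lvl in BLOOM_LEVELS if total[lvl]}


def knowledge_boundary(accuracy: dict[str, float]) -> str | None:
    """Highest Bloom level reached while it and every level below it stay at or
    above the threshold (assumes lower levels must hold as well)."""
    boundary = None
    for lvl in BLOOM_LEVELS:  # ordered Remember -> Create
        if accuracy.get(lvl, 0.0) >= BOUNDARY_THRESHOLD:
            boundary = lvl
        else:
            break
    return boundary
```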

Results & Findings

| Model | Remember | Understand | Apply | Analyze | Evaluate | Create |
| --- | --- | --- | --- | --- | --- | --- |
| GPT‑4o‑Mini | 96 % | 92 % | 88 % | 61 % | 45 % | 28 % |
| GPT‑5‑Mini | 95 % | 90 % | 85 % | 58 % | 42 % | 26 % |
| Gemini‑2.5‑Flash | 93 % | 88 % | 81 % | 55 % | 38 % | 24 % |
| Llama‑3.1 | 89 % | 81 % | 73 % | 48 % | 33 % | 19 % |
| Qwen‑2.5 | 87 % | 78 % | 70 % | 44 % | 30 % | 17 % |
  • Strong low‑level performance: All models excel at fact recall and basic vulnerability identification (≥85 % accuracy).
  • Sharp drop after “Apply”: Reasoning about architecture, threat modeling, or secure design falls below 60 % for most models.
  • Knowledge boundary: For GPT‑4o‑Mini the boundary sits at the Apply level; for the others it is Understand.
  • Misconception patterns: The 51 error types cluster around “over‑generalizing mitigation advice,” “confusing authentication vs. authorization,” and “missing context‑specific constraints.”

Practical Implications

  • Developer tooling: Auto‑completion or code‑review assistants can be trusted for spotting known patterns (e.g., SQLi, XSS) but should not be relied upon for architectural security reviews or designing secure protocols.
  • Secure‑by‑prompt pipelines: Embedding LLMs in CI/CD for “quick checks” is viable (a sketch follows this list), yet a human security engineer must still validate higher‑order recommendations.
  • Training data focus: The gap suggests that LLM pre‑training lacks deep security reasoning; fine‑tuning on threat‑modeling corpora could raise the knowledge boundary.
  • Compliance automation: For regulatory checklists (e.g., GDPR, PCI‑DSS) that map to factual recall, LLMs can generate draft evidence, but final sign‑off must involve experts.
  • Education & onboarding: New developers can use LLMs as “interactive textbooks” for learning basics, but should treat model‑generated design advice as a starting point, not a definitive solution.
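
As one way to act on the CI/CD point above, the sketch below shows a quick‑check gate that only asks the model about well‑known vulnerability patterns and leaves anything design‑level to a human reviewer. This is an illustration of the workflow implied by the findings, not something from the paper; `ask_llm` is again a hypothetical placeholder for whatever LLM client the pipeline uses.

```python
# Sketch of an LLM "quick check" gate for CI, limited to known vulnerability
# patterns (the Remember/Apply territory where the benchmark shows models are
# most reliable). ask_llm() is a hypothetical placeholder for a real LLM client.
import pathlib
import sys

KNOWN_PATTERN_PROMPT = (
    "You are a code reviewer. List only concrete instances of well-known "
    "vulnerability patterns (SQL injection, XSS, hard-coded credentials) in "
    "this diff. Reply with exactly 'NONE' if there are none.\n\n{diff}"
)


def ask_llm(prompt: str) -> str:
    """Hypothetical model call; wire in the pipeline's actual LLM client."""
    raise NotImplementedError


def quick_check(diff_path: str) -> int:
    """Exit non-zero if the model flags a known pattern, so CI blocks the merge."""
    diff = pathlib.Path(diff_path).read_text()
    findings = ask_llm(KNOWN_PATTERN_PROMPT.format(diff=diff))
    if findings.strip().upper() != "NONE":
        print("LLM quick check flagged possible issues:")
        print(findings)
        print("Reminder: architectural and design-level review still needs a human.")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(quick_check(sys.argv[1]))
```

A gate like this keeps the model inside its measured knowledge boundary; anything at the Analyze level or above should be routed to a security engineer rather than auto‑applied.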

Limitations & Future Work

  • Prompt sensitivity: Results may vary with alternative prompting strategies; the study used a fixed prompt set to emulate typical developer usage.
  • Domain coverage: The benchmark focuses on web‑app security (OWASP Top 10) and does not fully represent embedded, IoT, or cryptographic protocol domains.
  • Human evaluation bandwidth: Open‑ended tasks were judged by a limited pool of experts, which could introduce subjectivity.
  • Model updates: Rapid releases (e.g., GPT‑5‑Mini) may shift the knowledge boundary; continuous benchmarking is needed.

Bottom line: LLMs are already valuable allies for low‑level security tasks, but the leap to autonomous, high‑order security reasoning remains a work in progress. Developers should harness their strengths while keeping a human security expert in the loop for anything beyond “remember‑and‑apply.”

Authors

  • Mohammed Latif Siddiq
  • Natalie Sekerak
  • Antonio Karam
  • Maria Leal
  • Arvin Islam-Gomes
  • Joanna C. S. Santos

Paper Information

  • arXiv ID: 2512.21238v1
  • Categories: cs.SE, cs.CR, cs.LG
  • Published: December 24, 2025