[Paper] R+R: Reassessing Java Security API Misuse in Current LLMs: A Replication on JCA and JSSE APIs with External Security Knowledge

Published: May 29, 2026 at 06:46 AM EDT

Source: arXiv - 2605.31135v1

Overview

A recent replication study revisits Java security API misuse in code generated by large language models (LLMs). The authors test a current proprietary model (GPT‑5.5) and a current open‑weight model (Llama‑3.3‑70B‑Instruct) on the Java Cryptography Architecture (JCA) and Java Secure Socket Extension (JSSE) APIs. They find that newer models do better but still produce insecure code, and that supplying targeted security knowledge in the prompt reduces misuse substantially, in one configuration to zero detectable misuses.

Key Contributions

Replication of a 2024 benchmark on JCA/JSSE misuse, now using GPT‑5.5 and Llama‑3.3‑70B‑Instruct.
Systematic evaluation of external security knowledge (secure code snippets, misuse patterns, developer guides, and secure prompting) and its impact on model output.
Discovery of model‑dependent effects:
- For Llama‑3.3‑70B‑Instruct, secure code examples are the single most powerful knowledge source.
- For GPT‑5.5, explicit misuse‑pattern prompts completely eliminate detectable misuses in syntactically valid programs.
Quantitative analysis of residual problems: even the best‑performing model still emits compilation errors or API‑mismatch code when no external knowledge is supplied.
Practical guidance on how to augment LLM‑driven code generation pipelines with retrieval‑augmented security knowledge.
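
To make "misuse" concrete, here is a minimal sketch of the kind of JCA choice such benchmarks examine. It is illustrative only and is not taken from the paper or its benchmark; the class and method names (AesGcmSketch, newKey, encrypt, decrypt) are hypothetical. A widely cited pitfall is calling Cipher.getInstance("AES"), which on the default SunJCE provider resolves to AES/ECB/PKCS5Padding. The sketch instead requests AES/GCM/NoPadding explicitly and uses a fresh random IV for every message.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

// Hypothetical helper, not from the paper: authenticated encryption with the JCA.
final class AesGcmSketch {
    private static final int IV_BYTES = 12;   // 96-bit IV, the size recommended for GCM
    private static final int TAG_BITS = 128;  // full-length authentication tag

    // Commonly flagged alternative (shown only as a comment):
    //   Cipher.getInstance("AES")  -> AES/ECB/PKCS5Padding on the default SunJCE provider

    static SecretKey newKey() throws Exception {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(256);
        return generator.generateKey();
    }

    // Returns IV || ciphertext || tag so the receiver can recover the IV.
    static byte[] encrypt(SecretKey key, byte[] plaintext) throws Exception {
        byte[] iv = new byte[IV_BYTES];
        new SecureRandom().nextBytes(iv);      // never reuse an IV with the same key
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ciphertext = cipher.doFinal(plaintext);
        byte[] out = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
        return out;
    }

    // Reads the IV from the first 12 bytes; throws if the tag does not verify.
    static byte[] decrypt(SecretKey key, byte[] ivAndCiphertext) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key,
                new GCMParameterSpec(TAG_BITS, ivAndCiphertext, 0, IV_BYTES));
        return cipher.doFinal(ivAndCiphertext, IV_BYTES, ivAndCiphertext.length - IV_BYTES);
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = newKey();
        byte[] boxed = encrypt(key, "hello JCA".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(decrypt(key, boxed), StandardCharsets.UTF_8));
    }
}
```

A short snippet of this shape is roughly what the "secure code example" knowledge type supplies to the model, although the paper's actual snippets may look different.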

Methodology

Benchmark Construction – The authors reused the “Java Security API Misuse” benchmark from Mousavi et al., which contains 50 representative tasks covering common JCA/JSSE usage scenarios (key generation, cipher configuration, TLS socket setup, etc.).
Model Selection – Two state‑of‑the‑art models were evaluated:
- GPT‑5.5 (closed‑source, accessed via OpenAI API).
- Llama‑3.3‑70B‑Instruct (open‑weight, run locally).
Knowledge Injection Strategies – Four types of external security knowledge were retrieved from a curated knowledge base and appended to the prompt:
- Secure code examples (minimal, correct snippets).
- Misuse patterns (common pitfalls to avoid).
- Developer‑guide excerpts (official JCA/JSSE docs).
- Secure prompting (instructions to “produce security‑first code”).
Evaluation Metrics – Generated programs were compiled, executed against a test harness, and classified as:
- Valid & Secure – compiles, runs, and follows best‑practice API usage.
- Valid but Insecure – compiles/runs but contains a known misuse.
- Invalid – fails to compile or targets the wrong API.
Statistical Analysis – Results were aggregated per model, per knowledge type, and per combination to assess additive effects.
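
The knowledge injection step can be sketched as plain prompt assembly. This is a minimal sketch under stated assumptions, not the authors' pipeline: the class name AugmentedPrompt, the section headers, and the prompt layout are all hypothetical, and retrieval from the knowledge base is not modeled (the caller passes in snippets that were already retrieved). Only the four knowledge types come from the study.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical prompt builder: appends retrieved knowledge to the task description.
final class AugmentedPrompt {

    // The four knowledge types named in the study; the header texts are assumptions.
    enum KnowledgeType {
        SECURE_CODE_EXAMPLE("Secure code example"),
        MISUSE_PATTERN("Misuse pattern to avoid"),
        DEVELOPER_GUIDE("Developer guide excerpt"),
        SECURE_PROMPTING("Security instruction");

        final String header;

        KnowledgeType(String header) {
            this.header = header;
        }
    }

    // Emits the task first, then one labeled section per knowledge item,
    // in the iteration order of the supplied map.
    static String build(String task, Map<KnowledgeType, List<String>> knowledge) {
        StringBuilder prompt = new StringBuilder("Task: ").append(task).append('\n');
        for (Map.Entry<KnowledgeType, List<String>> entry : knowledge.entrySet()) {
            for (String item : entry.getValue()) {
                prompt.append("\n[").append(entry.getKey().header).append("]\n")
                      .append(item).append('\n');
            }
        }
        return prompt.toString();
    }

    public static void main(String[] args) {
        Map<KnowledgeType, List<String>> knowledge = new EnumMap<>(KnowledgeType.class);
        knowledge.put(KnowledgeType.MISUSE_PATTERN,
                List.of("Do not request a cipher without naming its mode and padding."));
        System.out.println(build("Encrypt a byte array with AES using the JCA.", knowledge));
    }
}
```

Passing an EnumMap keeps the section order stable across runs, which makes it easier to compare outputs per knowledge type or per combination.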

Results & Findings

| Model | Baseline (no knowledge) | + Secure Code | + Misuse Patterns | + Developer Guide | + Secure Prompt |
| --- | --- | --- | --- | --- | --- |
| GPT‑5.5 | 68% valid, 42% insecure | 81% valid, 15% insecure | 0% insecure (all valid programs secure) | 78% valid, 12% insecure | 85% valid, 8% insecure |
| Llama‑3.3‑70B‑Instruct | 55% valid, 48% insecure | 90% valid, 9% insecure | 70% valid, 30% insecure | 65% valid, 35% insecure | 72% valid, 22% insecure |

Overall improvement: Both models generate more secure code than the models evaluated in the original 2024 study, but misuse rates remain non‑trivial without augmentation.
Model‑specific knowledge impact: Secure code examples dramatically help Llama‑3.3‑70B‑Instruct, while explicit misuse‑pattern prompts are decisive for GPT‑5.5.
Residual errors: Knowledge injection does not remove every failure. GPT‑5.5 still produces roughly 15% outputs that fail to compile, and Llama‑3.3‑70B‑Instruct with secure code examples still leaves about 9% insecure snippets.
Retrieval‑augmented prompting is not a silver bullet; its effectiveness hinges on the model’s internal reasoning capabilities.
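
Percentages like those in the table come from tallying the three outcome classes per model and knowledge type. The sketch below shows that arithmetic; it is not the authors' analysis code, and the names (OutcomeTally, Outcome) are hypothetical. This summary does not state whether "insecure" is measured over all generated programs or only over valid ones, so the sketch computes both and leaves the choice to the reader.

```java
import java.util.List;

// Hypothetical tally of per-program outcomes for one model and one knowledge type.
final class OutcomeTally {

    // Mirrors the three classes from the evaluation: secure, insecure, invalid.
    enum Outcome { VALID_SECURE, VALID_INSECURE, INVALID }

    private static long count(List<Outcome> outcomes, Outcome wanted) {
        return outcomes.stream().filter(o -> o == wanted).count();
    }

    // Percentage of all programs that are valid (secure or insecure).
    static double validRate(List<Outcome> outcomes) {
        if (outcomes.isEmpty()) return 0.0;
        long valid = outcomes.size() - count(outcomes, Outcome.INVALID);
        return 100.0 * valid / outcomes.size();
    }

    // ASSUMPTION A: insecure share measured over all generated programs.
    static double insecureRateOverAll(List<Outcome> outcomes) {
        if (outcomes.isEmpty()) return 0.0;
        return 100.0 * count(outcomes, Outcome.VALID_INSECURE) / outcomes.size();
    }

    // ASSUMPTION B: insecure share measured over valid programs only.
    static double insecureRateAmongValid(List<Outcome> outcomes) {
        long valid = outcomes.size() - count(outcomes, Outcome.INVALID);
        if (valid == 0) return 0.0;
        return 100.0 * count(outcomes, Outcome.VALID_INSECURE) / valid;
    }
}
```

The two readings diverge as soon as some programs are invalid: with 5 secure, 3 insecure, and 2 invalid programs, the insecure rate is 30% over all programs but 37.5% among valid ones.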

Practical Implications

Augment LLM‑powered IDE plugins – Integrate a lightweight retrieval layer that fetches secure JCA/JSSE examples or misuse patterns before sending the prompt to the model.
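Model‑aware knowledge selection – Choose the knowledge type that works best for the model you’re using (e.g., prioritize secure code examples for Llama‑3.3‑70B‑Instruct and misuse‑pattern prompts for GPT‑5.5). The study tested one model of each kind, so this split may not hold for every open‑weight or proprietary model.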
Automated post‑generation validation – Pair LLM output with static analysis tools (e.g., SpotBugs, FindSecBugs) to catch the remaining insecure or non‑compiling code.
Self‑hosted deployments – For teams running Llama‑3.3‑70B‑Instruct locally, a modest knowledge base can bring security performance close to that of a closed‑source premium model, reducing reliance on costly APIs.
Secure prompting best practices – Explicitly ask the model to “follow Java security best practices” and to “avoid deprecated algorithms”; this simple instruction yields measurable gains, especially for GPT‑5.5.
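
As a rough illustration of a post-generation gate, the sketch below scans generated source text for a few well-known weak algorithm and protocol names. It is a deliberately naive stand-in and an assumption of this write-up, not something from the paper: the class name NaiveMisuseScan and its rule set are hypothetical, the rules are far from complete, and a regex over source text is no substitute for SpotBugs with the FindSecBugs plugin, which analyze compiled bytecode.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical, intentionally small text scanner for generated Java code.
final class NaiveMisuseScan {

    // Rule name -> pattern. The rule set is an illustrative assumption.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("ECB mode or provider-default mode",
                Pattern.compile("Cipher\\.getInstance\\(\\s*\"(AES|[^\"]*/ECB/[^\"]*)\""));
        RULES.put("DES or 3DES cipher",
                Pattern.compile("Cipher\\.getInstance\\(\\s*\"(DES|DESede)[/\"]"));
        RULES.put("weak hash (MD5 or SHA-1)",
                Pattern.compile("MessageDigest\\.getInstance\\(\\s*\"(MD5|SHA-?1)\""));
        RULES.put("obsolete SSL/TLS protocol name",
                Pattern.compile("SSLContext\\.getInstance\\(\\s*\"(SSL|SSLv3|TLSv1|TLSv1\\.1)\""));
    }

    // Returns the names of all rules that match somewhere in the source text.
    static List<String> scan(String source) {
        List<String> findings = new ArrayList<>();
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            if (rule.getValue().matcher(source).find()) {
                findings.add(rule.getKey());
            }
        }
        return findings;
    }

    public static void main(String[] args) {
        String generated = "Cipher c = Cipher.getInstance(\"AES/ECB/PKCS5Padding\");";
        System.out.println(scan(generated));
    }
}
```

A gate like this could reject or re-prompt on any finding before the code reaches a developer, with a real static analyzer run afterwards on whatever compiles.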

Limitations & Future Work

Benchmark scope – The study focuses only on JCA and JSSE; other Java security libraries (e.g., Bouncy Castle, Spring Security) remain untested.
Knowledge base quality – Results depend on the relevance and cleanliness of the retrieved snippets; noisy or outdated knowledge could degrade performance.
Model versions – Rapid model updates may shift the balance of which knowledge type is most effective, requiring continuous re‑evaluation.
Human‑in‑the‑loop – The experiments assume fully automated generation; real‑world developer interaction (editing, reviewing) could further mitigate misuse but was not modeled.
Future directions – Extending the replication to other languages (Python, Go), exploring fine‑tuning with security‑focused data, and building end‑to‑end pipelines that combine retrieval, LLM generation, and static analysis in a single developer workflow.

Authors

Tianhe Lu
Eric Spero
Sakuna Harinda Jayasundara
Robert Biddle
Giovanni Russello

Paper Information

arXiv ID: 2605.31135v1
Categories: cs.CR, cs.SE
Published: May 29, 2026
