[Paper] R+R: Reassessing Java Security API Misuse in Current LLMs: A Replication on JCA and JSSE APIs with External Security Knowledge
Source: arXiv - 2605.31135v1
Overview
A recent replication study revisits Java security API misuse in code generated by large language models (LLMs). The authors test a current proprietary model (GPT‑5.5) and an open‑weight model (Llama‑3.3‑70B‑Instruct) on the Java Cryptography Architecture (JCA) and Java Secure Socket Extension (JSSE) APIs. They show that newer models do better than those in the original study, but both still produce insecure code unless they are supplied with targeted security knowledge.
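To make "API misuse" concrete, the sketch below contrasts one well-known JCA pitfall (AES in ECB mode) with a safer AES-GCM configuration. It is an illustration only: the class and method names are hypothetical, and this summary does not say which specific misuses the benchmark checks.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical demo class; not taken from the paper or its benchmark.
final class JcaMisuseDemo {
    private static final int NONCE_BYTES = 12; // 96-bit nonce, the usual choice for GCM
    private static final int TAG_BITS = 128;   // full-length authentication tag

    static SecretKey newAesKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(256);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Misuse: ECB maps identical plaintext blocks to identical ciphertext blocks.
    static byte[] encryptEcb(SecretKey key, byte[] plaintext) {
        try {
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, key);
            return cipher.doFinal(plaintext);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Safer: AES-GCM with a fresh random nonce, prepended to the ciphertext.
    static byte[] encryptGcm(SecretKey key, byte[] plaintext) {
        try {
            byte[] nonce = new byte[NONCE_BYTES];
            new SecureRandom().nextBytes(nonce);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, nonce));
            byte[] ciphertext = cipher.doFinal(plaintext);
            byte[] message = Arrays.copyOf(nonce, nonce.length + ciphertext.length);
            System.arraycopy(ciphertext, 0, message, nonce.length, ciphertext.length);
            return message;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reads the nonce from the first 12 bytes, then decrypts and verifies the tag.
    static byte[] decryptGcm(SecretKey key, byte[] message) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key,
                    new GCMParameterSpec(TAG_BITS, message, 0, NONCE_BYTES));
            return cipher.doFinal(message, NONCE_BYTES, message.length - NONCE_BYTES);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SecretKey key = newAesKey();
        byte[] twoEqualBlocks = new byte[32]; // two identical all-zero AES blocks
        byte[] ecb = encryptEcb(key, twoEqualBlocks);
        boolean repeats = Arrays.equals(
                Arrays.copyOfRange(ecb, 0, 16), Arrays.copyOfRange(ecb, 16, 32));
        System.out.println("ECB repeats ciphertext blocks: " + repeats);
        byte[] gcm = encryptGcm(key, twoEqualBlocks);
        System.out.println("GCM round trip ok: "
                + Arrays.equals(twoEqualBlocks, decryptGcm(key, gcm)));
    }
}
```

Both variants compile and run, which is why a misuse of this kind would land in a "valid but insecure" bucket rather than fail outright.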
Key Contributions
- Replication of a 2024 benchmark on JCA/JSSE misuse, now using GPT‑5.5 and Llama‑3.3‑70B‑Instruct.
- Systematic evaluation of external security knowledge (secure code snippets, misuse patterns, developer guides, and secure prompting) and its impact on model output.
- Discovery of model‑dependent effects:
  - For Llama‑3.3‑70B‑Instruct, secure code examples are the single most powerful knowledge source.
  - For GPT‑5.5, explicit misuse‑pattern prompts completely eliminate detectable misuses in syntactically valid programs.
- Quantitative analysis of residual problems: even the best‑performing model still emits compilation errors or API‑mismatch code when no external knowledge is supplied.
- Practical guidance on how to augment LLM‑driven code generation pipelines with retrieval‑augmented security knowledge.
Methodology
- Benchmark Construction – The authors reused the “Java Security API Misuse” benchmark from Mousavi et al., which contains 50 representative tasks covering common JCA/JSSE usage scenarios (key generation, cipher configuration, TLS socket setup, etc.).
- Model Selection – Two state‑of‑the‑art models were evaluated:
  - GPT‑5.5 (closed‑source, accessed via OpenAI API).
  - Llama‑3.3‑70B‑Instruct (open‑weight, run locally).
- Knowledge Injection Strategies – Four types of external security knowledge were retrieved from a curated knowledge base and appended to the prompt:
  - Secure code examples (minimal, correct snippets).
  - Misuse patterns (common pitfalls to avoid).
  - Developer‑guide excerpts (official JCA/JSSE docs).
  - Secure prompting (instructions to “produce security‑first code”).
- Evaluation Metrics – Generated programs were compiled, executed against a test harness, and classified as:
  - Valid & Secure – compiles, runs, and follows best‑practice API usage.
  - Valid but Insecure – compiles/runs but contains a known misuse.
  - Invalid – fails to compile or targets the wrong API.
- Statistical Analysis – Results were aggregated per model, per knowledge type, and per combination to assess additive effects.
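The knowledge-injection step described above (retrieve one type of security knowledge, append it to the task prompt) could be sketched as follows. This is a minimal illustration under stated assumptions: `PromptAugmenter`, `KnowledgeType`, the in-memory map standing in for the curated knowledge base, and the prompt layout are all hypothetical, since the summary does not give the authors' retrieval method or prompt template.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the paper's actual retrieval and prompt format are not shown here.
final class PromptAugmenter {
    // The four knowledge types named in the methodology.
    enum KnowledgeType { SECURE_CODE_EXAMPLE, MISUSE_PATTERN, DEVELOPER_GUIDE, SECURE_PROMPT }

    // Stand-in for a curated knowledge base: knowledge type -> retrieved snippets.
    private final Map<KnowledgeType, List<String>> knowledgeBase =
            new EnumMap<>(KnowledgeType.class);

    void add(KnowledgeType type, String... snippets) {
        knowledgeBase.put(type, Arrays.asList(snippets));
    }

    // Appends the snippets of one knowledge type to the task prompt.
    // Returns the task unchanged when nothing is stored for that type (the baseline case).
    String augment(String task, KnowledgeType type) {
        List<String> snippets = knowledgeBase.getOrDefault(type, Collections.emptyList());
        if (snippets.isEmpty()) {
            return task;
        }
        StringBuilder prompt = new StringBuilder(task);
        prompt.append("\n\n### ").append(type).append('\n');
        for (String snippet : snippets) {
            prompt.append("- ").append(snippet).append('\n');
        }
        return prompt.toString();
    }

    public static void main(String[] args) {
        PromptAugmenter augmenter = new PromptAugmenter();
        augmenter.add(KnowledgeType.MISUSE_PATTERN,
                "Do not use ECB mode for block ciphers.",
                "Do not disable certificate validation in a TrustManager.");
        System.out.println(augmenter.augment(
                "Write Java code that encrypts a byte array with AES.",
                KnowledgeType.MISUSE_PATTERN));
    }
}
```

Keeping one knowledge type per call mirrors the per-type comparison in the study; combinations could be built by calling `augment` repeatedly on its own output.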
Results & Findings
| Model | Baseline (no knowledge) | + Secure Code | + Misuse Patterns | + Developer Guide | + Secure Prompt |
|---|---|---|---|---|---|
| GPT‑5.5 | 68 % valid, 42 % insecure | 81 % valid, 15 % insecure | 0 % insecure (all valid programs secure) | 78 % valid, 12 % insecure | 85 % valid, 8 % insecure |
| Llama‑3.3‑70B‑Instruct | 55 % valid, 48 % insecure | 90 % valid, 9 % insecure | 70 % valid, 30 % insecure | 65 % valid, 35 % insecure | 72 % valid, 22 % insecure |
- Overall improvement: Both models generate more secure code than the models evaluated in the original 2024 study, but misuse rates remain non‑trivial without augmentation.
- Model‑specific knowledge impact: Secure code examples dramatically help Llama‑3.3‑70B‑Instruct, while explicit misuse‑pattern prompts are decisive for GPT‑5.5.
- Residual errors: Even with the best knowledge injection, GPT‑5.5 still produces ~15 % compilation‑error outputs, and Llama‑3.3‑70B‑Instruct leaves ~9 % insecure snippets.
- Retrieval‑augmented prompting is not a silver bullet; its effectiveness hinges on the model’s internal reasoning capabilities.
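For readers who want to reproduce figures of the "valid / insecure" kind, the sketch below computes both rates from a list of classified outputs. The names are hypothetical and the sample counts in `main` are made up for illustration, not taken from the paper. It also assumes the insecure rate is measured over valid programs only; the summary does not state which denominator the authors use.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; the denominator for the insecure rate is an assumption.
final class OutcomeStats {
    // The three outcome classes used in the evaluation.
    enum Outcome { VALID_SECURE, VALID_INSECURE, INVALID }

    static long count(List<Outcome> outcomes, Outcome wanted) {
        return outcomes.stream().filter(o -> o == wanted).count();
    }

    // Share of all generated programs that compile and target the right API.
    static double validRate(List<Outcome> outcomes) {
        if (outcomes.isEmpty()) {
            return 0.0;
        }
        long valid = outcomes.size() - count(outcomes, Outcome.INVALID);
        return valid / (double) outcomes.size();
    }

    // Share of valid programs that still contain a known misuse (assumed denominator).
    static double insecureRateAmongValid(List<Outcome> outcomes) {
        long valid = outcomes.size() - count(outcomes, Outcome.INVALID);
        if (valid == 0) {
            return 0.0;
        }
        return count(outcomes, Outcome.VALID_INSECURE) / (double) valid;
    }

    public static void main(String[] args) {
        // Made-up sample: 6 secure, 2 insecure, 2 invalid outputs.
        List<Outcome> sample = new ArrayList<>();
        sample.addAll(Collections.nCopies(6, Outcome.VALID_SECURE));
        sample.addAll(Collections.nCopies(2, Outcome.VALID_INSECURE));
        sample.addAll(Collections.nCopies(2, Outcome.INVALID));
        System.out.printf("valid: %.0f%%, insecure among valid: %.0f%%%n",
                100 * validRate(sample), 100 * insecureRateAmongValid(sample));
    }
}
```

Reporting the two rates separately matters because a knowledge type can raise the valid rate without lowering the insecure rate, or the reverse.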
Practical Implications
- Augment LLM‑powered IDE plugins – Integrate a lightweight retrieval layer that fetches secure JCA/JSSE examples or misuse patterns before sending the prompt to the model.
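- Model‑aware knowledge selection – Choose the knowledge type that works best for the model you are using. In this study that meant secure code examples for Llama‑3.3‑70B‑Instruct and misuse patterns for GPT‑5.5; with only one model of each kind tested, the result may not generalize to all open‑weight or proprietary models.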
- Automated post‑generation validation – Pair LLM output with a compile check and static analysis (e.g., SpotBugs with the Find Security Bugs plugin) to catch the remaining insecure or non‑compiling code.
- Self‑hosted deployments – For teams running Llama‑3.3‑70B‑Instruct locally, a modest knowledge base can bring security performance close to that of a closed‑source premium model, reducing reliance on costly APIs.
- Secure prompting best practices – Explicitly ask the model to “follow Java security best practices” and to “avoid deprecated algorithms”; this simple instruction yields measurable gains, especially for GPT‑5.5.
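As a rough illustration of the post‑generation validation idea, the sketch below scans generated Java source for a few `getInstance` calls with weak algorithm names. It is a toy stand-in, not the SpotBugs or Find Security Bugs API, and it is not how the paper detects misuse: the class name, the regular expression, and the short list of weak algorithms are assumptions chosen for the example, and a real analyzer covers far more cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, deliberately small checker; a real pipeline would use a static analyzer.
final class WeakAlgorithmScanner {
    // Matches calls such as Cipher.getInstance("AES/ECB/PKCS5Padding").
    private static final Pattern GET_INSTANCE = Pattern.compile(
            "(Cipher|MessageDigest|SSLContext)\\.getInstance\\(\\s*\"([^\"]+)\"");

    // Returns one finding per weak algorithm string found in the source text.
    static List<String> scan(String source) {
        List<String> findings = new ArrayList<>();
        Matcher matcher = GET_INSTANCE.matcher(source);
        while (matcher.find()) {
            String api = matcher.group(1);
            String algorithm = matcher.group(2);
            if (isWeak(api, algorithm.toUpperCase(Locale.ROOT))) {
                findings.add(api + ": " + algorithm);
            }
        }
        return findings;
    }

    // Illustrative rules only; the list is an assumption and far from complete.
    private static boolean isWeak(String api, String algorithm) {
        switch (api) {
            case "Cipher":
                // DES and 3DES, RC4, explicit ECB, or bare "AES"
                // (which resolves to ECB mode in the default JDK provider).
                return algorithm.startsWith("DES") || algorithm.startsWith("RC4")
                        || algorithm.contains("/ECB/") || algorithm.equals("AES");
            case "MessageDigest":
                return algorithm.equals("MD5") || algorithm.equals("SHA-1")
                        || algorithm.equals("SHA1");
            case "SSLContext":
                return algorithm.equals("SSL") || algorithm.equals("SSLV3")
                        || algorithm.equals("TLSV1") || algorithm.equals("TLSV1.1");
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        String generated = "Cipher c = Cipher.getInstance(\"AES/ECB/PKCS5Padding\");\n"
                + "MessageDigest d = MessageDigest.getInstance(\"SHA-256\");";
        System.out.println(scan(generated));
    }
}
```

A string-level check like this misses algorithm names built at runtime, so it complements rather than replaces a proper static analysis pass.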
Limitations & Future Work
- Benchmark scope – The study focuses only on JCA and JSSE; other Java security libraries (e.g., Bouncy Castle, Spring Security) remain untested.
- Knowledge base quality – Results depend on the relevance and cleanliness of the retrieved snippets; noisy or outdated knowledge could degrade performance.
- Model versions – Rapid model updates may shift the balance of which knowledge type is most effective, requiring continuous re‑evaluation.
- Human‑in‑the‑loop – The experiments assume fully automated generation; real‑world developer interaction (editing, reviewing) could further mitigate misuse but was not modeled.
- Future directions – Extending the replication to other languages (Python, Go), exploring fine‑tuning with security‑focused data, and building end‑to‑end pipelines that combine retrieval, LLM generation, and static analysis in a single developer workflow.
Authors
- Tianhe Lu
- Eric Spero
- Sakuna Harinda Jayasundara
- Robert Biddle
- Giovanni Russello
Paper Information
- arXiv ID: 2605.31135v1
- Categories: cs.CR, cs.SE
- Published: May 29, 2026