[Paper] DUALGUAGE: Automated Joint Security-Functionality Benchmarking for Secure Code Generation
Source: arXiv - 2511.20709v1
Overview
The paper introduces DUALGUAGE, the first fully automated framework that evaluates both the security and functional correctness of code generated by large language models (LLMs) in a single run. By pairing a curated benchmark suite (DUALGUAGE‑BENCH) with an agentic executor and LLM‑based evaluator, the authors expose how current code‑generation models still struggle to produce code that is simultaneously safe and correct.
Key Contributions
- Joint Benchmarking Framework – DUALGUAGE automatically runs generated programs against combined security and functionality test suites, eliminating the need for separate evaluations.
- DUALGUAGE‑BENCH Dataset – A hand‑validated collection of diverse coding tasks (e.g., cryptography, input validation, file handling) each equipped with test cases that check both functional specs and known vulnerability patterns.
- Agentic Program Executor – A sandboxed runtime that executes generated code and captures both functional outcomes and security-relevant behaviors (e.g., injection, buffer overflow).
- LLM‑Based Evaluator – A secondary LLM interprets execution logs to decide whether the program meets the specification and whether any security property is violated.
- Comprehensive Empirical Study – Benchmarking of ten state‑of‑the‑art LLMs on thousands of test scenarios, revealing systematic gaps in secure code generation.
- Open‑Source Release – All tooling, datasets, and evaluation scripts are publicly available to foster reproducible research and industry adoption.
Methodology
- Task & Test Suite Curation – The authors selected a wide range of programming problems (Python, JavaScript, C) and wrote a dual test suite for each: functional tests that assert expected outputs, and security tests that inject malicious inputs and check for unsafe behavior such as dangerous system calls (a dual-suite sketch follows this list).
- Code Generation – Each LLM receives the same natural‑language prompt describing the task. The model’s response is saved as a source file.
- Agentic Execution – The source file is run inside a Docker-based sandbox (an execution sketch follows this list). The executor records:
- Return values / stdout for functional tests.
- Runtime exceptions, system calls, and any triggered security monitors for vulnerability tests.
- LLM‑Based Scoring – A separate, fine‑tuned LLM reads the execution trace and decides:
- Correctness – Does the program satisfy all functional assertions?
- Security – Does the program exhibit any of the predefined insecure behaviors?
- Metrics Aggregation – Results are collapsed into a joint score (e.g., the percentage of tasks that are both correct and secure) and also reported separately for deeper analysis (an aggregation sketch follows this list).
- Validation – The authors manually audited a random sample of 5 % of the runs to confirm that the automated evaluator’s decisions match human judgment (≥ 94 % agreement).
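To make the dual test suites concrete, here is a minimal sketch of what a paired functional/security suite for a single Python task could look like. The task (`render_greeting`), the injection payloads, and the use of pytest are illustrative assumptions, not artifacts taken from DUALGUAGE-BENCH.

```python
# Hypothetical dual test suite for one task; the function, payloads, and
# framework choice are illustrative, not from the paper's benchmark.
import html
import pytest

# A model-generated solution would be imported here; a reference stub stands in.
def render_greeting(name: str) -> str:
    """Return an HTML fragment greeting the user (stand-in for generated code)."""
    return f"<p>Hello, {html.escape(name)}!</p>"

# --- Functional tests: assert the expected output for benign inputs ---
def test_functional_plain_name():
    assert render_greeting("Alice") == "<p>Hello, Alice!</p>"

def test_functional_unicode_name():
    assert "Zoë" in render_greeting("Zoë")

# --- Security tests: inject malicious input and check the unsafe pattern is absent ---
@pytest.mark.parametrize("payload", [
    "<script>alert(1)</script>",
    '"><img src=x onerror=alert(1)>',
])
def test_security_html_injection_is_escaped(payload):
    rendered = render_greeting(payload)
    # Attacker-controlled markup must never reach the output verbatim.
    assert payload not in rendered
```

A generated program passes the task only when both halves of the suite succeed, which is exactly the joint criterion DUALGUAGE reports.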
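The agentic executor is only described at a high level here; the sketch below shows one plausible way to run a generated file inside an isolated container and capture a small execution trace. The image name `dualguage-sandbox`, the resource limits, and the trace fields are assumptions rather than the paper's actual implementation, which additionally hooks security monitors.

```python
# Minimal sketch of a Docker-based sandboxed run; the image name, limits, and
# trace format are assumptions, not the paper's executor.
import json
import subprocess
from pathlib import Path

def run_in_sandbox(source_file: Path, timeout_s: int = 30) -> dict:
    """Execute a generated Python file in an isolated container and return
    a small execution trace (exit code, stdout, stderr)."""
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",        # no network access for generated code
        "--memory", "512m",         # cap memory usage
        "--cpus", "1",              # cap CPU usage
        "-v", f"{source_file.parent.resolve()}:/task:ro",  # mount code read-only
        "dualguage-sandbox",        # hypothetical image with the test harness preinstalled
        "python", f"/task/{source_file.name}",
    ]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        return {"exit_code": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"exit_code": None, "stdout": "", "stderr": "timeout"}

if __name__ == "__main__":
    print(json.dumps(run_in_sandbox(Path("solution.py")), indent=2))
```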
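Turning per-task verdicts into the reported metrics is then a simple aggregation. The sketch below assumes a hypothetical `TaskResult` record holding the evaluator's two boolean decisions and computes the separate and joint pass rates.

```python
# Sketch of metric aggregation over per-task verdicts; TaskResult is an
# assumed record layout, not the paper's exact schema.
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    functional_pass: bool   # all functional assertions satisfied
    secure_pass: bool       # no predefined insecure behavior observed

def aggregate(results: list[TaskResult]) -> dict:
    """Compute the separate and joint pass rates reported per model."""
    n = len(results)
    functional = sum(r.functional_pass for r in results)
    secure = sum(r.secure_pass for r in results)
    joint = sum(r.functional_pass and r.secure_pass for r in results)
    return {
        "functional_pass_pct": 100 * functional / n,
        "secure_pass_pct": 100 * secure / n,
        "joint_pass_pct": 100 * joint / n,  # a task must be both correct and secure
    }

# Example: a task that is correct but insecure lowers the secure and joint rates.
print(aggregate([
    TaskResult("t1", True, True),
    TaskResult("t2", True, False),
    TaskResult("t3", False, True),
    TaskResult("t4", True, True),
]))  # -> functional 75.0, secure 75.0, joint 50.0
```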
Results & Findings
| Model (size) | Functional Pass % | Secure Pass % | Joint Pass % |
|---|---|---|---|
| LLM‑A (7B) | 68 | 42 | 31 |
| LLM‑B (13B) | 73 | 48 | 36 |
| LLM‑C (34B) | 81 | 55 | 44 |
| LLM‑D (70B) | 86 | 61 | 52 |
| … (others) | … | … | … |
- Security lags behind functionality – Even the strongest model correctly implements the spec on > 80 % of tasks but only avoids known vulnerabilities on ~60 %.
- Common failure modes: missing input sanitization, insecure default configurations, misuse of cryptographic APIs, and unchecked file system access.
- No model achieves > 70 % joint success, indicating a substantial gap before LLMs can be trusted for production‑grade code without human review.
- Cross-language variation – Models perform better on Python than on C, reflecting the larger attack surface inherent to low-level languages.
Practical Implications
- Developer Tooling – IDE plugins that integrate DUALGUAGE can automatically flag generated snippets that pass functional tests but fail security checks, prompting developers to review or rewrite risky code.
- CI/CD Pipelines – Teams can embed the sandboxed executor as a gate in continuous integration, ensuring any AI-generated pull request meets both correctness and security baselines before merge (a gate sketch follows this list).
- Model Training – The benchmark highlights concrete security gaps, guiding data‑curation (e.g., adding more secure coding examples) and fine‑tuning objectives that penalize insecure patterns.
- Compliance & Auditing – Organizations that follow guidance such as the OWASP Top 10 or standards such as ISO 27001 can use the joint scores as evidence of “secure by design” AI code-generation practices.
- Product Roadmaps – Vendors of coding assistants can differentiate their offerings by advertising higher joint pass rates on DUALGUAGE‑BENCH, turning security into a competitive feature.
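As a sketch of the CI/CD gating idea above, the script below fails a build whenever any task in a DUALGUAGE-style report is not both functionally correct and secure. The report filename and JSON fields are assumptions about how such a gate could be wired up, not the released tooling's actual interface.

```python
# Hypothetical CI gate: fail the build unless every evaluated task is both
# correct and secure. The report path and its JSON schema are assumptions.
import json
import sys
from pathlib import Path

REPORT = Path("dualguage-report.json")  # assumed output of the benchmark run

def main() -> int:
    report = json.loads(REPORT.read_text())
    failures = [
        t["task_id"]
        for t in report["tasks"]
        if not (t["functional_pass"] and t["secure_pass"])
    ]
    if failures:
        print("DUALGUAGE gate failed for: " + ", ".join(failures))
        return 1
    print("DUALGUAGE gate passed: generated code is both correct and secure.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```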
Limitations & Future Work
- Scope of Vulnerabilities – The benchmark focuses on classic OWASP‑type issues (injection, insecure crypto, file handling). Advanced attacks (e.g., side‑channel, supply‑chain) are not covered.
- LLM Evaluator Bias – Relying on another LLM for scoring introduces potential systematic bias; although the evaluator was validated against human judgment on a sample, edge cases may still be misclassified.
- Language Coverage – Current tasks span Python, JavaScript, and C; extending to Rust, Go, or Java would broaden applicability.
- Dynamic Test Generation – Future work could automate the creation of security test cases via fuzzing, reducing manual effort and increasing diversity.
- Human‑in‑the‑Loop Studies – Measuring how developers interact with joint feedback (e.g., fixing insecure suggestions) would validate real‑world impact.