[Paper] CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation
Source: arXiv - 2604.12268v1
Overview
The paper introduces CodeSpecBench, a new benchmark that measures how well large language models (LLMs) can generate executable behavioral specifications—pre‑ and post‑conditions written as Python functions that can be run to check a program’s behavior. By shifting the focus from “does the model write code?” to “does the model understand what the code should do?”, the authors expose a gap in current LLM evaluation practices.
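To make "executable behavioral specification" concrete, here is a minimal sketch of what a pre‑/post‑condition pair could look like. The function `clamp` and the names `clamp_precondition` and `clamp_postcondition` are hypothetical illustrations, not examples taken from the benchmark, and the exact spec format used by the paper may differ.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Hypothetical function under specification: restrict value to [low, high]."""
    return max(low, min(value, high))


def clamp_precondition(value: float, low: float, high: float) -> bool:
    """Inputs are acceptable only if the interval is well formed."""
    return low <= high


def clamp_postcondition(value: float, low: float, high: float, result: float) -> bool:
    """The result lies in [low, high] and is unchanged if value was already inside."""
    in_range = low <= result <= high
    unchanged_if_inside = (result == value) if low <= value <= high else True
    return in_range and unchanged_if_inside
```

A spec of this shape can be run directly: `clamp_postcondition(15, 0, 10, clamp(15, 0, 10))` accepts the correct result, while `clamp_postcondition(15, 0, 10, 15)` rejects an out‑of‑range one.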
Key Contributions
- Executable Specification Benchmark – A curated suite of real‑world functions and whole‑repository tasks where the target output is a runnable Python specification rather than source code.
- Execution‑Based Evaluation Protocol – Correctness is measured by actually running the generated spec against a set of valid and invalid inputs, yielding pass rates that capture both acceptance of correct behavior and rejection of incorrect behavior.
- Dual Granularity – Supports both function‑level (single API) and repository‑level (multiple inter‑dependent functions) scenarios, exposing scalability challenges.
- Comprehensive Empirical Study – Benchmarks 15 state‑of‑the‑art LLMs (including GPT‑4, Claude, LLaMA‑2, and CodeLlama) and reports a steep drop in performance on repository‑level tasks (best pass rate ≈ 20 %).
- Open‑Source Release – All data, evaluation scripts, and baseline results are publicly available, encouraging reproducibility and future extensions.
Methodology
- Data Collection – The authors mined several open‑source Python projects (e.g., data‑science libraries, web frameworks) and extracted functions together with their natural‑language docstrings.
- Specification Generation – For each function, a reference specification was hand‑crafted: a Python function that asserts preconditions on inputs and postconditions on outputs.
- Prompt Design – LLMs receive the original docstring (or a short description) and are asked to output a specification in the same executable format.
- Execution‑Based Scoring –
  - A test harness generates a mix of valid and invalid input tuples.
  - The generated spec is executed; it should return `True` for valid cases and raise an `AssertionError` (or return `False`) for invalid ones.
  - The pass rate = (# correctly accepted + # correctly rejected) / total test cases.
- Task Levels –
  - Function‑level: single, isolated function.
  - Repository‑level: a set of functions that call each other, requiring the model to reason about cross‑function contracts.
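The execution‑based scoring step described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' harness: `spec_accepts`, `pass_rate`, `sqrt_spec`, and the test cases are hypothetical names and data introduced here.

```python
from typing import Any, Callable, Iterable, Tuple

# (input tuple, True if the input is valid for the function under test)
Case = Tuple[Tuple[Any, ...], bool]


def spec_accepts(spec: Callable[..., Any], args: Tuple[Any, ...]) -> bool:
    """Run a spec: a truthy return means accept; AssertionError or falsy return means reject."""
    try:
        return bool(spec(*args))
    except AssertionError:
        return False


def pass_rate(spec: Callable[..., Any], cases: Iterable[Case]) -> float:
    """(# correctly accepted + # correctly rejected) / total test cases."""
    cases = list(cases)
    if not cases:
        raise ValueError("at least one test case is required")
    correct = sum(spec_accepts(spec, args) == is_valid for args, is_valid in cases)
    return correct / len(cases)


def sqrt_spec(x: Any) -> bool:
    """Hypothetical generated precondition for a square-root function."""
    assert isinstance(x, (int, float)), "x must be numeric"
    assert x >= 0, "x must be non-negative"
    return True


# Hypothetical mix of valid and invalid inputs.
cases = [((4,), True), ((0,), True), ((-1,), False), (("9",), False)]
print(pass_rate(sqrt_spec, cases))
```

Note the design choice in this sketch: only `AssertionError` counts as a rejection, so a spec that crashes with another exception propagates the error instead of being scored. A real harness would need to decide how to score such cases. A spec that accepts everything, such as `lambda x: True`, scores only 0.5 on these cases, which shows why the metric rewards rejection of invalid inputs as well as acceptance of valid ones.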
Results & Findings
| Model | Function‑level Pass % | Repository‑level Pass % |
|---|---|---|
| GPT‑4 (Chat) | 68.4 | 20.2 |
| Claude 2 | 61.1 | 18.7 |
| CodeLlama 34B | 45.3 | 12.4 |
| LLaMA‑2 13B | 32.0 | 8.9 |
| … (others) | <30 | <10 |
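The relative drop between the two columns can be checked directly from the rows listed above. The sketch below uses only the four named models; the elided "others" row is left out because its exact values are not given.

```python
# Pass rates (%) copied from the table: (function-level, repository-level).
results = {
    "GPT-4 (Chat)": (68.4, 20.2),
    "Claude 2": (61.1, 18.7),
    "CodeLlama 34B": (45.3, 12.4),
    "LLaMA-2 13B": (32.0, 8.9),
}


def relative_drop(function_level: float, repository_level: float) -> float:
    """Fraction of the function-level pass rate lost at repository level."""
    return (function_level - repository_level) / function_level


for model, (func_pct, repo_pct) in results.items():
    print(f"{model}: {relative_drop(func_pct, repo_pct):.1%}")
```

For every listed model the relative drop falls between roughly 69 % and 73 %, that is, more than two‑thirds of the function‑level pass rate.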
- Sharp Drop at Scale – Even the strongest model loses more than two‑thirds of its accuracy when moving from a single function to a full repository.
- Specification vs. Code Generation – Models that achieve >80 % pass rates on traditional code‑generation benchmarks (e.g., HumanEval) still struggle to exceed 30 % on repository‑level specification generation, indicating that “writing code” ≠ “understanding semantics”.
- Error Patterns – Common failures include missing preconditions (e.g., neglecting `None` checks), overly permissive postconditions, and inability to capture invariants that span multiple functions.
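Two of the error patterns above can be illustrated with a small sketch. The specs below are hypothetical examples written for a function that returns the maximum of a non‑empty list; they are not drawn from the paper's data.

```python
from typing import List, Optional


def weak_precondition(items: Optional[List[float]]) -> bool:
    """Missing None check: len(None) raises TypeError instead of rejecting cleanly."""
    return len(items) > 0


def strict_precondition(items: Optional[List[float]]) -> bool:
    """Rejects None and empty lists explicitly."""
    return items is not None and len(items) > 0


def permissive_postcondition(items: List[float], result: float) -> bool:
    """Overly permissive: checks only the type, so any number is accepted."""
    return isinstance(result, (int, float))


def strict_postcondition(items: List[float], result: float) -> bool:
    """The result must be an element of items and no smaller than any element."""
    return result in items and all(result >= x for x in items)
```

For the input `[1, 5, 3]`, the permissive postcondition accepts the wrong result `1`, while the strict version accepts only `5`.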
Practical Implications
- Better QA for AI‑Generated Code – Integrating executable specs into CI pipelines could automatically catch semantic mismatches that unit tests miss, raising the safety bar for LLM‑assisted development.
- Contract‑Driven Development – Developers can prompt LLMs to produce design‑by‑contract artifacts (pre/post conditions, type guards) alongside code, accelerating documentation and defensive programming.
- Model Selection & Fine‑Tuning – Benchmarks like CodeSpecBench give product teams a more nuanced metric when choosing a coding assistant: a model that scores high on code generation may still need fine‑tuning for semantic fidelity.
- Tooling for Specification Synthesis – IDE extensions could suggest executable specs in real time, turning natural‑language comments into runnable contracts that developers can edit and validate instantly.
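One way the contract‑driven workflow above might look in practice is a decorator that attaches pre‑ and post‑conditions to a function so they run on every call, for example during a CI test run. This is a minimal sketch: the `contract` decorator and the `mean` example are hypothetical and not part of CodeSpecBench or any particular library.

```python
import functools
from typing import Any, Callable


def contract(pre: Callable[..., bool], post: Callable[..., bool]) -> Callable:
    """Wrap a function so its precondition and postcondition are checked on each call."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # Note: assert statements are skipped when Python runs with -O.
            assert pre(*args, **kwargs), f"precondition failed for {func.__name__}"
            result = func(*args, **kwargs)
            assert post(result, *args, **kwargs), f"postcondition failed for {func.__name__}"
            return result
        return wrapper
    return decorator


@contract(
    pre=lambda xs: xs is not None and len(xs) > 0,
    post=lambda result, xs: min(xs) <= result <= max(xs),
)
def mean(xs):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(xs) / len(xs)
```

Calling `mean([1, 2, 3])` returns normally, while `mean([])` fails the precondition with an `AssertionError` before the division is attempted.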
Limitations & Future Work
- Language Scope – The benchmark currently targets Python; extending to statically typed languages (Java, Rust) would test spec generation under richer type systems.
- Specification Expressiveness – Only pre‑/post‑conditions are considered; richer formalism (e.g., temporal properties, loop invariants) remains unexplored.
- Test Input Coverage – The evaluation relies on a finite set of generated inputs; adversarial or edge‑case inputs could reveal additional weaknesses.
- Human‑Authored Baselines – While reference specs are hand‑crafted, the study does not compare LLM output against specs written by professional developers under the same prompt constraints.
CodeSpecBench opens a new front in LLM evaluation—moving from “does it compile?” to “does it behave as intended?”—and provides a concrete path for developers to demand deeper semantic understanding from AI coding assistants.
Authors
- Zaoyu Chen
- Jianbo Dai
- Boyu Zhu
- Jingdong Wang
- Huiming Wang
- Xin Xu
- Haoyang Yuan
- Zhijiang Guo
- Xiao-Ming Wu
Paper Information
- arXiv ID: 2604.12268v1
- Categories: cs.SE, cs.CL
- Published: April 14, 2026