[Paper] CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation

Published: April 14, 2026 at 12:31 AM EDT

Source: arXiv - 2604.12268v1

Overview

The paper introduces CodeSpecBench, a benchmark that measures how well large language models (LLMs) can generate executable behavioral specifications: pre‑ and post‑conditions written as Python functions that can be run to check a program’s behavior. By shifting the question from “does the model write code?” to “does the model understand what the code should do?”, the authors expose a gap that code‑generation benchmarks alone do not measure.
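
To make the idea concrete, here is a minimal sketch of what an executable specification could look like. The function `normalize` and the names `precondition` and `postcondition` are hypothetical illustrations, not examples taken from the benchmark, and the paper's actual spec format may differ.

```python
# Hypothetical function under specification (not from CodeSpecBench).
def normalize(values):
    """Scale a non-empty list of non-negative numbers so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def precondition(values):
    """Return True when `values` is an input normalize() is meant to accept."""
    return (
        isinstance(values, list)
        and len(values) > 0
        and all(isinstance(v, (int, float)) and v >= 0 for v in values)
        and sum(values) > 0
    )


def postcondition(values, result):
    """Return True when `result` is a behaviorally correct output for `values`."""
    total = sum(values)
    return (
        len(result) == len(values)
        and abs(sum(result) - 1.0) < 1e-9
        # Each output must keep its share of the original total.
        and all(abs(r * total - v) < 1e-9 for r, v in zip(result, values))
    )
```

Both checks are ordinary Python, so they can be run against concrete inputs and outputs: `precondition([])` is rejected, while `postcondition([1, 3], normalize([1, 3]))` is accepted.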

Key Contributions

Executable Specification Benchmark – A curated suite of real‑world functions and whole‑repository tasks where the target output is a runnable Python specification rather than source code.
Execution‑Based Evaluation Protocol – Correctness is measured by actually running the generated spec against a set of valid and invalid inputs, yielding pass rates that capture both acceptance of correct behavior and rejection of incorrect behavior.
Dual Granularity – Supports both function‑level (single API) and repository‑level (multiple inter‑dependent functions) scenarios, exposing scalability challenges.
Comprehensive Empirical Study – Benchmarks 15 state‑of‑the‑art LLMs (including GPT‑4, Claude, LLaMA‑2, and CodeLlama) and reports a steep drop in performance on repository‑level tasks (best pass rate ≈ 20 %).
Open‑Source Release – All data, evaluation scripts, and baseline results are publicly available, encouraging reproducibility and future extensions.
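
The difference between the two granularities can be illustrated with a small sketch. Everything here is hypothetical (`parse_price`, `apply_discount`, and the `*_pre` / `*_post` checks are invented for illustration): at repository level, the postcondition of one function has to be strong enough to satisfy the precondition of the function that consumes its output.

```python
def parse_price(text):
    """Hypothetical helper: parse a string such as "12.50" into integer cents."""
    dollars, _, cents = text.partition(".")
    return int(dollars) * 100 + int((cents + "00")[:2])


def apply_discount(cents, percent):
    """Hypothetical helper: reduce a price in cents by a whole-number percentage."""
    return cents * (100 - percent) // 100


def parse_price_post(text, result):
    # Function-level contract: the parsed price is a non-negative int.
    return isinstance(result, int) and result >= 0


def apply_discount_pre(cents, percent):
    # Relies on parse_price_post: cents must already be a non-negative int.
    return isinstance(cents, int) and cents >= 0 and 0 <= percent <= 100


def apply_discount_post(cents, percent, result):
    return isinstance(result, int) and 0 <= result <= cents


def checkout_contract_holds(text, percent):
    """Check the chain of contracts across both functions for one input."""
    cents = parse_price(text)
    if not parse_price_post(text, cents):
        return False
    if not apply_discount_pre(cents, percent):
        return False
    return apply_discount_post(cents, percent, apply_discount(cents, percent))
```

A spec for `apply_discount` written in isolation could omit the `cents >= 0` requirement without failing any single-function test; the omission only matters once the two functions are checked together.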

Methodology

Data Collection – The authors mined several open‑source Python projects (e.g., data‑science libraries, web frameworks) and extracted functions together with their natural‑language docstrings.
Specification Generation – For each function, a reference specification was hand‑crafted: a Python function that asserts preconditions on inputs and postconditions on outputs.
Prompt Design – LLMs receive the original docstring (or a short description) and are asked to output a specification in the same executable format.
Execution‑Based Scoring –
- A test harness generates a mix of valid and invalid input tuples.
- The generated spec is executed; it should return True for valid cases and raise an AssertionError (or return False) for invalid ones.
- The pass rate = (# correctly accepted + # correctly rejected) / total test cases.
Task Levels –
- Function‑level: single, isolated function.
- Repository‑level: a set of functions that call each other, requiring the model to reason about cross‑function contracts.
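
The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's released harness: `run_spec`, `pass_rate`, and `isqrt_spec` are hypothetical names, and each test case is assumed to be a tuple of arguments passed directly to the spec.

```python
def run_spec(spec, case):
    """Run a spec on one case: True means accept, False or AssertionError means reject."""
    try:
        return bool(spec(*case))
    except AssertionError:
        return False
    # Any other exception propagates; how the real harness treats it is not stated.


def pass_rate(spec, valid_cases, invalid_cases):
    """(# correctly accepted + # correctly rejected) / total test cases."""
    accepted = sum(run_spec(spec, case) for case in valid_cases)
    rejected = sum(not run_spec(spec, case) for case in invalid_cases)
    return (accepted + rejected) / (len(valid_cases) + len(invalid_cases))


# Hypothetical generated spec for an integer square-root function.
def isqrt_spec(n, result):
    assert isinstance(n, int) and n >= 0, "precondition: n must be a non-negative int"
    return result * result <= n < (result + 1) * (result + 1)


valid = [(0, 0), (10, 3), (16, 4)]       # (input, correct output)
invalid = [(10, 4), (16, 3), (-1, 0)]    # wrong outputs or illegal input

rate = pass_rate(isqrt_spec, valid, invalid)
```

Under this rule a spec that accepts everything, such as `lambda n, result: True`, scores only 0.5 on a balanced test set, which is why invalid cases are needed alongside valid ones.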

Results & Findings

Model	Function‑level Pass %	Repository‑level Pass %
GPT‑4 (Chat)	68.4	20.2
Claude 2	61.1	18.7
CodeLlama 34B	45.3	12.4
LLaMA‑2 13B	32.0	8.9
… (others)	<30	<10
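
As a quick check on the table, the relative drop from function level to repository level can be computed for each named model. The numbers are copied from the rows above; the helper `relative_drop` is just an illustrative name.

```python
# (function-level pass %, repository-level pass %) from the table above.
results = {
    "GPT-4 (Chat)": (68.4, 20.2),
    "Claude 2": (61.1, 18.7),
    "CodeLlama 34B": (45.3, 12.4),
    "LLaMA-2 13B": (32.0, 8.9),
}


def relative_drop(function_level, repository_level):
    """Fraction of the function-level pass rate lost at repository level."""
    return (function_level - repository_level) / function_level


drops = {model: relative_drop(f, r) for model, (f, r) in results.items()}

for model, drop in drops.items():
    print(f"{model}: {drop:.1%}")
```

Each of the four named models loses roughly 69 % to 73 % of its function-level pass rate, so the drop is not specific to weaker models.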

Sharp Drop at Scale – Even the strongest model loses more than two‑thirds of its accuracy when moving from a single function to a full repository.
Specification vs. Code Generation – Models that achieve >80 % pass rates on traditional code‑generation benchmarks (e.g., HumanEval) score far lower here: no model in the table above exceeds 68.4 % at function level or 20.2 % at repository level, indicating that “writing code” ≠ “understanding semantics”.
Error Patterns – Common failures include missing preconditions (e.g., neglecting None checks), overly permissive postconditions, and inability to capture invariants that span multiple functions.
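
Two of these error patterns, a missing precondition and an overly permissive postcondition, are easy to show side by side. The function `first_word` and both specs below are hypothetical examples written for this summary, not model outputs from the paper.

```python
def first_word(text):
    """Hypothetical function: first whitespace-separated word of a non-blank string."""
    return text.split()[0]


def weak_spec(text, result):
    # No precondition at all (None is never rejected), and the postcondition
    # only checks the type, so any string counts as a correct result.
    return isinstance(result, str)


def strict_spec(text, result):
    # Precondition: reject None, non-strings, and blank strings.
    assert isinstance(text, str) and text.strip(), "text must be a non-blank str"
    if not isinstance(result, str) or not result:
        return False
    stripped = text.lstrip()
    rest = stripped[len(result):]
    # Postcondition: result is a whole word located at the start of the text.
    return (
        not any(ch.isspace() for ch in result)
        and stripped.startswith(result)
        and (rest == "" or rest[0].isspace())
    )
```

`weak_spec("hello world", "world")` accepts a wrong answer and `weak_spec(None, "x")` accepts an illegal input, whereas `strict_spec` rejects the first and raises `AssertionError` on the second.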

Practical Implications

Better QA for AI‑Generated Code – Integrating executable specs into CI pipelines could automatically catch semantic mismatches that unit tests miss, raising the safety bar for LLM‑assisted development.
Contract‑Driven Development – Developers can prompt LLMs to produce design‑by‑contract artifacts (pre/post conditions, type guards) alongside code, accelerating documentation and defensive programming.
Model Selection & Fine‑Tuning – Benchmarks like CodeSpecBench give product teams a more nuanced metric when choosing a coding assistant: a model that scores high on code generation may still need fine‑tuning for semantic fidelity.
Tooling for Specification Synthesis – IDE extensions could suggest executable specs in real time, turning natural‑language comments into runnable contracts that developers can edit and validate instantly.
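
One way such contracts could be attached to code is a small decorator that checks a precondition before the call and a postcondition after it. This is a sketch under stated assumptions: `contract` is a hypothetical helper (not part of CodeSpecBench or any particular library), and it only handles positional arguments.

```python
import functools


def contract(pre, post):
    """Wrap a function so its pre- and postcondition run on every call."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args):
            assert pre(*args), f"precondition failed for {func.__name__}{args}"
            result = func(*args)
            assert post(*args, result), f"postcondition failed for {func.__name__}{args}"
            return result
        return wrapper
    return decorate


@contract(
    pre=lambda items: isinstance(items, list) and len(items) > 0,
    post=lambda items, result: result in items and all(result >= x for x in items),
)
def largest(items):
    """Hypothetical example function: return the largest element of a non-empty list."""
    return max(items)
```

Running a test suite against functions wrapped this way turns every call into a contract check. Note that `assert` statements are skipped when Python runs with `-O`, so a production variant would raise an explicit exception instead.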

Limitations & Future Work

Language Scope – The benchmark currently targets Python; extending to statically typed languages (Java, Rust) would test spec generation under richer type systems.
Specification Expressiveness – Only pre‑/post‑conditions are considered; richer formalism (e.g., temporal properties, loop invariants) remains unexplored.
Test Input Coverage – The evaluation relies on a finite set of generated inputs; adversarial or edge‑case inputs could reveal additional weaknesses.
Human‑Authored Baselines – While reference specs are hand‑crafted, the study does not compare LLM output against specs written by professional developers under the same prompt constraints.

CodeSpecBench opens a new front in LLM evaluation—moving from “does it compile?” to “does it behave as intended?”—and provides a concrete path for developers to demand deeper semantic understanding from AI coding assistants.

Authors

Zaoyu Chen
Jianbo Dai
Boyu Zhu
Jingdong Wang
Huiming Wang
Xin Xu
Haoyang Yuan
Zhijiang Guo
Xiao-Ming Wu

Paper Information

arXiv ID: 2604.12268v1
Categories: cs.SE, cs.CL
Published: April 14, 2026
