[Paper] CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation

Published: April 14, 2026 at 12:31 AM EDT

Source: arXiv - 2604.12268v1

Overview

The paper introduces CodeSpecBench, a new benchmark that measures how well large language models (LLMs) can generate executable behavioral specifications—pre‑ and post‑conditions written as Python functions that can be run to check a program’s behavior. By shifting the focus from “does the model write code?” to “does the model understand what the code should do?”, the authors expose a gap in current LLM evaluation practices.
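
To make the idea concrete, here is a minimal sketch of what an executable behavioral specification could look like. The function `clamp` and the names `precondition` and `postcondition` are hypothetical illustrations, not taken from the benchmark, and the paper's actual spec format may differ.

```python
def clamp(value, low, high):
    """Hypothetical function under specification: limit value to [low, high]."""
    return max(low, min(value, high))


def precondition(value, low, high):
    """Inputs are valid only if the bounds are ordered."""
    return low <= high


def postcondition(value, low, high, result):
    """Result lies in [low, high]; an in-range value is returned unchanged."""
    in_range = low <= result <= high
    unchanged_if_inside = (result == value) if low <= value <= high else True
    return in_range and unchanged_if_inside


result = clamp(15, 0, 10)
print(precondition(15, 0, 10))           # True
print(postcondition(15, 0, 10, result))  # True
print(postcondition(15, 0, 10, 15))      # False: 15 is outside [0, 10]
print(precondition(15, 10, 0))           # False: bounds are reversed
```

The spec never re-implements `clamp`; it only states which inputs are legal and which outputs are acceptable, which is what makes it a check on understanding rather than on code writing.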

Key Contributions

  • Executable Specification Benchmark – A curated suite of real‑world functions and whole‑repository tasks where the target output is a runnable Python specification rather than source code.
  • Execution‑Based Evaluation Protocol – Correctness is measured by actually running the generated spec against a set of valid and invalid inputs, yielding pass rates that capture both acceptance of correct behavior and rejection of incorrect behavior.
  • Dual Granularity – Supports both function‑level (single API) and repository‑level (multiple inter‑dependent functions) scenarios, exposing scalability challenges.
  • Comprehensive Empirical Study – Benchmarks 15 state‑of‑the‑art LLMs (including GPT‑4, Claude, LLaMA‑2, and CodeLlama) and reports a steep drop in performance on larger, repository‑wide tasks (best pass rate ≈ 20 %).
  • Open‑Source Release – All data, evaluation scripts, and baseline results are publicly available, encouraging reproducibility and future extensions.

Methodology

  1. Data Collection – The authors mined several open‑source Python projects (e.g., data‑science libraries, web frameworks) and extracted functions together with their natural‑language docstrings.
  2. Specification Generation – For each function, a reference specification was hand‑crafted: a Python function that asserts preconditions on inputs and postconditions on outputs.
  3. Prompt Design – LLMs receive the original docstring (or a short description) and are asked to output a specification in the same executable format.
  4. Execution‑Based Scoring
    • A test harness generates a mix of valid and invalid input tuples.
    • The generated spec is executed; it should return True for valid cases and raise an AssertionError (or return False) for invalid ones.
    • The pass rate = (# correctly accepted + # correctly rejected) / total test cases.
  5. Task Levels
    • Function‑level: single, isolated function.
    • Repository‑level: a set of functions that call each other, requiring the model to reason about cross‑function contracts.
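
The scoring rule in step 4 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' harness: `pass_rate`, `divide_spec`, `accept_everything`, and the hand-written `CASES` list are all hypothetical, and a real harness would generate its inputs rather than list them by hand.

```python
def pass_rate(spec, cases):
    """Fraction of labeled cases on which the spec gives the expected verdict.

    Each case is (args, expected_valid). A spec accepts a case by returning a
    truthy value and rejects it by returning a falsy value or raising
    AssertionError.
    """
    if not cases:
        raise ValueError("at least one test case is required")
    correct = 0
    for args, expected_valid in cases:
        try:
            accepted = bool(spec(*args))
        except AssertionError:
            accepted = False
        if accepted == expected_valid:
            correct += 1
    return correct / len(cases)


def divide_spec(a, b, result):
    """Hypothetical spec for a division function that returns a / b."""
    assert b != 0, "precondition: divisor must be non-zero"
    assert abs(result * b - a) < 1e-9, "postcondition: result * b equals a"
    return True


def accept_everything(a, b, result):
    """Degenerate spec that never rejects anything."""
    return True


# Hand-written stand-ins for generated test cases: (a, b, result), label.
CASES = [
    ((6, 3, 2.0), True),    # valid input, correct output
    ((1, 4, 0.25), True),   # valid input, correct output
    ((6, 0, 0.0), False),   # invalid input: zero divisor
    ((6, 3, 5.0), False),   # valid input, wrong output
]

print(pass_rate(divide_spec, CASES))        # 1.0
print(pass_rate(accept_everything, CASES))  # 0.5
```

Because the metric counts correct rejections as well as correct acceptances, a spec that accepts everything earns credit only on the valid half of a balanced case set.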

Results & Findings

Model         | Function‑level Pass % | Repository‑level Pass %
GPT‑4 (Chat)  | 68.4                  | 20.2
Claude 2      | 61.1                  | 18.7
CodeLlama 34B | 45.3                  | 12.4
LLaMA‑2 13B   | 32.0                  | 8.9
… (others)    | <30                   | <10

  • Sharp Drop at Scale – Even the strongest model, GPT‑4, falls from 68.4 % at function level to 20.2 % at repository level, losing about 70 % of its pass rate when moving from a single function to a full repository.
  • Specification vs. Code Generation – Models that achieve >80 % pass rates on traditional code‑generation benchmarks (e.g., HumanEval) fall well short of that on specification generation, reaching at most 68.4 % at function level and 20.2 % at repository level, indicating that “writing code” ≠ “understanding semantics”.
  • Error Patterns – Common failures include missing preconditions (e.g., neglecting None checks), overly permissive postconditions, and inability to capture invariants that span multiple functions.
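
Two of these error patterns can be illustrated with a small sketch. The function being specified (`largest`, returning the maximum of a non-empty list) and all four spec functions are hypothetical examples, not cases drawn from the paper.

```python
def weak_pre(items):
    """Missing precondition: never rejects None or an empty list."""
    return True


def strict_pre(items):
    """Requires a non-None, non-empty list."""
    return items is not None and len(items) > 0


def weak_post(items, result):
    """Overly permissive postcondition: only checks the result's type."""
    return isinstance(result, (int, float))


def strict_post(items, result):
    """Result must be an element of items and at least as large as all of them."""
    return result in items and all(result >= x for x in items)


data = [3, 7, 5]
buggy_output = 3  # e.g. an implementation that returned the first element

print(weak_pre(None))                   # True: invalid input slips through
print(strict_pre(None))                 # False
print(weak_post(data, buggy_output))    # True: wrong answer is accepted
print(strict_post(data, buggy_output))  # False
print(strict_post(data, 7))             # True
```

Both weak variants still accept every valid case, so they only lose points on the rejection side of an execution-based evaluation.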

Practical Implications

  • Better QA for AI‑Generated Code – Integrating executable specs into CI pipelines could automatically catch semantic mismatches that unit tests miss, raising the safety bar for LLM‑assisted development.
  • Contract‑Driven Development – Developers can prompt LLMs to produce design‑by‑contract artifacts (pre/post conditions, type guards) alongside code, accelerating documentation and defensive programming.
  • Model Selection & Fine‑Tuning – Benchmarks like CodeSpecBench give product teams a more nuanced metric when choosing a coding assistant: a model that scores high on code generation may still need fine‑tuning for semantic fidelity.
  • Tooling for Specification Synthesis – IDE extensions could suggest executable specs in real time, turning natural‑language comments into runnable contracts that developers can edit and validate instantly.
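
One way the contract‑driven idea above might look in practice is a decorator that attaches pre‑ and post‑conditions to a function, so that any test run or CI job exercising the function also checks its contract. This is a sketch under stated assumptions: `contract`, `ContractViolation`, and `apply_discount` are hypothetical names, not part of CodeSpecBench or of any particular library.

```python
import functools


class ContractViolation(AssertionError):
    """Raised when a precondition or postcondition does not hold."""


def contract(pre=None, post=None):
    """Wrap a function so its pre- and post-conditions run on every call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if pre is not None and not pre(*args, **kwargs):
                raise ContractViolation(f"precondition failed for {func.__name__}")
            result = func(*args, **kwargs)
            if post is not None and not post(result, *args, **kwargs):
                raise ContractViolation(f"postcondition failed for {func.__name__}")
            return result
        return wrapper
    return decorator


@contract(
    pre=lambda price, percent: price >= 0 and 0 <= percent <= 100,
    post=lambda result, price, percent: 0 <= result <= price,
)
def apply_discount(price, percent):
    """Hypothetical business function guarded by the contract above."""
    return price * (1 - percent / 100)


print(apply_discount(200, 25))  # 150.0

try:
    apply_discount(200, 150)    # percent outside [0, 100]
except ContractViolation as exc:
    print(exc)                  # precondition failed for apply_discount
```

Keeping the conditions as plain callables means an LLM‑suggested spec can be reviewed, edited, and swapped in without touching the function body.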

Limitations & Future Work

  • Language Scope – The benchmark currently targets Python; extending to statically typed languages (Java, Rust) would test spec generation under richer type systems.
  • Specification Expressiveness – Only pre‑/post‑conditions are considered; richer formalisms (e.g., temporal properties, loop invariants) remain unexplored.
  • Test Input Coverage – The evaluation relies on a finite set of generated inputs; adversarial or edge‑case inputs could reveal additional weaknesses.
  • Human‑Authored Baselines – While reference specs are hand‑crafted, the study does not compare LLM output against specs written by professional developers under the same prompt constraints.

CodeSpecBench opens a new front in LLM evaluation—moving from “does it compile?” to “does it behave as intended?”—and provides a concrete path for developers to demand deeper semantic understanding from AI coding assistants.

Authors

  • Zaoyu Chen
  • Jianbo Dai
  • Boyu Zhu
  • Jingdong Wang
  • Huiming Wang
  • Xin Xu
  • Haoyang Yuan
  • Zhijiang Guo
  • Xiao-Ming Wu

Paper Information

  • arXiv ID: 2604.12268v1
  • Categories: cs.SE, cs.CL
  • Published: April 14, 2026