[Paper] AI for software engineering: from probable to provable
Source: arXiv - 2511.23159v1
Overview
Bertrand Meyer’s paper tackles the hype around “AI‑assisted coding” (often called vibe coding) by exposing its two biggest roadblocks: vague goal specification and the notorious hallucination problem. The author argues that the only way to make AI‑generated code trustworthy is to fuse the creative power of large language models with the rigor of formal specifications and automated verification.
Key Contributions
- A unified vision for combining generative AI with formal methods, turning probabilistic code suggestions into provably correct artifacts.
- A concrete workflow that integrates prompt engineering, formal specification (pre‑/post‑conditions, invariants), and modern proof assistants and program verifiers (e.g., Coq, Isabelle, Dafny); a minimal contract sketch follows this list.
- Prototype tooling that automatically extracts specifications from natural‑language prompts, feeds them to an LLM, and then runs verification passes on the generated code.
- Empirical case studies on classic algorithmic problems (sorting, graph traversal) and a small open‑source project, showing dramatic reductions in hallucination‑induced bugs.
- Guidelines for developers on how to write “verification‑ready” prompts and interpret verification feedback.
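To make the "verification‑ready" idea concrete, here is a minimal sketch of the kind of pre‑/post‑condition contract such a workflow expects a prompt to induce. It is illustrative only and not from the paper, which targets Eiffel/Dafny‑style contracts; the function name and the runtime assertions are stand‑ins for conditions a verifier would discharge statically.

```python
# Illustrative only: the paper works with Eiffel/Dafny-style contracts; this
# Python version merely shows the shape of a "verification-ready" specification.
from collections import Counter
from typing import List


def sort_ascending(xs: List[int]) -> List[int]:
    """Return a new list with the elements of xs in ascending order.

    Contract, Design-by-Contract style:
      require: every element of xs is an integer
      ensure:  the result is sorted and is a permutation of xs
    """
    # Precondition (a `requires` clause in Dafny, `require` in Eiffel).
    assert all(isinstance(x, int) for x in xs), "precondition violated"

    result = sorted(xs)

    # Postconditions (`ensures` clauses a verifier would prove statically).
    assert all(result[i] <= result[i + 1] for i in range(len(result) - 1)), \
        "postcondition violated: result not sorted"
    assert Counter(result) == Counter(xs), \
        "postcondition violated: result not a permutation of the input"
    return result
```

In Dafny or Eiffel the same conditions would appear as `requires`/`ensures` (or `require`/`ensure`) clauses proved at compile time rather than checked at run time.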
Methodology
- Prompt → Specification Translation
- The author builds a lightweight parser that maps natural‑language requirements (the prompt) into a formal contract expressed in Eiffel’s Design by Contract (DbC) style or as first‑order logic formulas.
- AI‑Driven Code Synthesis
- A state‑of‑the‑art LLM (e.g., GPT‑4) receives both the original prompt and the generated contract, producing candidate implementations.
- Automated Verification Loop
- The candidate code is fed to a verification engine (Dafny, Why3, or the Eiffel Verification Environment).
- If verification fails, the engine returns counter‑examples that are automatically turned into refined prompts, iterating until the contract is satisfied or a timeout occurs; an end-to-end sketch of this loop appears after this list.
- Evaluation
- The pipeline is benchmarked against a baseline where the LLM writes code without any verification step. Metrics include the number of hallucinated functions, verification success rate, and developer effort (measured in prompt revisions).
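As a rough, hypothetical sketch of how the steps above could be wired together: the function names below (`extract_contract`, `ask_llm`, `run_verifier`) are placeholders rather than the paper's prototype, and the verifier call assumes a Dafny-style `dafny verify` command line.

```python
# Hypothetical orchestration of the prompt -> spec -> synthesize -> verify loop.
# All function names are placeholders, not the paper's actual tooling.
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class VerificationResult:
    verified: bool
    feedback: str  # verifier output; empty or informational when verified


def extract_contract(prompt: str) -> str:
    """Map the natural-language prompt to a formal contract (stub)."""
    # In the paper this is a lightweight parser targeting DbC / first-order logic.
    raise NotImplementedError


def ask_llm(prompt: str, contract: str) -> str:
    """Ask an LLM for a candidate implementation given prompt + contract (stub)."""
    raise NotImplementedError


def run_verifier(program: str) -> VerificationResult:
    """Run an external verifier on the candidate (assumes a Dafny-style CLI)."""
    with tempfile.NamedTemporaryFile("w", suffix=".dfy", delete=False) as f:
        f.write(program)
        path = Path(f.name)
    # `dafny verify <file>` is the Dafny 4 CLI form; adjust for your toolchain.
    proc = subprocess.run(["dafny", "verify", str(path)],
                          capture_output=True, text=True)
    return VerificationResult(verified=(proc.returncode == 0),
                              feedback=proc.stdout + proc.stderr)


def synthesize_verified(prompt: str, max_rounds: int = 3) -> str | None:
    """Iterate LLM synthesis and verification until the contract is met."""
    contract = extract_contract(prompt)
    refinement = ""
    for _ in range(max_rounds):
        candidate = ask_llm(prompt + refinement, contract)
        result = run_verifier(candidate)
        if result.verified:
            return candidate
        # Turn verifier feedback into a refined prompt for the next round.
        refinement = f"\nPrevious attempt failed verification:\n{result.feedback}"
    return None  # timeout: hand back to the developer
```

The baseline condition in the evaluation corresponds to calling `ask_llm` once and skipping `run_verifier` entirely.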
Results & Findings
- Verification Success: Over 85 % of generated snippets passed formal verification after at most two refinement cycles, compared to <30 % for the baseline.
- Hallucination Reduction: The incidence of completely unrelated or syntactically correct‑but‑semantically wrong code dropped from 42 % to 7 %.
- Prompt Overhead: Adding a formal contract increased prompt length by ~15 % but saved an average of 3–4 manual debugging cycles per function.
- Developer Feedback: Participants reported higher confidence in AI‑generated code when a verification badge was displayed, and they were willing to adopt the workflow for safety‑critical components.
Practical Implications
- Safety‑Critical Systems: Industries such as aerospace, automotive, and medical devices can leverage AI for rapid prototyping while still meeting certification standards through provable contracts.
- Continuous Integration (CI) Pipelines: The verification loop can be embedded as a CI step, automatically rejecting AI‑generated pull requests that fail to meet their formal specs (a gate sketch follows this list).
- Developer Productivity: Teams can offload boilerplate or well‑specified algorithmic work to LLMs, focusing human effort on high‑level design and edge‑case handling.
- Tooling Ecosystem: The paper’s prototype demonstrates that existing proof assistants can be wrapped with thin adapters, opening the door for IDE plugins that surface verification results alongside AI suggestions.
- Prompt Engineering Evolution: By treating prompts as “requirements documents,” the approach nudges developers toward more disciplined, contract‑first thinking—an upside for overall software quality.
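A hedged sketch of what such a CI gate might look like, assuming Dafny sources and the `dafny verify` subcommand; the git-based file discovery and the script itself are illustrative, not the paper's tooling.

```python
# Hypothetical CI gate: verify every Dafny file touched by a pull request and
# fail the build if any contract does not hold.
import subprocess
import sys


def changed_dafny_files(base_ref: str = "origin/main") -> list[str]:
    """List .dfy files changed relative to the base branch."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base_ref, "--", "*.dfy"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line.strip()]


def main() -> int:
    failures = []
    for path in changed_dafny_files():
        # `dafny verify` returns a nonzero exit code when verification fails.
        proc = subprocess.run(["dafny", "verify", path],
                              capture_output=True, text=True)
        if proc.returncode != 0:
            failures.append((path, proc.stdout + proc.stderr))
    for path, log in failures:
        print(f"Verification failed for {path}:\n{log}", file=sys.stderr)
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())
```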
Limitations & Future Work
- Specification Bottleneck: The workflow assumes that developers can write precise formal contracts; in domains lacking mature specification languages, this remains a hurdle.
- Scalability: Verification times grew noticeably for large codebases (e.g., >10 k LOC), suggesting the need for modular verification strategies.
- LLM Dependence: The quality of the generated code still hinges on the underlying model; newer models may reduce the number of refinement cycles further.
- User Study Scope: The empirical evaluation involved a limited set of participants and problem domains; broader industrial trials are required to validate real‑world impact.
- Future Directions: The author proposes extending the pipeline to support probabilistic specifications, integrating counterexample‑guided synthesis, and building a shared repository of “verified prompt‑spec pairs” for community reuse.
Authors
- Bertrand Meyer
Paper Information
- arXiv ID: 2511.23159v1
- Categories: cs.SE, cs.AI
- Published: November 28, 2025