[Paper] AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms
Source: arXiv - 2602.09464v1
Overview
The paper introduces AlgoVeri, a new benchmark that lets researchers and engineers compare how well AI systems can generate formally verified code for classic algorithms across three verification languages: Dafny, Verus, and Lean. By standardising the functional contracts for 77 well‑known algorithms, the authors expose where current models succeed or stumble, offering a clear picture of the state of “vericoding” – the automatic creation of provably correct software.
Key Contributions
- Unified benchmark covering 77 classical algorithms with identical specifications in Dafny, Verus, and Lean.
- Cross‑paradigm evaluation methodology that makes performance numbers directly comparable across languages and tools.
- Comprehensive empirical study of several frontier LLMs (e.g., Gemini‑3 Flash, GPT‑OSS), revealing large gaps between verification systems.
- Dynamic analysis of test‑time compute, showing how iterative “repair” loops can dramatically improve success rates for some models.
- Error taxonomy linking language‑design choices (SMT‑driven automation vs. explicit proof construction) to the types of failures AI models encounter.
- Open‑source release of data, prompts, and evaluation scripts (GitHub link in the paper).
Methodology
Algorithm selection – 77 staple algorithms (sorting, graph traversal, number‑theoretic routines, etc.) were chosen for their pedagogical value and well‑studied verification patterns.
Specification unification – For each algorithm, the authors wrote a single functional contract (pre‑conditions, post‑conditions, and invariants) and then rendered it faithfully in the syntax of Dafny, Verus, and Lean, ensuring a fair “apples‑to‑apples” comparison.
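As an illustration of what such a language‑agnostic contract looks like (this is not one of the paper's actual specifications), a sorting contract reduces to two postconditions — the output is ordered and is a permutation of the input — which each verification language then expresses in its own syntax. A minimal executable sketch in Python:

```python
from collections import Counter

def satisfies_sort_contract(inp: list, out: list) -> bool:
    """Functional contract for sorting, stated as checkable predicates:
    postcondition 1: out is in non-decreasing order;
    postcondition 2: out is a permutation of inp (multiset equality)."""
    ordered = all(out[i] <= out[i + 1] for i in range(len(out) - 1))
    permutation = Counter(out) == Counter(inp)
    return ordered and permutation
```

In Dafny this becomes `ensures` clauses, in Verus `ensures` annotations over ghost state, and in Lean a theorem about the function's result; the benchmark's point is that the underlying predicates stay identical.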
Model prompting – LLMs were prompted with the specification and a brief description of the target language, then asked to generate the full implementation plus any required proof annotations.
Iterative repair – After an initial generation, the system automatically fed back compiler/verification errors to the model, allowing it to produce a corrected version (up to a fixed number of iterations).
Success criteria – An attempt is counted as successful if the generated code compiles and the verifier proves the contract without manual intervention.
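The generate–verify–repair loop described above can be sketched as follows. `generate` and `verify` are hypothetical stand‑ins for the model call and the target language's verifier — this is not the paper's released harness:

```python
def vericoding_loop(spec: str, generate, verify, max_repairs: int = 3):
    """Iterative repair: feed verifier errors back to the model until
    the contract is proved or the repair budget is exhausted.

    generate(spec, feedback) -> candidate program (assumed model API)
    verify(candidate)        -> (ok: bool, errors: str) from the verifier
    """
    feedback = ""
    for attempt in range(max_repairs + 1):  # initial try + repair rounds
        candidate = generate(spec, feedback)
        ok, errors = verify(candidate)
        if ok:
            return candidate, attempt  # success: contract proved, no human help
        feedback = errors  # next round sees the verifier's complaints
    return None, max_repairs  # budget exhausted without a proof
```

Note that success here is exactly the paper's binary criterion: the loop terminates positively only when the verifier discharges the contract with no manual intervention.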
Metrics – Recorded metrics include:
- Overall pass rate
- Per‑language pass rate
- Compute‑time curves (how success improves with more repair iterations)
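Given per‑attempt records of when (if ever) verification succeeded, these metrics reduce to simple counts. A small sketch — the record schema is illustrative, not the benchmark's released format:

```python
def pass_rate(results):
    """results: list of dicts like {"lang": "Dafny", "passed_at": 2},
    where passed_at is the repair iteration at which verification
    succeeded, or None if the attempt never passed."""
    passed = sum(1 for r in results if r["passed_at"] is not None)
    return passed / len(results)

def compute_curve(results, max_iters):
    """Cumulative pass rate after allowing k repair iterations,
    for k = 0 .. max_iters (the test-time-compute curve)."""
    return [
        sum(1 for r in results
            if r["passed_at"] is not None and r["passed_at"] <= k) / len(results)
        for k in range(max_iters + 1)
    ]
```

Per‑language pass rates fall out by filtering `results` on the `"lang"` field before calling `pass_rate`.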
Results & Findings
| Model | Dafny Pass Rate | Verus Pass Rate | Lean Pass Rate |
|---|---|---|---|
| Gemini‑3 Flash | 40.3 % | 24.7 % | 7.8 % |
| GPT‑OSS (4‑shot) | 28.1 % | 12.4 % | 3.2 % |
| Claude‑2 | 22.5 % | 9.8 % | 2.9 % |
Key observations
- Dafny shines – its high‑level abstractions and built‑in SMT automation let models concentrate on logical correctness rather than low‑level proof details.
- Verus performance drops – models struggle with the systems‑level memory model and more intricate ownership annotations.
- Lean lags far behind – the need for explicit, step‑by‑step proof terms overwhelms current LLMs.
- Iterative repair matters – Gemini‑3’s Dafny pass rate triples after three repair cycles, whereas GPT‑OSS plateaus after the first iteration.
- Error patterns differ – in Dafny most failures are logical (incorrect invariants), while in Verus and Lean many attempts abort early due to syntax or type‑checking errors that the model cannot recover from.
Practical Implications
- Tool selection for AI‑assisted development – If you need AI‑generated, verified code today, Dafny (or similar SMT‑backed languages) offers the most realistic path.
- Designing verification languages – The benchmark suggests that language designers should provide higher‑level proof automation if they want to leverage LLMs effectively.
- Iterative “repair‑loop” pipelines – Building a feedback loop that feeds verifier errors back to the model can dramatically improve success; this pattern could be baked into IDE extensions or CI pipelines.
- Benchmark‑driven product roadmaps – Companies building AI coding assistants can use AlgoVeri to pinpoint where their models need better reasoning about memory safety, ownership, or explicit proof construction.
- Educational tooling – AlgoVeri can serve as a training ground for students learning both formal verification and prompt engineering, highlighting the interplay between specification clarity and model performance.
Limitations & Future Work
- Scope of algorithms – While the benchmark includes 77 classic algorithms, they represent only a small, well‑studied subset. Real‑world codebases contain far more complex data structures and concurrency patterns.
- Language coverage – The study evaluates only three verification languages. Extending the benchmark to Coq, Isabelle, or newer Rust‑based verification frameworks would broaden insights.
- Model size & prompting – Experiments used a handful of publicly known models; newer, larger, or instruction‑tuned models could shift the performance landscape.
- Repair budget – The number of allowed repair iterations was fixed. Exploring adaptive budgets or more sophisticated error‑analysis could yield higher success rates.
- Human‑in‑the‑loop evaluation – The current metric is binary (pass/fail). Future work could assess the quality of generated proofs, readability, and maintainability from a developer’s perspective.
AlgoVeri opens the door to systematic, cross‑language assessment of AI‑driven verified code generation, giving developers a concrete way to gauge how close we are to truly trustworthy, automatically produced software.
Authors
- Haoyu Zhao
- Ziran Yang
- Jiawei Li
- Deyuan He
- Zenan Li
- Chi Jin
- Venugopal V. Veeravalli
- Aarti Gupta
- Sanjeev Arora
Paper Information
| Item | Details |
|---|---|
| arXiv ID | 2602.09464v1 |
| Categories | cs.SE, cs.AI, cs.CL |
| Published | February 10, 2026 |