[Paper] AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms
Source: arXiv - 2602.09464v1
Overview
The paper introduces AlgoVeri, a new benchmark that lets researchers and engineers compare how well AI systems can generate formally verified code for classic algorithms across three verification languages: Dafny, Verus, and Lean. By standardising the functional contracts for 77 well‑known algorithms, the authors expose where current models succeed or stumble, offering a clear picture of the state of “vericoding” – the automatic creation of provably correct software.
Key Contributions
- Unified benchmark covering 77 classical algorithms with identical specifications in Dafny, Verus, and Lean.
- Cross‑paradigm evaluation methodology that makes performance numbers directly comparable across languages and tools.
- Comprehensive empirical study of several frontier LLMs (e.g., Gemini‑3 Flash, GPT‑OSS), revealing large gaps between verification systems.
- Dynamic analysis of test‑time compute, showing how iterative “repair” loops can dramatically improve success rates for some models.
- Error taxonomy linking language‑design choices (SMT‑driven automation vs. explicit proof construction) to the types of failures AI models encounter.
- Open‑source release of data, prompts, and evaluation scripts (GitHub link in the paper).
Methodology
Algorithm selection – 77 staple algorithms (sorting, graph traversal, number‑theoretic routines, etc.) were chosen for their pedagogical value and well‑studied verification patterns.
Specification unification – For each algorithm, the authors wrote a single functional contract (pre‑conditions, post‑conditions, and invariants) and then rendered it faithfully in the syntax of Dafny, Verus, and Lean, ensuring a fair “apples‑to‑apples” comparison.
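As an illustration of what such a language‑agnostic contract looks like (this is not one of the paper's actual specifications), a sorting contract reduces to two postconditions — the output is ordered and is a permutation of the input — which each verification language then expresses in its own syntax. A minimal executable sketch in Python:

```python
from collections import Counter

def satisfies_sort_contract(inp: list, out: list) -> bool:
    """Functional contract for sorting, stated as checkable predicates:
    postcondition 1: out is in non-decreasing order;
    postcondition 2: out is a permutation of inp (multiset equality)."""
    ordered = all(out[i] <= out[i + 1] for i in range(len(out) - 1))
    permutation = Counter(out) == Counter(inp)
    return ordered and permutation
```

In Dafny this becomes `ensures` clauses, in Verus `ensures` annotations over ghost state, and in Lean a theorem about the function's result; the benchmark's point is that the underlying predicates stay identical.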
Model prompting – LLMs were prompted with the specification and a brief description of the target language, then asked to generate the full implementation plus any required proof annotations.
Iterative repair – After an initial generation, the system automatically fed back compiler/verification errors to the model, allowing it to produce a corrected version (up to a fixed number of iterations).
Success criteria – An attempt is counted as successful if the generated code compiles and the verifier proves the contract without manual intervention.
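The generate–verify–repair loop described above can be sketched as follows. `generate` and `verify` are hypothetical stand‑ins for the model call and the target language's verifier — this is not the paper's released harness:

```python
def vericoding_loop(spec: str, generate, verify, max_repairs: int = 3):
    """Iterative repair: feed verifier errors back to the model until
    the contract is proved or the repair budget is exhausted.

    generate(spec, feedback) -> candidate program (assumed model API)
    verify(candidate)        -> (ok: bool, errors: str) from the verifier
    """
    feedback = ""
    for attempt in range(max_repairs + 1):  # initial try + repair rounds
        candidate = generate(spec, feedback)
        ok, errors = verify(candidate)
        if ok:
            return candidate, attempt  # success: contract proved, no human help
        feedback = errors  # next round sees the verifier's complaints
    return None, max_repairs  # budget exhausted without a proof
```

Note that success here is exactly the paper's binary criterion: the loop terminates positively only when the verifier discharges the contract with no manual intervention.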
Metrics – Recorded metrics include:
- Overall pass rate
- Per‑language pass rate
- Compute‑time curves (how success improves with more repair iterations)
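Given per‑attempt records of when (if ever) verification succeeded, these metrics reduce to simple counts. A small sketch — the record schema is illustrative, not the benchmark's released format:

```python
def pass_rate(results):
    """results: list of dicts like {"lang": "Dafny", "passed_at": 2},
    where passed_at is the repair iteration at which verification
    succeeded, or None if the attempt never passed."""
    passed = sum(1 for r in results if r["passed_at"] is not None)
    return passed / len(results)

def compute_curve(results, max_iters):
    """Cumulative pass rate after allowing k repair iterations,
    for k = 0 .. max_iters (the test-time-compute curve)."""
    return [
        sum(1 for r in results
            if r["passed_at"] is not None and r["passed_at"] <= k) / len(results)
        for k in range(max_iters + 1)
    ]
```

Per‑language pass rates fall out by filtering `results` on the `"lang"` field before calling `pass_rate`.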
Results & Findings
| Model | Dafny Pass Rate | Verus Pass Rate | Lean Pass Rate |
|---|---|---|---|
| Gemini‑3 Flash | 40.3 % | 24.7 % | 7.8 % |
| GPT‑OSS (4‑shot) | 28.1 % | 12.4 % | 3.2 % |
| Claude‑2 | 22.5 % | 9.8 % | 2.9 % |
Key observations
- Dafny shines – its high‑level abstractions and built‑in SMT automation let models concentrate on logical correctness rather than low‑level proof details.
- Verus performance drops – models struggle with the systems‑level memory model and more intricate ownership annotations.
- Lean lags far behind – the need for explicit, step‑by‑step proof terms overwhelms current LLMs.
- Iterative repair matters – Gemini‑3’s Dafny pass rate triples after three repair cycles, whereas GPT‑OSS plateaus after the first iteration.
- Error patterns differ – in Dafny most failures are logical (incorrect invariants), while in Verus and Lean many attempts abort early due to syntax or type‑checking errors that the model cannot recover from.
Practical Implications
- Tool selection for AI‑assisted development – If you need AI‑generated, verified code today, Dafny (or similar SMT‑backed languages) offers the most realistic path.
- Designing verification languages – The benchmark suggests that language designers should provide higher‑level proof automation if they want to leverage LLMs effectively.
- Iterative “repair‑loop” pipelines – Building a feedback loop that feeds verifier errors back to the model can dramatically improve success; this pattern could be baked into IDE extensions or CI pipelines.
- Benchmark‑driven product roadmaps – Companies building AI coding assistants can use AlgoVeri to pinpoint where their models need better reasoning about memory safety, ownership, or explicit proof construction.
- Educational tooling – AlgoVeri can serve as a training ground for students learning both formal verification and prompt engineering, highlighting the interplay between specification clarity and model performance.
Limitations & Future Work
- Scope of algorithms – While the benchmark includes 77 classic algorithms, they represent only a small, well‑studied subset. Real‑world codebases contain far more complex data structures and concurrency patterns.
- Language coverage – The study evaluates only three verification languages. Extending the benchmark to Coq, Isabelle, or newer Rust‑based verification frameworks would broaden insights.
- Model size & prompting – Experiments used a handful of publicly known models; newer, larger, or instruction‑tuned models could shift the performance landscape.
- Repair budget – The number of allowed repair iterations was fixed. Exploring adaptive budgets or more sophisticated error‑analysis could yield higher success rates.
- Human‑in‑the‑loop evaluation – The current metric is binary (pass/fail). Future work could assess the quality of generated proofs, readability, and maintainability from a developer’s perspective.
AlgoVeri opens the door to systematic, cross‑language assessment of AI‑driven verified code generation, giving developers a concrete way to gauge how close we are to truly trustworthy, automatically produced software.
Authors
- Haoyu Zhao
- Ziran Yang
- Jiawei Li
- Deyuan He
- Zenan Li
- Chi Jin
- Venugopal V. Veeravalli
- Aarti Gupta
- Sanjeev Arora
Paper Information
| Item | Details |
|---|---|
| arXiv ID | 2602.09464v1 |
| Categories | cs.SE, cs.AI, cs.CL |
| Published | February 10, 2026 |