[Paper] AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms

Published: February 10, 2026
Source: arXiv (2602.09464v1)

Overview

The paper introduces AlgoVeri, a new benchmark that lets researchers and engineers compare how well AI systems can generate formally verified code for classic algorithms across three verification languages: Dafny, Verus, and Lean. By standardising the functional contracts for 77 well‑known algorithms, the authors expose where current models succeed or stumble, offering a clear picture of the state of “vericoding” – the automatic creation of provably correct software.

Key Contributions

  • Unified benchmark covering 77 classical algorithms with identical specifications in Dafny, Verus, and Lean.
  • Cross‑paradigm evaluation methodology that makes performance numbers directly comparable across languages and tools.
  • Comprehensive empirical study of several frontier LLMs (e.g., Gemini‑3 Flash, GPT‑OSS), revealing large gaps between verification systems.
  • Dynamic analysis of test‑time compute, showing how iterative “repair” loops can dramatically improve success rates for some models.
  • Error taxonomy linking language‑design choices (SMT‑driven automation vs. explicit proof construction) to the types of failures AI models encounter.
  • Open‑source release of data, prompts, and evaluation scripts (GitHub link in the paper).

Methodology

  1. Algorithm selection – 77 staple algorithms (sorting, graph traversal, number‑theoretic routines, etc.) were chosen for their pedagogical value and well‑studied verification patterns.

  2. Specification unification – For each algorithm, the authors wrote a single functional contract (pre‑conditions, post‑conditions, and invariants) and then translated it verbatim into the syntax of Dafny, Verus, and Lean, ensuring a fair “apples‑to‑apples” comparison.

  3. Model prompting – LLMs were prompted with the specification and a brief description of the target language, then asked to generate the full implementation plus any required proof annotations.

  4. Iterative repair – After an initial generation, the system automatically fed back compiler/verification errors to the model, allowing it to produce a corrected version (up to a fixed number of iterations).

  5. Success criteria – An attempt is counted as successful if the generated code compiles and the verifier proves the contract without manual intervention.

  6. Metrics – Recorded metrics include:

    • Overall pass rate
    • Per‑language pass rate
    • Compute‑time curves (how success improves with more repair iterations)
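Steps 3–6 above can be sketched as a single loop. The following is a minimal, hypothetical sketch of such a repair loop; `generate` and `verify` stand in for an LLM call and a Dafny/Verus/Lean verifier invocation, and their names and signatures are assumptions for illustration, not the authors' actual harness.

```python
# Hypothetical sketch of the iterative generate-verify-repair loop.
# `generate` and `verify` are stand-ins for an LLM call and a verifier
# invocation; both are assumed interfaces, not the paper's actual API.

from typing import Callable, Optional, Tuple

def repair_loop(
    spec: str,
    generate: Callable[[str, Optional[str]], str],  # (spec, last_errors) -> candidate code
    verify: Callable[[str], Tuple[bool, str]],      # candidate -> (verified?, error log)
    max_iters: int = 3,
) -> Tuple[bool, str, int]:
    """Generate code from a spec, feeding verifier errors back, up to max_iters."""
    errors: Optional[str] = None
    candidate = ""
    for i in range(1, max_iters + 1):
        candidate = generate(spec, errors)
        ok, errors = verify(candidate)
        if ok:
            # Success criterion: the verifier proves the contract
            # with no manual intervention.
            return True, candidate, i
    return False, candidate, max_iters
```

The fixed `max_iters` budget corresponds to the paper's capped number of repair iterations; the compute-time curves are obtained by recording the iteration index `i` at which each attempt first verifies.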

Results & Findings

Model | Dafny pass rate | Verus pass rate | Lean pass rate
Gemini‑3 Flash | 40.3 % | 24.7 % | 7.8 %
GPT‑OSS (4‑shot) | 28.1 % | 12.4 % | 3.2 %
Claude‑2 | 22.5 % | 9.8 % | 2.9 %

Key observations

  • Dafny shines – its high‑level abstractions and built‑in SMT automation let models concentrate on logical correctness rather than low‑level proof details.
  • Verus performance drops – models struggle with the systems‑level memory model and more intricate ownership annotations.
  • Lean lags far behind – the need for explicit, step‑by‑step proof terms overwhelms current LLMs.
  • Iterative repair matters – Gemini‑3’s Dafny pass rate triples after three repair cycles, whereas GPT‑OSS plateaus after the first iteration.
  • Error patterns differ – in Dafny most failures are logical (incorrect invariants), while in Verus and Lean many attempts abort early due to syntax or type‑checking errors that the model cannot recover from.
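The error-pattern observation above suggests a simple way to reproduce the taxonomy: bucket verifier logs by keyword. The sketch below is purely illustrative; the keyword heuristics are assumptions, not the authors' actual classification rules.

```python
# Rough sketch of bucketing verifier/compiler output into coarse failure
# categories (logical vs. syntax vs. type-checking). The keyword lists are
# illustrative assumptions, not the paper's taxonomy definition.

def classify_failure(log: str) -> str:
    """Map a verifier/compiler log to a coarse failure category."""
    text = log.lower()
    if any(k in text for k in ("parse error", "syntax", "unexpected token")):
        return "syntax"
    if any(k in text for k in ("type mismatch", "type error", "cannot infer")):
        return "type-checking"
    if any(k in text for k in ("invariant", "postcondition", "assertion")):
        return "logical"  # e.g. an invariant that "might not hold"
    return "other"

def tally(logs):
    """Count failures per category across a batch of attempt logs."""
    counts: dict = {}
    for log in logs:
        cat = classify_failure(log)
        counts[cat] = counts.get(cat, 0) + 1
    return counts
```

Under this kind of bucketing, the paper's finding would show up as Dafny logs landing mostly in the "logical" bucket, and Verus/Lean logs landing disproportionately in "syntax" and "type-checking".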

Practical Implications

  • Tool selection for AI‑assisted development – If you need AI‑generated, verified code today, Dafny (or similar SMT‑backed languages) offers the most realistic path.
  • Designing verification languages – The benchmark suggests that language designers should provide higher‑level proof automation if they want to leverage LLMs effectively.
  • Iterative “repair‑loop” pipelines – Building a feedback loop that feeds verifier errors back to the model can dramatically improve success; this pattern could be baked into IDE extensions or CI pipelines.
  • Benchmark‑driven product roadmaps – Companies building AI coding assistants can use AlgoVeri to pinpoint where their models need better reasoning about memory safety, ownership, or explicit proof construction.
  • Educational tooling – AlgoVeri can serve as a training ground for students learning both formal verification and prompt engineering, highlighting the interplay between specification clarity and model performance.
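A repair-loop pipeline of the kind suggested above needs one concrete ingredient: machine-readable verifier errors. The sketch below shows one way to get them for Dafny; the `dafny verify` invocation and the `file(line,col): Error: message` output format are assumptions based on typical Dafny CLI behaviour, so adapt both to your toolchain.

```python
# Minimal sketch of the CI side of a repair-loop pipeline: run the verifier
# and extract (line, message) pairs to feed back to the model. The CLI
# invocation and error-line format are assumed, not taken from the paper.

import re
import subprocess
from typing import List, Tuple

# Assumed error-line shape: "sort.dfy(12,3): Error: ..."
ERROR_LINE = re.compile(r"^(?P<file>.+?)\((?P<line>\d+),(?P<col>\d+)\): Error: (?P<msg>.+)$")

def parse_verifier_errors(output: str) -> List[Tuple[int, str]]:
    """Extract (line number, message) pairs from verifier output."""
    errors = []
    for line in output.splitlines():
        m = ERROR_LINE.match(line)
        if m:
            errors.append((int(m.group("line")), m.group("msg")))
    return errors

def run_dafny(path: str) -> List[Tuple[int, str]]:
    """Verify a Dafny file; returns [] on success, else errors for the model."""
    proc = subprocess.run(["dafny", "verify", path], capture_output=True, text=True)
    return parse_verifier_errors(proc.stdout + proc.stderr)
```

In an IDE extension or CI job, a non-empty result from `run_dafny` would be serialised into the next model prompt, closing the feedback loop.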

Limitations & Future Work

  • Scope of algorithms – While the benchmark includes 77 classic algorithms, they represent only a small, well‑studied subset. Real‑world codebases contain far more complex data structures and concurrency patterns.
  • Language coverage – The study evaluates only three verification languages. Extending the benchmark to Coq, Isabelle, or newer Rust‑based verification frameworks would broaden insights.
  • Model size & prompting – Experiments used a handful of publicly known models; newer, larger, or instruction‑tuned models could shift the performance landscape.
  • Repair budget – The number of allowed repair iterations was fixed. Exploring adaptive budgets or more sophisticated error‑analysis could yield higher success rates.
  • Human‑in‑the‑loop evaluation – The current metric is binary (pass/fail). Future work could assess the quality of generated proofs, readability, and maintainability from a developer’s perspective.

AlgoVeri opens the door to systematic, cross‑language assessment of AI‑driven verified code generation, giving developers a concrete way to gauge how close we are to truly trustworthy, automatically produced software.

Authors

  • Haoyu Zhao
  • Ziran Yang
  • Jiawei Li
  • Deyuan He
  • Zenan Li
  • Chi Jin
  • Venugopal V. Veeravalli
  • Aarti Gupta
  • Sanjeev Arora

Paper Information

arXiv ID: 2602.09464v1
Categories: cs.SE, cs.AI, cs.CL
Published: February 10, 2026