[Paper] VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean

Published: February 20, 2026
Source: arXiv – 2602.18307v1

Overview

The paper introduces VeriSoftBench, a new benchmark suite of 500 Lean 4 proof obligations harvested from real‑world open‑source software‑verification projects. Unlike existing LLM‑driven theorem‑proving benchmarks that focus on pure mathematics (e.g., Mathlib), this dataset preserves the full repository context—cross‑file imports, custom libraries, and deep dependency graphs—making it a more realistic testbed for both language‑model‑based and traditional automated provers.
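To make the distinction concrete, a repository-scale obligation might look like the following Lean 4 sketch. The names (`RingBuffer`, `push_preserves_cap`) are hypothetical illustrations, not tasks from the benchmark; the point is that the statement depends on project-local definitions rather than Mathlib, so a prover needs the surrounding repository context to even type-check it:

```lean
-- Project-local definitions that the proof obligation depends on.
structure RingBuffer (α : Type) where
  data : List α
  cap  : Nat

def RingBuffer.push (b : RingBuffer α) (x : α) : RingBuffer α :=
  { b with data := (x :: b.data).take b.cap }

-- The benchmark-style task: the statement is given, and the prover
-- must replace `sorry` with a proof that type-checks.
theorem push_preserves_cap (b : RingBuffer α) (x : α) :
    (b.push x).data.length ≤ b.cap := by
  sorry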

Key Contributions

  • Repository‑scale benchmark: 500 Lean 4 verification tasks drawn from diverse, publicly available formal‑methods codebases.
  • Context‑preserving packaging: Each task is bundled with its exact transitive dependency closure, enabling reproducible experiments that reflect real development environments.
  • Comprehensive evaluation: Frontier large language models (e.g., GPT‑4, Claude) and specialized provers are benchmarked, revealing systematic gaps when moving from Mathlib‑style math to software‑verification code.
  • Empirical insights: Three core observations about transferability, dependency complexity, and the value of curated context for proof automation.
  • Open‑source release: Benchmark data, evaluation scripts, and baseline results are publicly available on GitHub.

Methodology

  1. Data collection – The authors mined popular Lean 4 verification repositories (e.g., verification of compilers, operating‑system kernels, and security‑critical libraries). They extracted individual theorem/lemma statements that were not already proved automatically.
  2. Dependency extraction – For each proof obligation, a static analysis pass identified all imported modules, definitions, and lemmas required to type‑check the statement. The full transitive closure was packaged as a self‑contained Lean project.
  3. Benchmark design – Tasks were split into three difficulty tiers based on the size of their dependency closure (small, medium, large).
  4. Evaluation pipeline
    • LLM‑based provers: Prompts were constructed to feed the target statement plus either the whole repository, the curated dependency closure, or no extra context. The LLM generated proof scripts which were then type‑checked by Lean.
    • Specialized provers: Lean’s built‑in automation tactics (e.g., simp, omega) and external SAT/SMT back‑ends were run with identical context settings.
  5. Metrics – Success rate (proof fully type‑checked), time‑to‑proof, and token usage for LLM prompts were recorded.
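The dependency-extraction step (step 2) can be sketched as a reachability computation over the repository's import graph. This is a minimal illustration under assumed names; the paper's actual tooling (released on GitHub) operates on Lean modules and may differ in detail:

```python
from collections import deque

def transitive_closure(import_graph: dict[str, list[str]], root: str) -> set[str]:
    """Return every module reachable from `root` via imports (BFS)."""
    seen = {root}
    queue = deque([root])
    while queue:
        mod = queue.popleft()
        for dep in import_graph.get(mod, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

# Toy repository: the target lemma's file imports a custom library,
# which in turn pulls in a lower-level utility module.
graph = {
    "Kernel.Sched.Lemmas": ["Kernel.Sched.Defs", "Util.Order"],
    "Kernel.Sched.Defs":   ["Util.Order"],
    "Util.Order":          [],
    "Unrelated.Parser":    [],  # not reachable from the task, so excluded
}

closure = transitive_closure(graph, "Kernel.Sched.Lemmas")
print(sorted(closure))
```

The curated-closure evaluation setting feeds only the files in `closure` to the prover, whereas the full-repo setting would also include unreachable modules such as `Unrelated.Parser`.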

Results & Findings

| Setting | Success rate (overall) | Small deps | Large deps |
|---|---|---|---|
| Mathlib‑tuned provers (baseline) | 12 % | 22 % | 4 % |
| Frontier LLM (full repo) | 18 % | 30 % | 6 % |
| Frontier LLM (curated closure) | 27 % | 38 % | 12 % |
| Specialized non‑LLM provers (full repo) | 15 % | 25 % | 5 % |
  • Transfer gap: Tools that excel on Mathlib proofs drop dramatically when faced with repository‑centric verification tasks.
  • Dependency depth matters: Proofs requiring many transitive imports are solved at roughly one‑third the rate of those with small dependency closures.
  • Curated context helps: Supplying only the minimal dependency closure boosts LLM performance by ~50 % relative to the full repository, but the absolute success rate remains modest, indicating ample room for smarter context selection or reasoning mechanisms.
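The "~50 % relative" figure follows directly from the table: for the frontier LLM, moving from the full repository (18 %) to the curated closure (27 %) is a 9‑point absolute gain, which is half of the baseline rate:

```python
# Sanity check on the headline numbers from the results table.
full_repo = 0.18   # frontier LLM, full-repo context
curated   = 0.27   # frontier LLM, curated dependency closure

relative_gain = (curated - full_repo) / full_repo
print(f"{relative_gain:.0%}")  # 50%
```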

Practical Implications

  • Tooling for industry‑scale verification: Companies building formally verified software (e.g., safety‑critical drivers, cryptographic libraries) can no longer rely on “Mathlib‑only” benchmarks to gauge the readiness of LLM‑assisted proof automation. VeriSoftBench offers a realistic yardstick.
  • Context management APIs: The findings suggest that IDEs and CI pipelines should incorporate automated dependency‑pruning utilities that feed provers only the necessary Lean files, reducing token costs for LLM APIs and improving latency.
  • Hybrid prover architectures: The modest gains from curated context hint that a two‑stage approach—first a fast static analyzer to compute a minimal closure, then an LLM or SMT backend—could become a standard pattern in verification pipelines.
  • Benchmark‑driven research: Researchers developing new prompting strategies, retrieval‑augmented generation, or domain‑specific fine‑tuning can now evaluate against a dataset that mirrors the complexity of production codebases, accelerating progress toward usable verification assistants.
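The two‑stage hybrid pattern suggested above can be sketched as a prune‑then‑cascade pipeline: a static analyzer computes the minimal closure, then cheap automation is tried before escalating to an expensive backend. All function names here are hypothetical, not an API from the paper:

```python
from typing import Callable, Optional

def hybrid_prove(
    statement: str,
    context_files: list[str],
    prune: Callable[[str, list[str]], list[str]],
    provers: list[Callable[[str, list[str]], Optional[str]]],
) -> Optional[str]:
    """Return a proof script, or None if every backend fails."""
    minimal = prune(statement, context_files)  # stage 1: minimal closure
    for prover in provers:                     # stage 2: cheap -> expensive
        proof = prover(statement, minimal)
        if proof is not None:
            return proof
    return None

# Toy backends: a fast tactic-style prover that only handles trivial
# reflexivity goals, and a fallback standing in for an LLM call.
fast = lambda stmt, ctx: "by rfl" if stmt.endswith("= x") else None
slow = lambda stmt, ctx: f"-- LLM-generated proof using {len(ctx)} files"

proof = hybrid_prove(
    "theorem t (x : Nat) : x = x",
    ["Defs.lean"],
    prune=lambda s, fs: fs,  # identity pruner for this toy example
    provers=[fast, slow],
)
print(proof)  # by rfl
```

Ordering the cascade cheapest-first means the expensive backend only pays for goals the fast automation cannot close, which is exactly where the benchmark shows curated context matters most.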

Limitations & Future Work

  • Domain coverage: While 500 tasks span several verification projects, the benchmark still leans toward academic open‑source repos; industrial proprietary code may exhibit even richer dependency graphs.
  • Static dependency extraction: The current approach assumes all needed imports are statically discoverable; dynamic meta‑programming patterns in Lean could hide additional requirements.
  • LLM prompting scope: Only a handful of leading LLMs were tested; future work could explore chain‑of‑thought prompting, tool‑use extensions (e.g., calling Lean’s simp automatically), or fine‑tuning on verification‑specific corpora.
  • Metric breadth: Success rate is binary; richer metrics such as proof length, human‑readability, or downstream maintenance cost were not measured.

The authors plan to expand VeriSoftBench with more projects, add multi‑language support (e.g., Coq, Isabelle), and provide a leaderboard to track progress across the community.

Authors

  • Yutong Xin
  • Qiaochu Chen
  • Greg Durrett
  • Işil Dillig

Paper Information

  • arXiv ID: 2602.18307v1
  • Categories: cs.SE, cs.CL, cs.LG, cs.PL
  • Published: February 20, 2026