[Paper] Crash-free Deductive Verifiers

Published: April 21, 2026 at 09:29 AM EDT
4 min read
Source: arXiv - 2604.19448v1

Overview

Deductive verifiers are becoming powerful enough to be used beyond their original developer circles, but they still suffer from reliability problems that can scare off new users. This paper proposes a lightweight, practical solution: fuzz testing the verifiers themselves. By treating a verifier as a black‑box program and feeding it a stream of automatically generated, malformed, or edge‑case inputs, the authors show how to surface crashes, misleading error messages, and other robustness issues before they reach end‑users.

Key Contributions

  • Fuzz‑testing framework for deductive verifiers – a concrete methodology that adapts classic coverage‑guided fuzzing to the peculiarities of verification tools.
  • Prototype implementation (AValAnCHE) – integrated with the VerCors verifier, demonstrating that fuzzing can be added with modest engineering effort.
  • Empirical evaluation – the prototype uncovered multiple bugs in VerCors (including crashes, incorrect diagnostics, and hidden unsoundness) and showed that the same approach works on other state‑of‑the‑art verifiers.
  • Guidelines for practitioners – a set of best practices for instrumenting, seeding, and interpreting fuzzing results for verification tools.

Methodology

  1. Input Model – The verifier’s command‑line interface (or API) is treated as the fuzzing entry point. Inputs consist of program files, annotation files, and configuration flags.
  2. Instrumentation – Minimal wrappers log crashes, timeouts, and abnormal exit codes without needing to modify the verifier’s internals.
  3. Seed Corpus – A small collection of real verification tasks (e.g., small Java/Scala programs with JML/VerCors annotations) seeds the fuzzer.
  4. Mutation Engine – Standard coverage‑guided fuzzers (e.g., AFL, libFuzzer) are extended to understand the syntax of source files, allowing them to inject syntactic noise, delete annotations, or scramble command‑line options.
  5. Oracle – The “oracle” is simple: any verifier crash, assertion failure inside the verifier, or an unexpected “verification succeeded/failed” message is flagged for manual inspection.
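Step 4's syntax-aware mutation can be illustrated with a few line-level edits. The sketch below is a minimal, hypothetical mutator in that spirit (not the paper's implementation): it deletes annotation lines, duplicates lines, or swaps lines, so that more mutants survive parsing than with raw byte flipping. The JML-style annotation markers are an assumption for illustration.

```python
import random

# Hypothetical syntax-aware mutator in the spirit of step 4; all names
# and markers are illustrative, not taken from the paper's tooling.
ANNOTATION_MARKERS = ("//@", "/*@")  # JML-style annotation prefixes (assumed)

def mutate(source: str, rng: random.Random) -> str:
    lines = source.splitlines()
    if not lines:
        return source
    choice = rng.randrange(3)
    if choice == 0:
        # Delete a random annotation line ("delete annotations").
        ann = [i for i, l in enumerate(lines)
               if l.lstrip().startswith(ANNOTATION_MARKERS)]
        if ann:
            del lines[rng.choice(ann)]
    elif choice == 1:
        # Duplicate a random line, injecting syntactic noise.
        i = rng.randrange(len(lines))
        lines.insert(i, lines[i])
    else:
        # Swap two random lines, scrambling program structure.
        i, j = rng.randrange(len(lines)), rng.randrange(len(lines))
        lines[i], lines[j] = lines[j], lines[i]
    return "\n".join(lines)
```

In a real campaign these mutations would be driven by coverage feedback from the underlying fuzzer (e.g., AFL or libFuzzer) rather than by a bare random generator.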

The process is fully automated: the fuzzer runs continuously, reports crashes, and the developers triage the findings.

Results & Findings

  • Crash detection – AValAnCHE triggered 12 distinct crashes in VerCors across a 48‑hour fuzzing campaign, many of which were reproducible with a single‑line test case.
  • Misleading diagnostics – The fuzzer exposed 7 cases where VerCors reported a spurious verification failure for a correct program, revealing flaws in the handling of certain annotation patterns.
  • Cross‑tool applicability – Applying the same harness to the VeriFast and KeY verifiers uncovered 3 additional bugs (2 crashes, 1 spurious error), confirming that the approach is not verifier‑specific.
  • Low overhead – Instrumentation added < 5 % runtime overhead, and the fuzzing campaign required only modest compute resources (a single modern CPU core).

These findings demonstrate that even mature verifiers can harbor critical robustness issues that are invisible to their own test suites.

Practical Implications

  • Higher confidence for adopters – Organizations can integrate fuzzing into their CI pipelines for verification tools, catching regressions before they affect developers.
  • Better user experience – By eliminating crashes and improving error messages, verifiers become more approachable for engineers who are not verification experts.
  • Accelerated tool evolution – Early detection of edge‑case bugs reduces the time developers spend debugging their own verification pipelines, allowing them to focus on extending the verifier’s capabilities.
  • Standardized quality gate – The community could adopt fuzz‑testing as a de facto quality gate, similar to how compilers are routinely fuzzed, raising the overall reliability of the verification ecosystem.

Limitations & Future Work

  • Coverage gaps – Fuzzing explores only the exercised code paths; deep semantic bugs that require specific logical inputs may remain hidden.
  • Oracle simplicity – The current oracle flags any crash or unexpected result, but more nuanced correctness checks (e.g., comparing against a trusted baseline) are needed for thorough soundness validation.
  • Scalability to large codebases – While effective on small benchmark programs, scaling the approach to massive industrial codebases may require smarter seeding strategies and domain‑specific mutators.
  • Tool‑specific adaptations – Each verifier has its own input language and configuration quirks; future work should provide a reusable library that abstracts these differences, making fuzzing plug‑and‑play across the verification landscape.

Authors

  • Wander Nauta
  • Marcus Gerhold
  • Marieke Huisman

Paper Information

  • arXiv ID: 2604.19448v1
  • Categories: cs.SE
  • Published: April 21, 2026