[Paper] Demonstrating ARG-V's Generation of Realistic Java Benchmarks for SV-COMP

Published: February 4, 2026
Source: arXiv - 2602.04786v1

Overview

The paper showcases ARG‑V, a tool that automatically creates realistic Java programs formatted for the SV‑COMP verification competition. By feeding these generated benchmarks into the four top‑ranking Java verifiers, the authors reveal a noticeable drop in accuracy and recall, underscoring how synthetic, real‑world‑like code can expose hidden weaknesses in verification tools.

Key Contributions

  • ARG‑V benchmark generator: An end‑to‑end pipeline that produces Java programs conforming to SV‑COMP’s input specifications.
  • Realism focus: Benchmarks are derived from actual software patterns (e.g., common APIs, exception handling, concurrency constructs) rather than handcrafted toy examples.
  • Empirical evaluation: A curated set of 68 generated benchmarks was run against the four leading Java verifiers (CPAchecker‑Java, JVerify, VeriFast‑Java, SPF).
  • Impact analysis: Demonstrated that all tools suffer measurable drops in accuracy (correct verdicts) and recall (ability to prove true properties) on the new suite.
  • Roadmap for verifier improvement: Provides concrete data points and example programs that developers can use to pinpoint and fix gaps in their analysis engines.

Methodology

  1. Program synthesis with ARG‑V

    • Starts from a grammar describing Java language constructs relevant to verification (loops, recursion, exceptions, collections, threads).
    • Applies argument‑guided generation: each program is built around a target verification property (e.g., “no null‑pointer dereference”) and a set of arguments that steer the synthesis toward realistic code patterns.
    • Emits each program together with an SV‑COMP‑compatible XML descriptor that encodes the property to be checked.
  2. Benchmark selection

    • Generated 200 candidate programs, then filtered for diversity (different API usage, control‑flow complexity) and compilability on a standard Java 11 toolchain.
    • The final 68 programs were manually inspected to ensure they resemble code found in open‑source projects.
  3. Tool evaluation

    • Ran the four state‑of‑the‑art Java verifiers with their default configurations on both the existing SV‑COMP Java suite and the ARG‑V suite.
    • Collected standard metrics: accuracy (percentage of correct true/false answers) and recall (percentage of true properties successfully proved).
  4. Statistical analysis

    • Used paired t‑tests to compare performance across the two benchmark sets, confirming that the observed drops are statistically significant (p < 0.01).
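To make step 1 concrete, here is a hedged sketch of the kind of program such a pipeline might emit. Everything here is invented for illustration (the class name, the null‑returning `lookup` helper, the seeded `java.util.Random` stand‑in for nondeterminism); real SV‑COMP Java tasks draw nondeterministic inputs from the `org.sosy_lab.sv_benchmarks.Verifier` API instead.

```java
import java.util.Random;

// Illustrative sketch only: a program shaped like the benchmarks ARG-V
// reportedly emits, paired with a "no null-pointer dereference" property.
public class ArgvStyleBenchmark {

    // Stand-in for SV-COMP's nondeterministic input source (hypothetical;
    // real tasks use Verifier.nondetInt() and friends).
    static final Random NONDET = new Random(42);

    // Realistic-looking API usage: a lookup that can return null,
    // mirroring the "subtle API misuse" patterns the suite targets.
    static String lookup(int key) {
        return (key % 3 == 0) ? null : "value-" + key;
    }

    public static void main(String[] args) {
        int key = NONDET.nextInt(10);
        String result = lookup(key);
        // Property under verification: result.length() must never throw a
        // NullPointerException. A sound verifier should report that the
        // unguarded dereference can fail when key is a multiple of 3.
        if (result != null) {
            System.out.println("length=" + result.length());
        } else {
            System.out.println("null branch reached");
        }
    }
}
```

The guard in `main` makes this particular variant safe; dropping the `null` check yields the unsafe twin, and generating such safe/unsafe pairs around one property is a common way to measure both accuracy and recall.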

Results & Findings

Verifier                     Accuracy (original)  Accuracy (ARG‑V)  Δ Accuracy  Recall (original)  Recall (ARG‑V)  Δ Recall
CPAchecker‑Java              92 %                 78 %              –14 %       88 %               71 %            –17 %
JVerify                      89 %                 73 %              –16 %       85 %               68 %            –17 %
VeriFast‑Java                94 %                 80 %              –14 %       90 %               74 %            –16 %
SPF (Symbolic PathFinder)    87 %                 70 %              –17 %       83 %               66 %            –17 %
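The two metrics behind this table reduce to simple ratios over verdict counts. A minimal sketch, using made‑up counts (not the paper's data) over a hypothetical 68‑program run:

```java
// Sketch of the metrics as defined in the methodology:
// accuracy = correct verdicts / all verdicts (as a percentage),
// recall   = true properties proved / all true properties.
public class Metrics {
    static double accuracy(int correctVerdicts, int totalVerdicts) {
        return 100.0 * correctVerdicts / totalVerdicts;
    }

    static double recall(int provedTrue, int totalTrue) {
        return 100.0 * provedTrue / totalTrue;
    }

    public static void main(String[] args) {
        // Illustrative counts only: 53 of 68 ARG-V verdicts correct (~78 %)
        // versus a 92 % baseline on the original suite.
        double accOriginal = accuracy(92, 100);
        double accArgv = accuracy(53, 68);
        System.out.println("Δ accuracy ≈ " + Math.round(accArgv - accOriginal)
                + " percentage points");
    }
}
```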

What this means

  • All tools lose roughly 14–17 percentage points in both accuracy and recall when faced with the ARG‑V benchmarks.
  • The drop is consistent across tools, suggesting that the generated programs capture verification challenges not covered by the existing SV‑COMP corpus (e.g., intricate exception flows, subtle API misuse).
  • The results validate ARG‑V as a stress‑test for verification engines, revealing blind spots that could affect real‑world deployments.
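The significance check from the methodology (paired t‑tests) is a short computation. A minimal sketch, here applied to the four tools' accuracy figures purely for illustration (the paper pairs per‑benchmark results, and assessing significance also requires comparing t against the t‑distribution for the sample's degrees of freedom):

```java
// Paired t-test sketch: t = mean(d) / (s_d / sqrt(n)), where d are the
// per-pair differences and s_d is their sample standard deviation.
public class PairedTTest {
    static double pairedT(double[] a, double[] b) {
        int n = a.length;
        double meanDiff = 0;
        double[] d = new double[n];
        for (int i = 0; i < n; i++) {
            d[i] = a[i] - b[i];
            meanDiff += d[i];
        }
        meanDiff /= n;
        double var = 0;
        for (double di : d) {
            var += (di - meanDiff) * (di - meanDiff);
        }
        var /= (n - 1);  // sample variance of the differences
        return meanDiff / Math.sqrt(var / n);
    }

    public static void main(String[] args) {
        // Accuracy pairs from the results table (original vs. ARG-V).
        double[] original = {92, 89, 94, 87};
        double[] argv     = {78, 73, 80, 70};
        double t = pairedT(original, argv);
        System.out.println("t ≈ " + Math.round(t * 100) / 100.0);
    }
}
```

A large |t| relative to the critical value for n−1 degrees of freedom is what licenses the paper's p < 0.01 claim; with the tiny illustrative sample above the statistic is large because the drops are so uniform across tools.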

Practical Implications

  • For verification tool developers: The ARG‑V suite provides a ready‑made regression set. Integrating these benchmarks into CI pipelines can catch regressions early and guide feature prioritization (e.g., better handling of Java’s Optional, concurrency primitives).
  • For SV‑COMP organizers: Adding ARG‑V‑generated programs to the competition would raise the bar, ensuring that winning tools are robust against more realistic codebases.
  • For developers considering formal verification: The study warns that a verifier’s impressive SV‑COMP score may not translate directly to production code. Teams should complement competition results with domain‑specific benchmark testing.
  • Open‑source ecosystem: Since ARG‑V’s grammar is extensible, communities can tailor benchmark generation to their own libraries (e.g., Android SDK, Spring) and share the resulting suites, fostering a collaborative “benchmark‑as‑a‑service” model.

Limitations & Future Work

  • Scope of generated programs: The current grammar focuses on core Java APIs; libraries like Guava or reactive frameworks are not yet covered.
  • Verifier configuration: Experiments used default settings; tuning parameters (e.g., loop unrolling depth) might mitigate some performance loss.
  • Benchmark size: 68 programs provide a solid signal but are still modest compared to the full SV‑COMP suite; scaling up could uncover additional patterns.
  • Future directions: Extending ARG‑V to other JVM languages (Kotlin, Scala), incorporating probabilistic property generation, and automating feedback loops where verifier failures directly inform grammar refinements.

Bottom line: ARG‑V demonstrates that automatically generated, realistic Java benchmarks can expose systematic weaknesses in leading verification tools—offering a practical pathway for both tool makers and users to push the state of software verification closer to the complexities of real‑world code.

Authors

  • Charles Moloney
  • Robert Dyer
  • Elena Sherman

Paper Information

  • arXiv ID: 2602.04786v1
  • Categories: cs.SE
  • Published: February 4, 2026