[Paper] Demonstrating ARG-V's Generation of Realistic Java Benchmarks for SV-COMP
Source: arXiv - 2602.04786v1
Overview
The paper showcases ARG‑V, a tool that automatically generates realistic Java programs formatted for the SV‑COMP verification competition. Running the four top‑ranking Java verifiers on these generated benchmarks reveals a noticeable drop in both accuracy and recall, underscoring how synthetic yet realistic code can expose hidden weaknesses in verification tools.
Key Contributions
- ARG‑V benchmark generator: An end‑to‑end pipeline that produces Java programs conforming to SV‑COMP’s input specifications.
- Realism focus: Benchmarks are derived from actual software patterns (e.g., common APIs, exception handling, concurrency constructs) rather than handcrafted toy examples.
- Empirical evaluation: A curated set of 68 generated benchmarks was run against the four leading Java verifiers (CPAchecker‑Java, JVerify, VeriFast‑Java, and SPF).
- Impact analysis: Demonstrated that all tools suffer measurable drops in accuracy (correct verdicts) and recall (ability to prove true properties) on the new suite.
- Roadmap for verifier improvement: Provides concrete data points and example programs that developers can use to pinpoint and fix gaps in their analysis engines.
Methodology
- Program synthesis with ARG‑V:
  - Starts from a grammar describing Java language constructs relevant to verification (loops, recursion, exceptions, collections, threads).
  - Applies argument‑guided generation: each program is built around a target verification property (e.g., “no null‑pointer dereference”) and a set of arguments that steer the synthesis toward realistic code patterns.
  - Emits each program together with an SV‑COMP‑compatible XML descriptor that encodes the property to be checked.
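To make the generation target concrete, here is a hedged sketch of what an ARG‑V‑style task for the “no null‑pointer dereference” property might look like. The `Verifier` class below is a stand‑in for SV‑COMP's `org.sosy_lab.sv_benchmarks.Verifier` harness (an assumption, stubbed so the example runs standalone); the class and method names are illustrative, not taken from the paper.

```java
import java.util.Random;

// Stand-in for SV-COMP's org.sosy_lab.sv_benchmarks.Verifier harness,
// stubbed here so the sketch compiles and runs without the real suite.
class Verifier {
    private static final Random rng = new Random();
    static int nondetInt() { return rng.nextInt(); }  // nondeterministic input
}

public class NullDerefBenchmark {
    // Returns the string length on the safe path, -1 on the null path.
    static int process(int n) {
        String s = (n % 2 == 0) ? "even" : null;
        // The guard below makes every path safe; deleting it would plant
        // a reachable null dereference for the verifier to find.
        if (s != null) {
            return s.length();
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(process(Verifier.nondetInt()));
    }
}
```

A generated task of this shape would ship alongside a descriptor stating the property to check, per the SV‑COMP input format described above.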
- Benchmark selection:
  - Generated 200 candidate programs, then filtered for diversity (different API usage, control‑flow complexity) and compilability on a standard Java 11 toolchain.
  - The final 68 programs were manually inspected to ensure they resemble code found in open‑source projects.
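The compilability check in the selection step above can be approximated with the JDK's own `javax.tools` API. The sketch below is a minimal in‑memory compile filter (the class and method names are my own, not ARG‑V's actual code):

```java
import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;
import java.net.URI;
import java.util.List;

public class CompilabilityFilter {
    // Wraps generated source text so the compiler can consume it from memory.
    static class StringSource extends SimpleJavaFileObject {
        private final String code;
        StringSource(String className, String code) {
            super(URI.create("string:///" + className + ".java"), Kind.SOURCE);
            this.code = code;
        }
        @Override public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Returns true iff the candidate program compiles on the host JDK. */
    static boolean compiles(String className, String source) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler(); // null on a bare JRE
        return compiler.getTask(
                null, null, diagnostic -> {},                    // discard diagnostics
                List.of("-d", System.getProperty("java.io.tmpdir")),
                null,
                List.of(new StringSource(className, source)))
            .call();
    }
}
```

Candidates that fail this check would be dropped before the manual diversity inspection.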
- Tool evaluation:
  - Ran the four state‑of‑the‑art Java verifiers with their default configurations on both the existing SV‑COMP Java suite and the ARG‑V suite.
  - Collected standard metrics: accuracy (percentage of correct true/false answers) and recall (percentage of true properties successfully proved).
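In these terms, both metrics reduce to simple counts over verdicts. A minimal sketch, with hypothetical verdict categories and counts of my own choosing:

```java
public class VerifierMetrics {
    // correctTrue:  tasks whose property holds and was proved by the tool
    // correctFalse: tasks whose property is violated and the violation was found
    static double accuracy(int correctTrue, int correctFalse, int totalTasks) {
        return 100.0 * (correctTrue + correctFalse) / totalTasks;
    }

    static double recall(int correctTrue, int totalTrueTasks) {
        return 100.0 * correctTrue / totalTrueTasks;
    }

    public static void main(String[] args) {
        // Hypothetical counts for one tool on a 100-task suite
        // (60 tasks with true properties, 40 with false ones).
        System.out.printf("accuracy = %.1f%%, recall = %.1f%%%n",
            accuracy(45, 30, 100), recall(45, 60));
    }
}
```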
- Statistical analysis:
  - Used paired t‑tests to compare performance across the two benchmark sets, confirming that the observed drops are statistically significant (p < 0.01).
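For intuition, the paired t statistic can be computed by hand. The sketch below pairs the per‑tool accuracy figures from the results table; the paper's test presumably pairs per‑benchmark results rather than per‑tool means, so this is illustrative only:

```java
public class PairedTTest {
    /** Paired t statistic: mean(d) / (sd(d) / sqrt(n)), sample standard deviation. */
    static double tStatistic(double[] before, double[] after) {
        int n = before.length;
        double[] d = new double[n];
        double mean = 0;
        for (int i = 0; i < n; i++) { d[i] = before[i] - after[i]; mean += d[i]; }
        mean /= n;
        double ss = 0;
        for (double di : d) ss += (di - mean) * (di - mean);
        double sd = Math.sqrt(ss / (n - 1));
        return mean / (sd / Math.sqrt(n));
    }

    public static void main(String[] args) {
        // Per-tool accuracies (original vs ARG-V suite) from the results table.
        double[] original = {92, 89, 94, 87};
        double[] argv     = {78, 73, 80, 70};
        System.out.printf("t = %.2f%n", tStatistic(original, argv));  // prints t = 20.33
    }
}
```

Even on this tiny n = 4 pairing the statistic is far above conventional significance thresholds, consistent with the paper's p < 0.01 claim on the full per‑benchmark data.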
Results & Findings
| Verifier | Accuracy (original) | Accuracy (ARG‑V) | Δ Accuracy | Recall (original) | Recall (ARG‑V) | Δ Recall |
|---|---|---|---|---|---|---|
| CPAchecker‑Java | 92 % | 78 % | –14 % | 88 % | 71 % | –17 % |
| JVerify | 89 % | 73 % | –16 % | 85 % | 68 % | –17 % |
| VeriFast‑Java | 94 % | 80 % | –14 % | 90 % | 74 % | –16 % |
| SPF (Symbolic PathFinder) | 87 % | 70 % | –17 % | 83 % | 66 % | –17 % |
What this means
- All tools lose roughly 14–17 percentage points in both accuracy and recall when faced with the ARG‑V benchmarks.
- The drop is consistent across tools, suggesting that the generated programs capture verification challenges not covered by the existing SV‑COMP corpus (e.g., intricate exception flows, subtle API misuse).
- The results validate ARG‑V as a stress‑test for verification engines, revealing blind spots that could affect real‑world deployments.
Practical Implications
- For verification tool developers: The ARG‑V suite provides a ready‑made regression set. Integrating these benchmarks into CI pipelines can catch regressions early and guide feature prioritization (e.g., better handling of Java’s `Optional` and concurrency primitives).
- For SV‑COMP organizers: Adding ARG‑V‑generated programs to the competition would raise the bar, ensuring that winning tools are robust against more realistic codebases.
- For developers considering formal verification: The study warns that a verifier’s impressive SV‑COMP score may not translate directly to production code. Teams should complement competition results with domain‑specific benchmark testing.
- Open‑source ecosystem: Since ARG‑V’s grammar is extensible, communities can tailor benchmark generation to their own libraries (e.g., Android SDK, Spring) and share the resulting suites, fostering a collaborative “benchmark‑as‑a‑service” model.
Limitations & Future Work
- Scope of generated programs: The current grammar focuses on core Java APIs; libraries like Guava or reactive frameworks are not yet covered.
- Verifier configuration: Experiments used default settings; tuning parameters (e.g., loop unrolling depth) might mitigate some performance loss.
- Benchmark size: 68 programs provide a solid signal but are still modest compared to the full SV‑COMP suite; scaling up could uncover additional patterns.
- Future directions: Extending ARG‑V to other JVM languages (Kotlin, Scala), incorporating probabilistic property generation, and automating feedback loops where verifier failures directly inform grammar refinements.
Bottom line: ARG‑V demonstrates that automatically generated, realistic Java benchmarks can expose systematic weaknesses in leading verification tools—offering a practical pathway for both tool makers and users to push the state of software verification closer to the complexities of real‑world code.
Authors
- Charles Moloney
- Robert Dyer
- Elena Sherman
Paper Information
- arXiv ID: 2602.04786v1
- Categories: cs.SE
- Published: February 4, 2026