[Paper] Teralizer: Semantics-Based Test Generalization from Conventional Unit Tests to Property-Based Tests
Source: arXiv - 2512.14475v1
Overview
The paper introduces Teralizer, a prototype that automatically turns ordinary JUnit unit tests into more general property‑based tests for Java. By analyzing the semantics of the code under test, Teralizer extracts the underlying specifications and generates jqwik property tests, aiming to close the gap between the narrow, example‑based coverage of classic unit tests and the much broader input exploration of property‑based testing.
Key Contributions
- Semantics‑driven test generalization: Unlike prior approaches that infer properties only from input‑output examples, Teralizer leverages single‑path symbolic execution to derive specifications directly from program semantics.
- End‑to‑end prototype for Java: Implements a pipeline that parses JUnit tests, performs symbolic analysis on the target method, synthesizes jqwik property tests, and integrates them back into the project.
- Empirical evaluation across three datasets:
- EvoSuite‑generated tests for EqBench and Apache Commons utilities (synthetic but realistic).
- Mature developer‑written tests from Apache Commons utilities (real‑world).
- A large‑scale scan of 632 open‑source Java projects (RepoReapers) to assess practical applicability.
- Quantitative impact on mutation scores: Shows modest but consistent improvements (1–4 pp) on synthetic datasets and a tiny gain (≈0.06 pp) on mature test suites.
- Roadmap for future research: Identifies concrete engineering hurdles (type support, static analysis precision) and outlines steps to make test generalization broadly usable.
Methodology
- Input collection – Teralizer starts with an existing JUnit test class and the corresponding production code.
- Single‑path symbolic analysis – For each test method, the tool symbolically executes the target method along the concrete execution path exercised by the test. This yields symbolic constraints on inputs and a semantic description of the observed behavior (e.g., “output equals a + b”).
- Specification extraction – The symbolic constraints are transformed into property predicates (pre‑conditions, post‑conditions, invariants).
- Property‑based test synthesis – Using the jqwik API, Teralizer generates a property test that randomly (or systematically) samples inputs satisfying the extracted predicates, then asserts the same semantic relationship discovered during the symbolic analysis step.
- Integration & validation – The generated property tests are compiled alongside the original suite and run to compute mutation scores and other coverage metrics.
The whole pipeline is automated and takes only the existing JUnit tests and the production code they exercise as input; developers do not need to write any additional property specifications.
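The steps above can be illustrated with a minimal sketch in plain Java. It is not Teralizer's output and does not use jqwik: `Pricing`, `PathSpec`, and `GeneralizationSketch` are hypothetical names, the path condition and output expression are written by hand rather than derived by symbolic execution, and a seeded `java.util.Random` loop stands in for a property-test runner.

```java
import java.util.Random;
import java.util.function.IntPredicate;
import java.util.function.IntUnaryOperator;

// Hypothetical production code; not taken from the paper.
final class Pricing {
    static int discount(int price) {
        if (price >= 100) {
            return price - 10; // path taken by the concrete test below
        }
        return price;
    }
}

// Hypothetical container for what single-path analysis would extract.
final class PathSpec {
    final IntPredicate pathCondition;      // inputs that follow the same path
    final IntUnaryOperator expectedOutput; // symbolic output on that path

    PathSpec(IntPredicate pathCondition, IntUnaryOperator expectedOutput) {
        this.pathCondition = pathCondition;
        this.expectedOutput = expectedOutput;
    }
}

final class GeneralizationSketch {
    // The original concrete test: one input, one expected value.
    static boolean concreteTest() {
        return Pricing.discount(150) == 140;
    }

    // Hand-written stand-in for the extracted specification:
    // path condition "price >= 100", output "price - 10".
    static PathSpec specForConcreteTest() {
        return new PathSpec(price -> price >= 100, price -> price - 10);
    }

    // Stand-in for a property runner: sample inputs, discard those that leave
    // the path, and compare the result with the symbolic output expression.
    // Returns how many sampled inputs were actually checked.
    static int checkProperty(PathSpec spec, long seed, int tries) {
        Random rng = new Random(seed);
        int checked = 0;
        for (int i = 0; i < tries; i++) {
            int price = rng.nextInt(10_000); // candidate input in [0, 10000)
            if (!spec.pathCondition.test(price)) {
                continue; // precondition not met, skip this sample
            }
            if (Pricing.discount(price) != spec.expectedOutput.applyAsInt(price)) {
                throw new AssertionError("counterexample: price=" + price);
            }
            checked++;
        }
        return checked;
    }

    public static void main(String[] args) {
        System.out.println("concrete test passes: " + concreteTest());
        int checked = checkProperty(specForConcreteTest(), 42L, 1_000);
        System.out.println("property checked on " + checked + " sampled inputs");
    }
}
```

In a jqwik-based suite, the sampling loop would presumably be replaced by a property method whose parameters are generated by the framework. The point of the sketch is only the shape of the transformation: literals become parameters, the path condition becomes a precondition, and the expected value becomes an expression over the inputs.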
Results & Findings
| Dataset | Baseline mutation score | After Teralizer | Δ (percentage points) |
|---|---|---|---|
| EvoSuite tests for EqBench | 78.2 % | 81.5 % | +3.3 pp |
| EvoSuite tests for Apache Commons utilities | 84.7 % | 86.9 % | +2.2 pp |
| Developer‑written Apache Commons tests | 92.4 % | 92.45 % | +0.05 pp |
| RepoReapers scan (632 projects) | – | Successful pipeline on 1.7 % of projects | – |
Key takeaways
- Semantic generalization yields measurable mutation‑score gains on automatically generated test suites, indicating that the derived properties kill mutants that the original concrete tests miss.
- Mature hand‑crafted tests already capture most useful properties, so the incremental benefit is small but still positive.
- Real‑world applicability is limited: only 1.7 % of the scanned projects could be processed end‑to‑end, mainly due to unsupported Java language features (generics, lambdas) and static analysis gaps.
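To make the mutation-score argument concrete, here is a small sketch of how a generalized property can kill a mutant that a single concrete test lets through. Everything in it is hypothetical: the `abs` method, the mutant, and the mutant counts in `main` are invented for illustration and are not data from the paper. The property enumerates a small input range instead of sampling randomly so that the run is deterministic.

```java
import java.util.function.IntUnaryOperator;

final class MutationDemo {
    // Hypothetical method under test.
    static int abs(int x) {
        return x < 0 ? -x : x;
    }

    // Hypothetical mutant: the boundary constant 0 was changed to -3.
    static int absMutant(int x) {
        return x < -3 ? -x : x;
    }

    // Concrete unit test: a single input on the "negative" path.
    static boolean concreteTestPasses(IntUnaryOperator impl) {
        return impl.applyAsInt(-5) == 5;
    }

    // Generalized property for the same path: for every x with x < 0
    // (the path condition), the result must equal -x.
    static boolean propertyHolds(IntUnaryOperator impl) {
        for (int x = -1_000; x < 0; x++) {
            if (impl.applyAsInt(x) != -x) {
                return false;
            }
        }
        return true;
    }

    // Mutation score in percent: killed mutants divided by all mutants.
    static double mutationScore(int killed, int total) {
        return 100.0 * killed / total;
    }

    public static void main(String[] args) {
        // The concrete test passes on the mutant, so the mutant survives it.
        System.out.println("concrete test passes on mutant: "
                + concreteTestPasses(MutationDemo::absMutant));
        // The property fails on the mutant (for x in -3..-1), so it is killed.
        System.out.println("property holds on mutant: "
                + propertyHolds(MutationDemo::absMutant));
        // Invented counts: 7 of 10 mutants killed before, 8 of 10 after.
        double delta = mutationScore(8, 10) - mutationScore(7, 10);
        System.out.println("delta in percentage points: " + delta);
    }
}
```

The mutant behaves correctly at the one input the concrete test happens to use, but differs at other inputs on the same path, which is exactly where the generalized property looks.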
Practical Implications
- Boosting legacy test suites: Teams can run Teralizer on existing JUnit tests to automatically generate property‑based counterparts, gaining extra confidence without writing new specifications from scratch.
- Improving mutation‑testing pipelines: Adding the generated property tests can raise mutation scores, helping developers detect weak spots in their test coverage.
- Facilitating a gradual adoption of property‑based testing: Instead of a “big‑bang” switch to jqwik or QuickCheck, developers can incrementally enrich their suites, which flattens the learning curve.
- Tooling integration opportunities: IDE plugins or CI‑CD steps could invoke Teralizer as a “test‑enhancement” stage, automatically surfacing newly discovered failing inputs for developers to review.
- Guidance for test‑generation research: The paper’s roadmap highlights concrete engineering work (e.g., richer type handling, multi‑path analysis) that could make such automation viable for large codebases.
Limitations & Future Work
- Type and language feature support: The current prototype struggles with generics, var‑args, lambdas, and certain third‑party libraries, limiting its applicability to modern Java projects.
- Single‑path analysis: Only the concrete execution path exercised by the original test is generalized; exploring multiple paths could uncover richer properties.
- Static analysis precision: Over‑approximation or missed dependencies sometimes cause the pipeline to abort.
- Scalability: The 1.7 % success rate on the RepoReapers corpus indicates that substantial engineering effort is needed before the approach can be used at scale.
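The single-path limitation can be seen in a small sketch, again with hypothetical code that is not taken from the paper. The property is generalized from one concrete test that only exercises the "light" branch, so a mutant confined to the "heavy" branch is invisible to it.

```java
import java.util.function.IntUnaryOperator;

final class SinglePathLimit {
    // Hypothetical method under test with two paths.
    static int shippingFee(int weightKg) {
        if (weightKg <= 5) {
            return 4;                  // light path, hit by the concrete test
        }
        return 4 + (weightKg - 5) * 2; // heavy path, never executed by it
    }

    // Hypothetical mutant that only changes the heavy path.
    static int shippingFeeMutant(int weightKg) {
        if (weightKg <= 5) {
            return 4;
        }
        return 4 + (weightKg - 5) * 3;
    }

    // Property generalized from the concrete test shippingFee(3) == 4:
    // path condition "weightKg <= 5", expected output 4.
    // A small range is enumerated to keep the check deterministic.
    static boolean lightPathProperty(IntUnaryOperator impl) {
        for (int w = -100; w <= 5; w++) {
            if (impl.applyAsInt(w) != 4) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The property holds on both versions, so the mutant survives.
        System.out.println("property on original: "
                + lightPathProperty(SinglePathLimit::shippingFee));
        System.out.println("property on mutant:   "
                + lightPathProperty(SinglePathLimit::shippingFeeMutant));
        // Only an input on the other path exposes the difference.
        System.out.println("fee(8) original vs mutant: "
                + shippingFee(8) + " vs " + shippingFeeMutant(8));
    }
}
```

Covering the heavy branch would need either a second concrete test on that path or an analysis that explores more than one path.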
Future work outlined by the authors includes extending symbolic execution to handle full Java semantics, integrating multi‑path exploration, and building tighter IDE/CI integrations to surface generated properties to developers in a usable form.
Authors
- Johann Glock
- Clemens Bauer
- Martin Pinzger
Paper Information
- arXiv ID: 2512.14475v1
- Categories: cs.SE
- Published: December 16, 2025