[Paper] Coverage Isn't Enough: SBFL-Driven Insights into Manually Created vs. Automatically Generated Tests

Published: December 11, 2025 at 09:07 PM EST
3 min read

Source: arXiv - 2512.11223v1

Overview

Testing is a bottleneck in modern software development, and automated test generators promise to ease the load. This paper pits manually written tests against automatically generated ones—but instead of stopping at traditional coverage numbers, the authors also measure how well each suite supports Spectrum‑Based Fault Localization (SBFL). The findings reveal a trade‑off: generated tests cover more branches, yet they are less helpful for pinpointing bugs, especially in deeply nested code.

Key Contributions

  • Dual‑metric evaluation: Introduces the SBFL score as a complementary metric to code coverage when assessing automatically generated tests.
  • Empirical comparison: Benchmarks manually created tests vs. tests produced by state‑of‑the‑art generation tools across a diverse set of open‑source projects.
  • Insight on code structure: Shows that the SBFL advantage of manual tests is modest for shallow code but widens sharply for deeply nested control flow.
  • Guidelines for test strategy: Provides concrete recommendations on how to blend manual and generated tests to maximize both coverage and fault‑localization effectiveness.

Methodology

  1. Subject programs – A curated collection of Java projects (≈30 k LOC total), ranging from simple utilities to complex libraries.
  2. Test suites
    • Manual: Existing developer‑written JUnit tests from the projects’ repositories.
    • Generated: Tests produced by two popular automated generators (e.g., EvoSuite and Randoop) using default configurations.
  3. Fault injection – Mutants are created with a mutation testing tool (e.g., PIT) to simulate realistic bugs.
  4. Metrics
    • Branch coverage: Percentage of conditional branches exercised by each suite.
    • SBFL score: Computed with a standard SBFL formula (e.g., Ochiai) that ranks statements by suspiciousness; the score reflects how quickly the true fault appears in the ranked list (a minimal sketch of the calculation follows this list).
  5. Analysis – Results are aggregated per project and stratified by nesting depth of the mutated code (shallow vs. deep). Statistical tests (Wilcoxon signed‑rank) verify significance.
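
The suspiciousness calculation at the heart of the SBFL metric is simple to state. Below is a minimal sketch of an Ochiai-style computation for one statement, assuming per-statement counts of covering passing and failing tests are already available; the class name, method, and numbers are illustrative and not taken from the paper, which relies on off-the-shelf SBFL tooling.

```java
// Minimal sketch of Ochiai suspiciousness for a single statement.
// Counts are assumed to come from a coverage matrix (which tests execute
// which statements) plus the pass/fail verdict of each test.
public class OchiaiSketch {

    /**
     * @param failedCovering failing tests that execute the statement
     * @param passedCovering passing tests that execute the statement
     * @param totalFailed    failing tests in the whole suite
     * @return suspiciousness in [0, 1]; higher means more suspicious
     */
    static double suspiciousness(int failedCovering, int passedCovering, int totalFailed) {
        double denominator = Math.sqrt((double) totalFailed * (failedCovering + passedCovering));
        return denominator == 0.0 ? 0.0 : failedCovering / denominator;
    }

    public static void main(String[] args) {
        // Illustrative numbers: 3 failing tests and 1 passing test hit the
        // faulty statement, and the suite has 3 failing tests overall.
        System.out.println(suspiciousness(3, 1, 3)); // ~0.866
    }
}
```

Statements are then sorted by descending suspiciousness, and the reported SBFL score is the position of the truly faulty statement in that ranking, so a lower value means less code to inspect.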

Results & Findings

| Metric | Manual Tests | Generated Tests |
| --- | --- | --- |
| Average branch coverage | 71% | 84% |
| Average SBFL rank (lower is better) | 3.2 | 7.9 |
| SBFL rank, deeply nested code (≥3 levels) | 2.8 | 12.4 |
| SBFL rank, shallow code (≤2 levels) | 3.5 | 6.1 |

  • Higher coverage, weaker localization – Generated tests consistently hit more branches, but their SBFL rankings are worse, meaning developers would need to examine more statements to locate a fault.
  • Nesting depth matters – The SBFL gap widens dramatically for code with deep conditional nesting; generated tests often produce many “no‑op” executions that dilute suspiciousness signals (a worked example follows this list).
  • Statistical significance – The differences are significant (p < 0.01) across all projects.
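
To make the dilution effect concrete with the Ochiai formula sketched above (the numbers here are illustrative, not taken from the paper): a faulty statement executed by 3 failing tests and 1 passing test scores 3/√(3·4) ≈ 0.87 and sits near the top of the ranking; if a generated suite adds 20 more passing tests that traverse the same statement without exposing the fault, its score drops to 3/√(3·24) ≈ 0.35, allowing statements that are merely correlated with the failures to overtake it and pushing the true fault further down the list a developer must inspect.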

Practical Implications

  • Hybrid testing pipelines – Use automated generators to boost coverage quickly, then supplement with targeted manual tests (or focused property‑based tests) around complex, nested modules to improve fault localization.
  • Tooling enhancements – Test generators could incorporate SBFL‑aware heuristics, e.g., biasing test creation toward exercising distinct execution paths in nested structures.
  • Debugging workflow – Teams relying on SBFL tools (e.g., GZoltar, FaultTracer) should not assume high coverage alone guarantees effective localization; manual test investment remains valuable for “hot‑spot” code.
  • CI/CD integration – Automated test generation can be run as a coverage‑boosting step, while a separate “localization quality gate” checks SBFL scores and flags modules that need additional handcrafted tests (a hypothetical sketch of such a gate follows this list).
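
As a concrete illustration of such a gate, the sketch below reads a CSV of SBFL ranks for seeded faults and fails the CI step when the median rank exceeds a budget. The file name, report format, and threshold are assumptions made for this example; the paper does not prescribe a specific gate implementation.

```java
// Hypothetical CI "localization quality gate".
// Assumes a report file with one header line followed by "faultId,rank" rows,
// e.g. produced by post-processing an SBFL tool's output.
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class SbflQualityGate {

    public static void main(String[] args) throws Exception {
        Path report = Path.of(args.length > 0 ? args[0] : "sbfl-ranks.csv");
        List<Integer> ranks = Files.readAllLines(report).stream()
                .skip(1)                                          // skip the header row
                .map(line -> Integer.parseInt(line.split(",")[1].trim()))
                .sorted()
                .toList();

        int median = ranks.get(ranks.size() / 2);                 // upper median is enough here
        int threshold = 5;                                        // illustrative rank budget

        System.out.printf("Median SBFL rank: %d (threshold %d)%n", median, threshold);
        if (median > threshold) {
            System.err.println("Localization gate failed: add targeted manual tests for the flagged modules.");
            System.exit(1);                                       // non-zero exit fails the CI step
        }
    }
}
```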

Limitations & Future Work

  • Tool selection – Only two generators were evaluated; results may differ with newer AI‑driven or constraint‑solving generators.
  • Mutation realism – Mutants approximate real bugs but may not capture all fault patterns seen in production.
  • Language scope – The study focuses on Java; extending to other ecosystems (e.g., JavaScript, Rust) could reveal different trade‑offs.
  • SBFL variants – Only Ochiai‑style SBFL was used; exploring other ranking formulas or hybrid fault‑localization techniques could refine the insights.

Bottom line: Automated test generation is a powerful coverage enhancer, but developers should pair it with manual testing—especially for complex, nested code—to keep fault localization sharp and debugging efficient.

Authors

  • Sasara Shimizu
  • Yoshiki Higo

Paper Information

  • arXiv ID: 2512.11223v1
  • Categories: cs.SE
  • Published: December 12, 2025