[Paper] LLM vs. Human Unit Tests: Fault Detection on Real Python Bugs

Published: June 7, 2026 at 08:01 AM EDT

Source: arXiv - 2606.08588v1

Overview

Large language models (LLMs) have shown considerable promise for automated unit test generation, yet their practical effectiveness relative to human-written tests remains poorly understood. Existing evaluations commonly rely on coverage-oriented benchmarks that do not assess fault-detection capability directly. We present an empirical comparison of LLM-generated and human-written unit tests across three complementary Python benchmarks: 29 real historical bugs from BugsInPy, a function-level benchmark drawn from python-slugify and packaging, and a controlled paired benchmark. Our generation pipeline couples Gemini 2.5 Flash with a lightweight lexical retrieval mechanism that supplies bug-relevant context at generation time. Across eight quality dimensions, LLM-generated tests with retrieval-augmented context detect faults in 69% of cases compared to 17.2% for general-purpose human-written tests (Fisher’s exact, $p < 0.001$, Cohen’s $h = 1.10$). Critically, line and branch coverage are nearly identical between the two approaches (84.8% vs. 88.5% and 75.2% vs. 82.1%), confirming that coverage is an insufficient proxy for fault-detection capability. We discuss the conditions under which each approach excels, characterize their complementary strengths, and identify the critical role of retrieval context and reproducible benchmark construction in meaningful test-quality evaluation.
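The effect size and significance test quoted above can be checked with a few lines of standard-library Python. This is a minimal sketch, not the authors' analysis script: the counts 20/29 and 5/29 are an assumption inferred from the reported percentages and the 29 BugsInPy bugs, and the helper names `cohens_h` and `fisher_exact_two_sided` are hypothetical.

```python
from math import asin, comb, sqrt


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(k: int) -> float:
        # Hypergeometric probability of k first-column outcomes in row 1.
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))


# ASSUMED counts (not stated in the abstract): 20 of 29 bugs detected by
# LLM-generated tests, 5 of 29 detected by human-written tests.
N_BUGS = 29
LLM_DETECTED = 20
HUMAN_DETECTED = 5

h = cohens_h(LLM_DETECTED / N_BUGS, HUMAN_DETECTED / N_BUGS)
p_value = fisher_exact_two_sided(
    LLM_DETECTED, N_BUGS - LLM_DETECTED,
    HUMAN_DETECTED, N_BUGS - HUMAN_DETECTED,
)
print(f"Cohen's h = {h:.2f}, Fisher's exact p = {p_value:.2g}")
```

Under these assumed counts, Cohen's h comes out at about 1.10 and the two-sided p-value falls below 0.001, which is consistent with the figures in the abstract. If the paper pooled results across benchmarks, the underlying table would differ.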

Key Contributions

This paper is filed under the arXiv category cs.SE (Software Engineering).

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

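The reported results suggest that line and branch coverage alone should not be used to judge how well a test suite detects faults, and that retrieval context matters when generating tests with an LLM. See the full paper for the conditions under which each approach excels.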

Authors

Phouvadeth Vathana
Prapti Bhatt
Rishi Patel
Nasir U. Eisty

Paper Information

arXiv ID: 2606.08588v1
Categories: cs.SE
Published: June 7, 2026
