[Paper] The Vocabulary of Flaky Tests in the Context of SAP HANA
Source: arXiv - 2602.23957v1
Overview
The paper investigates how to automatically spot flaky automated tests (tests that pass or fail nondeterministically with no change to the code under test) by mining the vocabulary used in test source code. By replicating prior work on the massive SAP HANA codebase and applying newer text‑mining and machine‑learning techniques, the authors show that high‑accuracy classifiers are attainable, but the insights they produce are hard for developers to act on.
Key Contributions
- Industrial replication of Pinto et al.’s identifier‑based flaky‑test detection on the SAP HANA project, confirming that the original approach scales to a real‑world, large‑scale codebase.
- Feature‑extraction comparison between classic TF‑IDF and TF‑IDFC‑RF (term frequency–inverse document frequency in class–relevance frequency), a supervised weighting scheme, for test‑code identifiers.
- Model evaluation using both a transformer‑based language model (CodeBERT) and a gradient‑boosted tree ensemble (XGBoost), demonstrating modest gains over the baseline.
- Empirical analysis of root causes, revealing that external data dependencies (e.g., remote services, databases) dominate the vocabulary associated with flaky tests in SAP HANA.
- Critical reflection on the practical utility of vocabulary‑based predictions, highlighting the gap between high classification scores and actionable developer guidance.
Methodology
- Data Collection – The authors gathered three labeled datasets: the original set used by Pinto et al. and two new sets extracted from SAP HANA’s test suite, each containing flaky and stable tests manually verified by engineers.
- Identifier Extraction – From each test file they parsed all source‑code identifiers (method names, variable names, class names, etc.) that could serve as textual cues.
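The paper describes the extraction step but does not publish its pipeline. As an illustration only (the function names below are hypothetical), a minimal regex‑based extractor that collects identifiers and splits them into lowercase sub‑tokens, as is common in vocabulary‑based flaky‑test work, could look like:

```python
import re

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def extract_identifiers(source: str) -> list[str]:
    """Collect raw identifiers (names, keywords included) from test source."""
    return IDENT.findall(source)

def split_subtokens(identifier: str) -> list[str]:
    """Split snake_case and camelCase identifiers into lowercase sub-tokens."""
    tokens = []
    for part in identifier.split("_"):
        # camelCase / PascalCase / ALLCAPS boundaries
        tokens += re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", part)
    return [t.lower() for t in tokens if t]

src = "def testRemoteDbConnection(): wait_for_http_service()"
idents = extract_identifiers(src)
bag = [tok for ident in idents for tok in split_subtokens(ident)]
```

A real pipeline would additionally filter language keywords and may use an AST parser instead of regexes; this sketch only shows the sub‑token idea.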
- Feature Engineering
- TF‑IDF: classic bag‑of‑words weighting.
- TF‑IDFC‑RF: extends TF‑IDF by down‑weighting terms that appear frequently across both flaky and stable tests and up‑weighting class‑specific terms.
- Model Training – Two classifiers were trained on each feature set:
- CodeBERT – a pretrained transformer fine‑tuned on the identifier sequences.
- XGBoost – a gradient‑boosted decision‑tree model that works well with sparse, high‑dimensional text features.
- Evaluation – Standard 5‑fold cross‑validation measured precision, recall, and F1‑score. Results on the original dataset were compared with those on the SAP HANA datasets to assess transferability.
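The metrics themselves are standard; a self‑contained sketch of the F1 computation and a simple (non‑stratified) k‑fold index split, with hypothetical function names, might be:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive (flaky) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def kfold_indices(n, k=5):
    """Yield (train, test) index lists for k folds (round-robin assignment)."""
    fold = [i % k for i in range(n)]
    return [([i for i in range(n) if fold[i] != j],
             [i for i in range(n) if fold[i] == j]) for j in range(k)]

p, r, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 0, 0])
folds = kfold_indices(10, k=5)
```

In practice the paper would use stratified folds so each split preserves the flaky/stable ratio; the round‑robin split above is only the minimal version of the idea.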
Results & Findings
| Dataset | Feature | Model | F1‑Score |
|---|---|---|---|
| Original (Pinto) | TF‑IDF | XGBoost | 0.94 |
| SAP HANA #1 | TF‑IDF | XGBoost | 0.92 |
| SAP HANA #2 | TF‑IDFC‑RF | CodeBERT | 0.99 |
| SAP HANA #2 | TF‑IDFC‑RF | XGBoost | 0.96 |
- Replication success: The baseline approach reproduces the original 0.94 F1‑score, confirming its validity in an industrial setting.
- Feature boost: TF‑IDFC‑RF consistently outperforms plain TF‑IDF, especially on the more heterogeneous SAP HANA data.
- Model edge: CodeBERT yields the highest score (0.99) when paired with the richer TF‑IDFC‑RF features, but XGBoost remains competitive with lower computational cost.
- Root‑cause vocabulary: Terms linked to remote dependencies (e.g., “http”, “service”, “db”) dominate the flaky‑test lexicon, mirroring earlier empirical studies.
Practical Implications
- Automated flaky‑test triage – Teams can plug a lightweight XGBoost classifier into CI pipelines to flag potentially flaky tests early, reducing noise in test reports.
- Targeted refactoring – The identified vocabulary hints that tests relying on external services are the biggest flakiness risk, encouraging developers to mock or isolate such dependencies.
- Tooling integration – Because the approach only needs static identifier extraction, it can be added to existing static analysis or test‑management tools without deep runtime instrumentation.
- Data‑driven test design – Organizations can periodically retrain the model on their own test corpus, adapting the vocabulary to project‑specific terminology and catching emerging flaky patterns.
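As one concrete illustration of the "mock or isolate external dependencies" advice (the module and names below are hypothetical, not from the paper), Python's standard `unittest.mock` can replace a remote call with a deterministic stub:

```python
import unittest
from unittest import mock

# Hypothetical code under test: fetch_config talks to a remote HTTP service,
# the pattern the paper identifies as a leading source of flakiness.
def fetch_config(client):
    return client.get("/config").json()

class StableConfigTest(unittest.TestCase):
    def test_fetch_config_without_network(self):
        # A stubbed client makes the test independent of network availability
        # and remote service state, removing the flakiness source.
        client = mock.Mock()
        client.get.return_value.json.return_value = {"retries": 3}
        self.assertEqual(fetch_config(client), {"retries": 3})
        client.get.assert_called_once_with("/config")

suite = unittest.defaultTestLoader.loadTestsFromTestCase(StableConfigTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Injecting the client as a parameter (rather than constructing it inside the test body) is what makes this isolation cheap; the same pattern applies to database handles and other remote resources.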
Limitations & Future Work
- Actionability gap – While classifiers achieve high F1‑scores, they only output a binary flaky/not‑flaky label; they do not explain why a test is flaky, limiting developer usefulness.
- Dataset bias – The study focuses on SAP HANA, a database‑intensive system; results may differ for UI‑heavy or embedded‑software projects.
- Static analysis only – Dynamic factors (timing, thread scheduling) that cause flakiness are invisible to identifier‑based models.
- Future directions suggested include: (1) enriching the model with execution‑trace features, (2) generating natural‑language explanations for flagged tests, and (3) evaluating the approach across diverse domains to assess generalizability.
Authors
- Alexander Berndt
- Zoltán Nochta
- Thomas Bach
Paper Information
- arXiv ID: 2602.23957v1
- Categories: cs.SE
- Published: February 27, 2026