[Paper] Flaky Tests in a Large Industrial Database Management System: An Empirical Study of Fixed Issue Reports for SAP HANA

Published: February 3, 2026 at 09:03 AM EST
3 min read
Source: arXiv - 2602.03556v1

Overview

The paper investigates why automated tests in SAP HANA—a massive, enterprise‑grade database system—behave inconsistently (i.e., become “flaky”). By automatically labeling thousands of issue reports with their underlying root causes, the authors reveal which kinds of flakiness dominate in a real‑world, production‑scale codebase.

Key Contributions

  • LLM‑based annotation pipeline: Introduces a lightweight method that uses large language models (LLMs) to classify issue reports into flakiness root‑cause categories without manual labeling.
  • Empirical dataset: Analyzes 559 fixed‑flakiness issue reports from SAP HANA, one of the largest industrial DBMS projects publicly studied.
  • Root‑cause distribution: Finds that concurrency‑related problems account for the largest share (≈23 %) of flaky tests, with distinct patterns across different test types (unit, integration, system).
  • Actionable insight for researchers: Shows that flakiness mitigation techniques must be evaluated across test categories, not just on a single test suite.

Methodology

  1. Data collection – The authors extracted all issue reports that explicitly mention “flaky” or related keywords and that were marked as fixed in SAP HANA’s internal tracker.
  2. Label schema – They defined a taxonomy of flakiness root causes (e.g., concurrency, timing, external services, nondeterministic data).
  3. LLM annotation – Two state‑of‑the‑art LLMs (e.g., GPT‑4‑style) were prompted to assign a root‑cause label to each report. To improve reliability, the authors measured intra‑model consistency (same model’s repeatability) and inter‑model agreement (between the two models). Disagreements were resolved by a simple majority vote.
  4. Validation – A random sample of 50 reports was manually reviewed by the authors to estimate labeling accuracy, achieving > 85 % agreement with the manual ground truth.
  5. Statistical analysis – The labeled data were aggregated by test type and root cause, and significance tests were applied to highlight notable differences.
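The annotation step (step 3 above) can be sketched in a few lines. The root-cause labels follow the paper's taxonomy, and majority voting is the paper's stated resolution rule, but the tie-breaking policy (defer to manual review) and the exact vote layout are assumptions for illustration; the real pipeline calls two LLMs where this sketch just consumes their label strings.

```python
from collections import Counter

# Root-cause taxonomy from the paper (abbreviated names are our own).
CATEGORIES = [
    "concurrency", "timing", "external-services",
    "nondeterministic-data", "environment-setup", "other",
]

def majority_label(votes):
    """Resolve several model votes on one issue report into a single label.

    Returns the most common label; a tie falls back to None so the report
    can be routed to manual review (one plausible policy, not necessarily
    the paper's exact rule).
    """
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie between top labels: defer to a human
    return counts[0][0]

# Example: two runs of model A plus one run of model B on one report.
print(majority_label(["concurrency", "concurrency", "timing"]))  # concurrency
print(majority_label(["timing", "concurrency"]))                 # None (tie)
```

Repeating the same model's run (for intra-model consistency) and comparing across models (for inter-model agreement) both reduce to collecting more entries in the `votes` list before resolving.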

Results & Findings

| Root-cause category | % of reports | Notable observations |
| --- | --- | --- |
| Concurrency | 23 % (130/559) | Most prevalent across all test types; often linked to race conditions in parallel query execution. |
| Timing / scheduler | 15 % | Caused by unreliable timers, sleep-based waits, or variable CI resource allocation. |
| External services | 12 % | Tests that depend on networked components (e.g., storage back-ends) fail intermittently. |
| Nondeterministic data | 10 % | Unfixed random seeds lead to varying query plans. |
| Environment setup | 9 % | Differences in container/VM configurations cause sporadic failures. |
| Others / misc. | 31 % | Includes legacy code quirks, unreliable mocks, and undocumented causes. |
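To make the dominant category concrete, here is a minimal, self-contained illustration (not taken from the paper or from SAP HANA code) of the kind of race condition that produces concurrency-related flakiness: two threads perform an unsynchronized read-modify-write on shared state, so a test asserting the final value can pass or fail depending on scheduling.

```python
import threading

counter = 0  # shared state with no lock protecting it

def unsafe_increment(n):
    """Increment the shared counter n times without synchronization."""
    global counter
    for _ in range(n):
        counter += 1  # read-modify-write: not atomic across threads

threads = [threading.Thread(target=unsafe_increment, args=(100_000,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# A test asserting counter == 200_000 is flaky: lost updates under some
# interleavings make the observed total nondeterministic.
print(counter)
```

The fix is deterministic synchronization (e.g., guarding the increment with a `threading.Lock`), which mirrors the paper's observation that concurrency fixes dominate the fixed-flakiness reports.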
  • Test‑type divergence: Unit tests are more prone to timing issues, while integration/system tests suffer heavily from concurrency and external‑service problems.
  • Labeling reliability: Intra‑model consistency reached 92 %, and inter‑model agreement (Cohen’s κ) was 0.78, indicating the LLM‑based approach is robust enough for large‑scale mining.
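The inter-model agreement figure above is Cohen's κ, which discounts the agreement two raters would reach by chance. A small sketch of the computation, using made-up labels (the paper's κ of 0.78 comes from its own two models' outputs, not from this data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["concurrency", "timing", "timing", "other", "concurrency"]
b = ["concurrency", "timing", "other", "other", "concurrency"]
print(round(cohens_kappa(a, b), 3))  # 0.706
```

Values above roughly 0.6 are conventionally read as substantial agreement, which is why κ = 0.78 supports the claim that the LLM labels are reliable enough for large-scale mining.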

Practical Implications

  • Prioritize concurrency fixes: Development teams working on DBMSs or any highly parallel system should invest in deterministic scheduling, lock‑free data structures, or more granular synchronization primitives.
  • Tailor flakiness detection tools: CI pipelines can be enhanced with heuristics that flag tests exhibiting the identified patterns (e.g., repeated failures under high parallel load).
  • Automated triage with LLMs: The presented annotation pipeline can be integrated into issue‑tracking workflows to auto‑categorize new flaky‑test tickets, reducing manual triage effort.
  • Test‑type aware mitigation: Teams should adopt different strategies per test tier—e.g., use mock services for unit tests, but focus on robust transaction isolation for integration tests.
  • Benchmarking flakiness solutions: Researchers and tool vendors can use the taxonomy and dataset as a benchmark to evaluate whether their fixes generalize across test categories.
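One hypothetical CI-side heuristic of the kind suggested above (not a technique from the paper): rerun a failing test on the same commit and classify it from the outcome pattern, so that mixed results are flagged as flaky rather than blocking the pipeline as a genuine failure.

```python
def classify_outcomes(outcomes):
    """Map a test's rerun outcomes on one commit to a triage verdict.

    outcomes: list of booleans (True = pass) from repeated runs of the
    same test against identical code.
    """
    if all(outcomes):
        return "pass"
    if not any(outcomes):
        return "genuine-failure"  # consistently failing: likely a real bug
    return "flaky"                # mixed results on identical code

print(classify_outcomes([True, True, True]))   # pass
print(classify_outcomes([False, False]))       # genuine-failure
print(classify_outcomes([False, True, True]))  # flaky
```

Tickets flagged `"flaky"` are exactly the inputs the paper's LLM annotation pipeline could then auto-categorize by root cause, closing the triage loop.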

Limitations & Future Work

  • Domain specificity: The study is confined to SAP HANA; results may differ for microservice‑oriented systems, mobile apps, or front‑end frameworks.
  • LLM bias: Although validation showed high agreement, the LLMs could misinterpret ambiguous issue descriptions, especially for rare root causes.
  • Static taxonomy: The root‑cause categories were predefined; emerging flakiness patterns (e.g., AI‑driven components) might not fit neatly.
  • Future directions: Extending the approach to other languages (e.g., JavaScript, Go), incorporating dynamic test‑execution data, and exploring automated remediation suggestions based on the identified root cause.

Authors

  • Alexander Berndt
  • Thomas Bach
  • Sebastian Baltes

Paper Information

  • arXiv ID: 2602.03556v1
  • Categories: cs.SE
  • Published: February 3, 2026