[Paper] Statistical-Based Metric Threshold Setting Method for Software Fault Prediction in Firmware Projects: An Industrial Experience

Published: February 6, 2026 at 11:19 AM EST
4 min read
Source: arXiv - 2602.06831v1

Overview

The paper presents a lightweight, statistically‑driven method for setting metric thresholds that can predict faulty functions in embedded firmware projects. By extracting static‑analysis metrics from one set of projects and reusing the derived thresholds on new, unrelated firmware, the authors offer an interpretable alternative to black‑box machine‑learning fault predictors—something that aligns well with safety‑critical standards such as ISO 26262.

Key Contributions

  • A repeatable threshold‑derivation process that works across independent firmware projects without retraining.
  • Statistical identification of discriminative code metrics (e.g., cyclomatic complexity, lines of code, coupling) using hypothesis testing.
  • Empirical thresholds that achieve high precision in flagging fault‑prone functions, validated on three real‑world C firmware systems.
  • A practical integration blueprint for static‑analysis tools (Coverity, Understand) within existing SQA pipelines, delivering actionable insights to developers.
  • Evidence that metric‑based prediction can meet functional‑safety compliance while remaining transparent and auditable.

Methodology

  1. Data Collection – The authors gathered source code from three industrial C‑based firmware projects and ran Coverity + Understand to compute a suite of static metrics per function (size, complexity, depth, etc.).
  2. Labeling Faulty Functions – Historical defect logs were mapped to the corresponding functions, producing a binary “faulty / clean” label.
  3. Statistical Filtering – For each metric, they performed non‑parametric hypothesis tests (Mann‑Whitney U) to check whether the distribution differs significantly between faulty and clean functions. Only metrics with a statistically significant gap were kept.
  4. Threshold Extraction – Using the empirical cumulative distribution of the selected metrics, they identified cutoff points that maximize the separation (e.g., the 75th percentile of faulty functions vs. the 25th percentile of clean ones).
  5. Cross‑Project Validation – Thresholds derived from Project A were applied to Projects B and C, and the resulting predictions were evaluated with precision, recall, and F1‑score.
  6. Interpretability Check – The final thresholds were reviewed by domain engineers to ensure they made sense in the context of firmware development practices.
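Steps 3–4 can be sketched in Python using only the standard library. The normal-approximation Mann-Whitney U test and the midpoint-of-percentiles cutoff rule below are illustrative simplifications chosen for this sketch, not the paper's exact procedure:

```python
from math import sqrt
from statistics import NormalDist, quantiles

def avg_ranks(values):
    """1-based ranks with average ranks assigned to tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the tied 1-based positions
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mannwhitney_p(faulty, clean):
    """One-sided Mann-Whitney U p-value (normal approximation) for
    the hypothesis that faulty functions have larger metric values."""
    values = list(faulty) + list(clean)
    ranks = avg_ranks(values)
    n1, n2 = len(faulty), len(clean)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return 1 - NormalDist().cdf((u1 - mu) / sigma)

def derive_threshold(faulty, clean, alpha=0.05):
    """Keep the metric only if the distributions differ significantly,
    then place a cutoff between the clean 75th and faulty 25th
    percentiles (one plausible separation rule)."""
    if mannwhitney_p(faulty, clean) >= alpha:
        return None  # metric is not discriminative; drop it
    q_clean = quantiles(clean, n=4)[2]    # clean 75th percentile
    q_faulty = quantiles(faulty, n=4)[0]  # faulty 25th percentile
    return (q_clean + q_faulty) / 2
```

For well-separated groups (e.g., faulty functions with complexity 14–22 versus clean ones at 2–7) this yields a cutoff between the two clusters; for identical distributions it returns `None` and the metric is discarded.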

Results & Findings

Metric (example)              Derived Threshold   Precision (cross-project)
Cyclomatic Complexity         > 12                0.84
Lines of Code per function    > 45                0.79
Number of Pointers            > 3                 0.81
  • High precision (≈ 80 %–85 %) in flagging fault‑prone functions, meaning most alerts correspond to real defects.
  • Recall was modest (≈ 45 %–55 %), reflecting the intentional bias toward precision to avoid overwhelming developers with false alarms.
  • Cross‑project reuse succeeded: thresholds derived from one firmware baseline retained predictive power on the other two, confirming the method’s generality.
  • Interpretability: Engineers could directly read a threshold (“if a function’s cyclomatic complexity exceeds 12, inspect it”) and act on it without needing to understand a hidden model.
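The cross-project validation (step 5) reduces to flagging every function whose metric exceeds the threshold and scoring the flags against the defect labels. A minimal sketch, with made-up metric values and labels:

```python
def evaluate(metric_values, labels, threshold):
    """Score threshold-based flagging against binary fault labels.

    metric_values: per-function metric values (e.g., cyclomatic complexity)
    labels: 1 = function was faulty, 0 = clean
    Returns (precision, recall, f1).
    """
    flagged = [v > threshold for v in metric_values]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y)
    fp = sum(1 for f, y in zip(flagged, labels) if f and not y)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Applying thresholds derived on Project A to Projects B and C is then just a matter of calling `evaluate` with each project's metric values and defect labels.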

Practical Implications

  • Immediate QA integration – Teams can embed the threshold checks into CI pipelines (e.g., as a static‑analysis gate) to catch risky functions before code review.
  • Safety‑critical compliance – Because the approach is transparent, auditors can trace why a function was flagged, satisfying ISO 26262 evidence requirements.
  • Cost‑effective defect reduction – By focusing testing and code‑review effort on the small subset of high‑risk functions, organizations can lower inspection time and reduce field failures.
  • Scalability – No need for continuous model retraining; thresholds can be refreshed periodically (e.g., quarterly) as new defect data become available.
  • Tool‑agnostic – Works with any static‑analysis suite that can export standard metrics, making it adaptable to existing toolchains (SonarQube, clang‑tidy, etc.).
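The CI-gate integration described above can be sketched as a small check over exported per-function metrics. The metric names and the shape of the per-function dictionary here are hypothetical, since each static-analysis tool exports metrics in its own format:

```python
# Example thresholds matching the paper's reported cutoffs; a real
# pipeline would load these from configuration.
THRESHOLDS = {
    "cyclomatic_complexity": 12,
    "lines_of_code": 45,
    "pointer_count": 3,
}

def gate(functions):
    """Return (function, metric, value) for every threshold violation.

    functions: mapping of function name -> dict of exported metrics.
    Metrics absent from a function's dict are simply skipped.
    """
    violations = []
    for name, metrics in functions.items():
        for metric, limit in THRESHOLDS.items():
            value = metrics.get(metric)
            if value is not None and value > limit:
                violations.append((name, metric, value))
    return violations
```

In a CI pipeline the script would exit nonzero when `gate(...)` returns a non-empty list, blocking the merge until the flagged functions are inspected or the violation is waived.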

Limitations & Future Work

  • Recall trade‑off – The emphasis on precision leaves many faulty functions undetected; future work could explore hybrid approaches that combine thresholds with lightweight ML classifiers to boost recall.
  • Metric set dependency – Results rely on the specific metrics extracted by Coverity/Understand; other tools may produce different values, requiring re‑validation.
  • Domain specificity – While the method transferred across three firmware projects, its applicability to vastly different codebases (e.g., high‑level applications) remains to be tested.
  • Dynamic behavior ignored – Only static code attributes were considered; incorporating runtime profiling could refine the thresholds further.

Bottom line: This research offers a pragmatic, statistically sound pathway for developers to embed fault‑prediction directly into their firmware quality‑assurance workflow—delivering the interpretability and compliance needed for safety‑critical software without the overhead of opaque AI models.

Authors

  • Marco De Luca
  • Domenico Amalfitano
  • Anna Rita Fasolino
  • Porfirio Tramontana

Paper Information

  • arXiv ID: 2602.06831v1
  • Categories: cs.SE
  • Published: February 6, 2026