[Paper] Systematic Evaluation of Black-Box Checking for Fast Bug Detection

Published: December 8, 2025
4 min read
Source: arXiv (2512.07434v1)

Overview

The paper presents the first large‑scale, systematic study of Black‑Box Checking (BBC), a technique that interleaves active automata learning, model‑based testing, and model checking, verifying every intermediate hypothesis model rather than only the final one. Evaluating BBC on 77 real‑world protocol and controller benchmarks, the authors show that it can uncover safety violations dramatically faster than traditional model‑based testing pipelines, which apply model checking only after the full model has been learned.

Key Contributions

  • Comprehensive empirical evaluation of BBC on a diverse set of 77 benchmark systems with known safety properties.
  • Quantitative evidence that BBC needs only ~3 % of the queries required by a “learn‑then‑check” approach when the full model is learnable.
  • Demonstration that BBC remains effective even when the full model cannot be learned, detecting 94 % of safety violations in the challenging RERS 2019 industrial LTL suite.
  • Comparison with state‑of‑the‑art MBT algorithms, showing BBC’s superior ability to expose deep, hard‑to‑find bugs.
  • Open‑source implementation and a reproducible experimental setup for the community.

Methodology

  1. Benchmark Selection – 77 systems drawn from real network protocol implementations and embedded controllers, each equipped with a set of safety properties expressed in LTL.
  2. Black‑Box Checking Loop
    • Active automata learning (e.g., L* or its variants) builds a hypothesis model from input/output queries.
    • Model checking is run immediately on each hypothesis against the safety specifications.
    • If a counterexample is found, it is turned into a concrete test case that is executed on the actual implementation, exposing a bug.
    • Otherwise, the learning algorithm refines the hypothesis and the loop repeats.
  3. Baselines – Two reference pipelines:
    • Learn‑then‑check: learn the full model first, then run model checking once.
    • Standard MBT: model‑based testing without systematic model checking on intermediate hypotheses.
  4. Metrics – Number of queries (learning + testing), time to first bug detection, and bug‑coverage (percentage of safety violations discovered).
  5. Tooling – The authors integrated existing learning libraries (LearnLib), model checkers (NuSMV/Spot), and test harnesses, releasing the whole stack as open source.
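The checking loop in step 2 can be sketched as a toy Python program. Everything here is illustrative rather than taken from the paper: the two-symbol Mealy machines, the hard-coded hypothesis sequence standing in for a real L* learner, and the `err` output marking a safety violation are all assumptions made for the sketch.

```python
from collections import deque

ALPHABET = ["a", "b"]

# Toy Mealy machine: {(state, input): (next_state, output)}. The SUT is
# treated as a black box; the bug ('err') is only reachable via "a, a, b".
SUT = {
    (0, "a"): (1, "ok"), (0, "b"): (0, "ok"),
    (1, "a"): (2, "ok"), (1, "b"): (0, "ok"),
    (2, "a"): (2, "ok"), (2, "b"): (3, "err"),
    (3, "a"): (3, "err"), (3, "b"): (3, "err"),
}

def run_on(machine, word):
    """Execute an input word from the initial state; return the outputs."""
    state, outputs = 0, []
    for sym in word:
        state, out = machine[(state, sym)]
        outputs.append(out)
    return outputs

def model_check(hypothesis):
    """BFS for a shortest input word whose output violates safety ('err')."""
    seen, queue = {0}, deque([(0, [])])
    while queue:
        state, word = queue.popleft()
        for sym in ALPHABET:
            nxt, out = hypothesis[(state, sym)]
            if out == "err":
                return word + [sym]        # abstract counterexample
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + [sym]))
    return None                            # hypothesis satisfies the property

# Stand-ins for successive learner hypotheses (a real run would use L*):
# first an over-generalised one-state model, then the fully learned SUT.
H1 = {(0, "a"): (0, "ok"), (0, "b"): (0, "ok")}
for i, hyp in enumerate([H1, SUT], start=1):
    cex = model_check(hyp)
    if cex is None:
        print(f"hypothesis {i}: no violation, refine and continue learning")
    elif "err" in run_on(SUT, cex):        # replay on the real implementation
        print(f"hypothesis {i}: bug confirmed by input {cex}")
        break
```

The key property the sketch illustrates is that the counterexample from the model checker is only *abstract* until it is replayed on the real system: if the SUT does not reproduce the violation, the trace instead becomes a refinement input for the learner.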

Results & Findings

| Scenario | Queries Needed (BBC vs. Learn‑then‑Check) | Bug Coverage | Notable Observation |
|---|---|---|---|
| Full model learnable | 3.4 % of queries | 100 % (all known violations) | BBC finds bugs early, often after just a few learning iterations. |
| Full model not learnable | ~5‑10 % of queries | 94 % of safety violations (RERS 2019) | Even incomplete hypotheses still expose deep bugs. |
| Comparison with MBT | 5‑15× fewer queries | MBT achieves 30‑70 % lower coverage (depending on benchmark) | BBC’s systematic checking of intermediate models is the key advantage. |

The experiments also revealed that BBC excels at discovering deep bugs – violations that require long input sequences or intricate state interactions – which are typically missed by conventional MBT approaches.

Practical Implications

  • Faster QA cycles: Developers can detect protocol or controller bugs after a fraction of the test effort, shortening regression testing and release timelines.
  • Reduced test generation cost: Since BBC needs far fewer queries, the computational and time overhead of generating exhaustive test suites drops dramatically.
  • Early feedback for developers: Bugs surface during the learning phase, allowing engineers to pinpoint problematic behaviours before the system is fully understood.
  • Applicability to legacy black‑box systems: Even when the exact internal model cannot be learned (e.g., proprietary firmware), BBC still yields high bug‑coverage, making it attractive for security audits and compliance testing.
  • Integration into CI pipelines: The open‑source implementation can be wrapped as a step in continuous integration, automatically checking new builds against safety specs with minimal manual effort.
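As a sketch of the CI idea above, the pipeline step could be a small gate script that fails the build whenever the checker reports a violation. The `bbc-check` command and its flags are hypothetical placeholders, not the paper's actual CLI; only the exit-code convention is assumed.

```python
import subprocess
import sys

def ci_gate(cmd):
    """Run a checker command; return its exit code so CI fails on violations."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the counterexample trace in the CI log before failing.
        print("safety check failed:\n" + result.stdout + result.stderr,
              file=sys.stderr)
    return result.returncode

# In the pipeline step (command name and flags are illustrative):
#   sys.exit(ci_gate(["bbc-check", "--spec", "safety.ltl", "--sut", "./build/app"]))
```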

Limitations & Future Work

  • Scalability to extremely large state spaces: While BBC reduces query count, the underlying model checking step can still become a bottleneck for systems with millions of states.
  • Dependence on quality of specifications: The approach assumes accurate safety properties; incomplete or overly permissive specs may hide bugs.
  • Learning algorithm sensitivity: The effectiveness hinges on the chosen learning algorithm; exploring more robust or probabilistic learners could improve results on non‑learnable models.
  • Extension to liveness properties: The study focused on safety; adapting BBC to handle liveness or performance constraints remains an open challenge.
  • Industrial adoption studies: Future work includes case studies with large software vendors to assess integration overhead and real‑world ROI.

Authors

  • Bram Pellen
  • María Belén Rodríguez
  • Frits Vaandrager
  • Petra van den Bos

Paper Information

  • arXiv ID: 2512.07434v1
  • Categories: cs.SE, cs.FL
  • Published: December 8, 2025