[Paper] FLIMs: Fault Localization Interference Mutants, Definition, Recognition and Mitigation

Published: November 28, 2025 at 11:00 AM EST
4 min read

Source: arXiv - 2511.23302v1

Overview

The paper “FLIMs: Fault Localization Interference Mutants, Definition, Recognition and Mitigation” tackles a long‑standing pain point in mutation‑based fault localization (MBFL): interference mutants, i.e., mutants planted in correct code that the test suite nonetheless treats like real faults, dragging suspicion onto non‑faulty lines. By formally defining these Fault Localization Interference Mutants (FLIMs) and showing how to recognize and neutralize them with large language models (LLMs), the authors deliver a new MBFL framework (MBFL‑FLIM) that markedly improves fault‑localization accuracy on real‑world Java projects.
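
To make the interference concrete, here is a small hypothetical Java example, invented for this summary and not taken from the paper: a real fault in one method makes a test fail, and a mutant planted on a perfectly correct line elsewhere is "killed" by that same failing test, pulling MBFL suspicion onto the wrong line.

```java
// Hypothetical illustration of a FLIM; all names are invented for this summary.
class Pricing {
    double discountRate(int qty) {
        // REAL FAULT: should be qty >= 10, so an order of exactly 10
        // items wrongly gets no discount.
        return qty > 10 ? 0.10 : 0.0;
    }

    double checkout(int qty, double unitPrice) {
        double subtotal = qty * unitPrice;  // correct line
        // MUTANT m1 rewrites the line above to: qty + unitPrice
        return subtotal * (1 - discountRate(qty));
    }
}
// A test asserting checkout(10, 5.0) == 45.0 fails because of the real
// fault (the program returns 50.0). Under m1 the output changes again
// (to 15.0), so the failing test also "kills" m1, and MBFL raises the
// suspiciousness of the correct subtotal line: m1 is a FLIM.
```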

Key Contributions

  • Formal definition of FLIMs using the RIPR (Reachability‑Infection‑Propagation‑Revealability) model, identifying four concrete interference causes.
  • LLM‑driven semantic detection of FLIMs, augmented with fine‑tuning and a confidence‑estimation layer to stabilize noisy LLM outputs.
  • Score‑adjustment mitigation: a systematic way to downgrade the suspiciousness of FLIM‑related mutants while preserving genuine fault signals.
  • MBFL‑FLIM framework that plugs FLIM recognition/mitigation into existing MBFL pipelines with minimal overhead.
  • Extensive empirical validation on the Defects4J benchmark (395 program versions) showing a 44‑fault improvement in Top‑1 ranking over state‑of‑the‑art SBFL, MBFL, dynamic‑feature, and LLM‑based fault localization techniques.
  • Ablation studies confirming the separate value of fine‑tuning and confidence estimation, plus experiments on multi‑fault scenarios.

Methodology

  1. RIPR‑based analysis – The authors map each mutant’s lifecycle (Reachability → Infection → Propagation → Revealability) and pinpoint where a mutant of correct code can masquerade as a real fault (e.g., by infecting the same program state as the true bug).
  2. Semantic FLIM recognition
    • Generate natural‑language descriptions of mutant behavior (e.g., “changes condition X to Y”).
    • Feed these descriptions to a suite of LLMs (GPT‑4, Claude, Llama 2, etc.).
    • Fine‑tune the LLMs on a curated set of known FLIM and non‑FLIM examples, teaching the model to classify interference.
    • Apply a confidence estimator (Monte‑Carlo dropout + calibration) to filter out low‑confidence predictions before they influence the ranking.
  3. Mitigation via suspiciousness re‑ranking – For mutants flagged as FLIMs, the framework reduces their suspiciousness scores (e.g., by a learned attenuation factor) before aggregating the MBFL ranking; a sketch of this recognition‑and‑mitigation flow follows this list.
  4. Integration – The FLIM detection/mitigation steps are inserted after mutant generation and test execution but before the final ranking, making MBFL‑FLIM a drop‑in enhancement for any MBFL tool.
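
A minimal sketch of how steps 2–3 could fit together, assuming a hypothetical FlimClassifier wrapper around the LLM. The paper's actual prompts, fine‑tuned models, and learned attenuation factor are not reproduced here; repeated stochastic queries stand in for Monte‑Carlo‑dropout confidence.

```java
// Illustrative sketch, not the paper's implementation.
import java.util.List;

record Mutant(String id, String description, double suspiciousness) {}

interface FlimClassifier {
    // One stochastic LLM query: does this description look like a FLIM?
    boolean looksLikeFlim(String mutantDescription);
}

class MbflFlimPipeline {
    private static final int SAMPLES = 10;            // queries per mutant
    private static final double MIN_CONFIDENCE = 0.8; // keep only stable verdicts
    private static final double ATTENUATION = 0.5;    // illustrative, not learned

    private final FlimClassifier llm;

    MbflFlimPipeline(FlimClassifier llm) {
        this.llm = llm;
    }

    // Down-weight mutants confidently flagged as FLIMs, leaving
    // low-confidence cases untouched so genuine fault signal survives.
    List<Mutant> mitigate(List<Mutant> mutants) {
        return mutants.stream().map(m -> {
            int votes = 0;
            for (int i = 0; i < SAMPLES; i++) {
                if (llm.looksLikeFlim(m.description())) votes++;
            }
            double confidence = (double) votes / SAMPLES;
            return confidence >= MIN_CONFIDENCE
                    ? new Mutant(m.id(), m.description(),
                                 m.suspiciousness() * ATTENUATION)
                    : m;
        }).toList();
    }
}
```

The design point worth noting is that mitigation re‑ranks rather than deletes: an attenuated mutant can still contribute to localization, which is how the framework preserves genuine fault signals.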

Results & Findings

| Metric | Baseline MBFL | MBFL‑FLIM (best LLM) | Improvement |
| --- | --- | --- | --- |
| Top‑1 fault localization | 112 / 395 | 156 / 395 | +44 faults |
| Top‑5 accuracy | 210 / 395 | 258 / 395 | +48 |
| Multi‑fault scenario (Top‑1) | 68 / 120 | 97 / 120 | +29 |
| Execution overhead | | ~1.2× baseline (mostly LLM inference) | |
  • Statistical significance: Paired Wilcoxon tests confirm the gains are not due to chance (p < 0.01); a small reproduction sketch follows this list.
  • Ablation: Removing fine‑tuning drops Top‑1 improvement to +22 faults; dropping confidence estimation adds ~15 % noise, hurting ranking stability.
  • LLM comparison: GPT‑4 and Claude‑2 performed best; smaller open‑source models needed more fine‑tuning to approach comparable results.
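
The per‑version data are not included in this summary, so the snippet below only shows how such a paired comparison could be re‑run with Apache Commons Math (org.apache.commons:commons-math3); the two arrays are placeholder ranks, not the paper's results.

```java
// Sketch of a paired Wilcoxon signed-rank test; data are placeholders.
import org.apache.commons.math3.stat.inference.WilcoxonSignedRankTest;

class SignificanceCheck {
    public static void main(String[] args) {
        // Per-version localization ranks (illustrative; lower is better).
        double[] baselineMbfl = { 3, 7, 2, 12, 5, 9, 2, 8 };
        double[] mbflFlim     = { 1, 4, 1,  6, 2, 7, 1, 5 };

        WilcoxonSignedRankTest test = new WilcoxonSignedRankTest();
        double p = test.wilcoxonSignedRankTest(baselineMbfl, mbflFlim, false);
        System.out.printf("Wilcoxon signed-rank p-value: %.4f%n", p);
        // The paper reports p < 0.01 across Defects4J versions.
    }
}
```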

Practical Implications

  • Sharper debugging tools – IDE plugins or CI‑integrated fault locators can adopt MBFL‑FLIM to surface the right suspicious lines, cutting down the time developers spend chasing false leads.
  • Cost‑effective mutation testing – Since FLIM mitigation only tweaks scores, existing mutation testing pipelines (e.g., PIT, Major) can be upgraded without re‑engineering the mutant generation phase.
  • Better multi‑fault handling – In complex services where several bugs coexist, MBFL‑FLIM maintains higher precision, making it suitable for large microservice codebases.
  • LLM‑augmented static analysis – The paper demonstrates a concrete, reproducible pattern for leveraging LLMs to reason about semantic mutant behavior, opening doors for other tasks like automated patch validation or test‑case prioritization.

Limitations & Future Work

  • LLM dependence – The approach hinges on access to powerful LLM APIs; latency and cost could be prohibitive for very large projects or on‑premise environments.
  • Language scope – Experiments are limited to Java (Defects4J). Porting the pipeline to other ecosystems (JavaScript, Python, C++) may require new fine‑tuning data and RIPR adaptations.
  • Residual interference – Some FLIMs remain undetected, especially those whose semantics are subtle or involve complex data‑flow, suggesting room for richer program‑analysis features.
  • Future directions proposed by the authors include:
    1. Building a language‑agnostic FLIM taxonomy.
    2. Exploring lightweight embedding‑based classifiers as a cheaper alternative to full LLM inference.
    3. Integrating FLIM mitigation with other fault‑localization paradigms (e.g., spectrum‑based or deep‑learning‑based).

Authors

  • Hengyuan Liu
  • Zheng Li
  • Donghua Wang
  • Yankai Wu
  • Xiang Chen
  • Yong Liu

Paper Information

  • arXiv ID: 2511.23302v1
  • Categories: cs.SE
  • Published: November 28, 2025