[Paper] FLIMs: Fault Localization Interference Mutants, Definition, Recognition and Mitigation

Published: November 28, 2025 at 11:00 AM EST
4 min read

Source: arXiv - 2511.23302v1

Overview

The paper “FLIMs: Fault Localization Interference Mutants, Definition, Recognition and Mitigation” tackles a long‑standing pain point in mutation‑based fault localization (MBFL): interference mutants, i.e., mutants planted in correct code that the test suite nonetheless treats like real faults, dragging suspicion onto non‑faulty lines. By formally defining these Fault Localization Interference Mutants (FLIMs) and showing how to recognize and neutralize them with large language models (LLMs), the authors deliver a new MBFL framework (MBFL‑FLIM) that markedly improves fault‑localization accuracy on real‑world Java projects.
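
To make the interference concrete, here is a small hypothetical Java example, invented for this summary and not taken from the paper: a real fault in one method makes a test fail, and a mutant planted on a perfectly correct line elsewhere is "killed" by that same failing test, pulling MBFL suspicion onto the wrong line.

```java
// Hypothetical illustration of a FLIM; all names are invented for this summary.
class Pricing {
    double discountRate(int qty) {
        // REAL FAULT: should be qty >= 10, so an order of exactly 10
        // items wrongly gets no discount.
        return qty > 10 ? 0.10 : 0.0;
    }

    double checkout(int qty, double unitPrice) {
        double subtotal = qty * unitPrice;  // correct line
        // MUTANT m1 rewrites the line above to: qty + unitPrice
        return subtotal * (1 - discountRate(qty));
    }
}
// A test asserting checkout(10, 5.0) == 45.0 fails because of the real
// fault (the program returns 50.0). Under m1 the output changes again
// (to 15.0), so the failing test also "kills" m1, and MBFL raises the
// suspiciousness of the correct subtotal line: m1 is a FLIM.
```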

Key Contributions

  • Formal definition of FLIMs using the RIPR (Reachability‑Infection‑Propagation‑Revealability) model, identifying four concrete interference causes.
  • LLM‑driven semantic detection of FLIMs, augmented with fine‑tuning and a confidence‑estimation layer to stabilize noisy LLM outputs.
  • Score‑adjustment mitigation: a systematic way to downgrade the suspiciousness of FLIM‑related mutants while preserving genuine fault signals.
  • MBFL‑FLIM framework that plugs FLIM recognition/mitigation into existing MBFL pipelines with minimal overhead.
  • Extensive empirical validation on the Defects4J benchmark (395 program versions) showing a 44‑fault improvement in Top‑1 ranking over state‑of‑the‑art SBFL, MBFL, dynamic‑feature, and LLM‑based fault localization techniques.
  • Ablation studies confirming the separate value of fine‑tuning and confidence estimation, plus experiments on multi‑fault scenarios.

Methodology

  1. RIPR‑based analysis – The authors map each mutant’s lifecycle (Reachability → Infection → Propagation → Revealability) and pinpoint where a mutant of correct code can masquerade as a real fault (e.g., by infecting the same program state as the true bug).
  2. Semantic FLIM recognition
    • Generate natural‑language descriptions of mutant behavior (e.g., “changes condition X to Y”).
    • Feed these descriptions to a suite of LLMs (GPT‑4, Claude, Llama 2, etc.).
    • Fine‑tune the LLMs on a curated set of known FLIM and non‑FLIM examples, teaching the model to classify interference.
    • Apply a confidence estimator (Monte‑Carlo dropout + calibration) to filter out low‑confidence predictions before they influence the ranking.
  3. Mitigation via suspiciousness re‑ranking – For mutants flagged as FLIMs, the framework reduces their suspiciousness scores (e.g., by a learned attenuation factor) before aggregating the MBFL ranking; a sketch of this recognition‑and‑mitigation flow follows this list.
  4. Integration – The FLIM detection/mitigation steps are inserted after mutant generation and test execution but before the final ranking, making MBFL‑FLIM a drop‑in enhancement for any MBFL tool.
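
A minimal sketch of how steps 2–3 could fit together, assuming a hypothetical FlimClassifier wrapper around the LLM. The paper's actual prompts, fine‑tuned models, and learned attenuation factor are not reproduced here; repeated stochastic queries stand in for Monte‑Carlo‑dropout confidence.

```java
// Illustrative sketch, not the paper's implementation.
import java.util.List;

record Mutant(String id, String description, double suspiciousness) {}

interface FlimClassifier {
    // One stochastic LLM query: does this description look like a FLIM?
    boolean looksLikeFlim(String mutantDescription);
}

class MbflFlimPipeline {
    private static final int SAMPLES = 10;            // queries per mutant
    private static final double MIN_CONFIDENCE = 0.8; // keep only stable verdicts
    private static final double ATTENUATION = 0.5;    // illustrative, not learned

    private final FlimClassifier llm;

    MbflFlimPipeline(FlimClassifier llm) {
        this.llm = llm;
    }

    // Down-weight mutants confidently flagged as FLIMs, leaving
    // low-confidence cases untouched so genuine fault signal survives.
    List<Mutant> mitigate(List<Mutant> mutants) {
        return mutants.stream().map(m -> {
            int votes = 0;
            for (int i = 0; i < SAMPLES; i++) {
                if (llm.looksLikeFlim(m.description())) votes++;
            }
            double confidence = (double) votes / SAMPLES;
            return confidence >= MIN_CONFIDENCE
                    ? new Mutant(m.id(), m.description(),
                                 m.suspiciousness() * ATTENUATION)
                    : m;
        }).toList();
    }
}
```

The design point worth noting is that mitigation re‑ranks rather than deletes: an attenuated mutant can still contribute to localization, which is how the framework preserves genuine fault signals.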

Results & Findings

| Metric | Baseline MBFL | MBFL‑FLIM (best LLM) | Improvement |
| --- | --- | --- | --- |
| Top‑1 fault localization | 112 / 395 | 156 / 395 | +44 faults |
| Top‑5 accuracy | 210 / 395 | 258 / 395 | +48 |
| Multi‑fault scenario (Top‑1) | 68 / 120 | 97 / 120 | +29 |
| Execution overhead | | ~1.2× baseline (mostly LLM inference) | |
  • Statistical significance: Paired Wilcoxon tests confirm the gains are not due to chance (p < 0.01); a small reproduction sketch follows this list.
  • Ablation: Removing fine‑tuning drops Top‑1 improvement to +22 faults; dropping confidence estimation adds ~15 % noise, hurting ranking stability.
  • LLM comparison: GPT‑4 and Claude‑2 performed best; smaller open‑source models needed more fine‑tuning to approach comparable results.
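
The per‑version data are not included in this summary, so the snippet below only shows how such a paired comparison could be re‑run with Apache Commons Math (org.apache.commons:commons-math3); the two arrays are placeholder ranks, not the paper's results.

```java
// Sketch of a paired Wilcoxon signed-rank test; data are placeholders.
import org.apache.commons.math3.stat.inference.WilcoxonSignedRankTest;

class SignificanceCheck {
    public static void main(String[] args) {
        // Per-version localization ranks (illustrative; lower is better).
        double[] baselineMbfl = { 3, 7, 2, 12, 5, 9, 2, 8 };
        double[] mbflFlim     = { 1, 4, 1,  6, 2, 7, 1, 5 };

        WilcoxonSignedRankTest test = new WilcoxonSignedRankTest();
        double p = test.wilcoxonSignedRankTest(baselineMbfl, mbflFlim, false);
        System.out.printf("Wilcoxon signed-rank p-value: %.4f%n", p);
        // The paper reports p < 0.01 across Defects4J versions.
    }
}
```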

Practical Implications

  • Sharper debugging tools – IDE plugins or CI‑integrated fault locators can adopt MBFL‑FLIM to surface the right suspicious lines, cutting down the time developers spend chasing false leads.
  • Cost‑effective mutation testing – Since FLIM mitigation only tweaks scores, existing mutation testing pipelines (e.g., PIT, Major) can be upgraded without re‑engineering the mutant generation phase.
  • Better multi‑fault handling – In complex services where several bugs coexist, MBFL‑FLIM maintains higher precision, making it suitable for large microservice codebases.
  • LLM‑augmented static analysis – The paper demonstrates a concrete, reproducible pattern for leveraging LLMs to reason about semantic mutant behavior, opening doors for other tasks like automated patch validation or test‑case prioritization.

Limitations & Future Work

  • LLM dependence – The approach hinges on access to powerful LLM APIs; latency and cost could be prohibitive for very large projects or on‑premise environments.
  • Language scope – Experiments are limited to Java (Defects4J). Porting the pipeline to other ecosystems (JavaScript, Python, C++) may require new fine‑tuning data and RIPR adaptations.
  • Residual interference – Some FLIMs remain undetected, especially those whose semantics are subtle or involve complex data‑flow, suggesting room for richer program‑analysis features.
  • Future directions proposed by the authors include:
    1. Building a language‑agnostic FLIM taxonomy.
    2. Exploring lightweight embedding‑based classifiers as a cheaper alternative to full LLM inference.
    3. Integrating FLIM mitigation with other fault‑localization paradigms (e.g., spectrum‑based or deep‑learning‑based).

Authors

  • Hengyuan Liu
  • Zheng Li
  • Donghua Wang
  • Yankai Wu
  • Xiang Chen
  • Yong Liu

Paper Information

  • arXiv ID: 2511.23302v1
  • Categories: cs.SE
  • Published: November 28, 2025